
How to count characters in a Unicode string in C

nasanasas 2021. 1. 8. 08:19



Let's say I have a string:

char theString[] = "你们好āa";

With my encoding being UTF-8, this string is 12 bytes long (the three hanzi characters are three bytes each, the Latin character with the macron is two bytes, and 'a' is one byte):

strlen(theString) == 12

How can I count the number of characters? How can I do the equivalent of subscripting, so that:

theString[3] == "好"

How can I slice and cat such strings?


You only count the characters that do not have the top two bits set to 10 (i.e., everything below 0x80 or greater than 0xbf).

That's because all bytes with the top two bits set to 10 are UTF-8 continuation bytes.

See here for a description of the encoding and how strlen can work on a UTF-8 string.

For slicing and dicing UTF-8 strings, you basically have to follow the same rules: any byte starting with a 0 bit or a 11 sequence is the start of a UTF-8 code point; all the rest are continuation bytes.

If you don't want to use a third-party library, your best bet is to simply provide functions along the lines of:

utf8left (char *destbuff, char *srcbuff, size_t sz);
utf8mid  (char *destbuff, char *srcbuff, size_t pos, size_t sz);
utf8rest (char *destbuff, char *srcbuff, size_t pos);

to get, respectively:

  • the leftmost sz UTF-8 bytes of a string;
  • the sz UTF-8 bytes of a string, starting at pos;
  • the rest of the UTF-8 bytes of a string, starting at pos.

That would be a decent set of building blocks, able to manipulate the strings sufficiently for your purposes.
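As an illustration, here is a minimal, hedged sketch of the first of those helpers, interpreting sz as a count of UTF-8 characters and assuming destbuff has room for the copied bytes plus a terminator (the other two would follow the same scanning pattern):

```c
#include <stddef.h>

/* Copy up to sz UTF-8 characters from srcbuff to destbuff.
   A byte starts a new character unless its top two bits are 10. */
void utf8left(char *destbuff, const char *srcbuff, size_t sz)
{
    size_t i = 0;
    while (srcbuff[i]) {
        if ((srcbuff[i] & 0xC0) != 0x80) {  /* start of a character */
            if (sz == 0)
                break;
            --sz;
        }
        destbuff[i] = srcbuff[i];
        ++i;
    }
    destbuff[i] = '\0';
}
```

Copying the first two characters of "你们好āa" with this sketch yields "你们" (six bytes).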


The easiest way is to use a library like ICU.


Try this for size:

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

// returns the number of utf8 code points in the buffer at s
size_t utf8len(char *s)
{
    size_t len = 0;
    for (; *s; ++s) if ((*s & 0xC0) != 0x80) ++len;
    return len;
}

// returns a pointer to the beginning of the pos'th utf8 codepoint
// in the buffer at s
char *utf8index(char *s, size_t pos)
{    
    ++pos;
    for (; *s; ++s) {
        if ((*s & 0xC0) != 0x80) --pos;
        if (pos == 0) return s;
    }
    return NULL;
}

// converts codepoint indexes start and end to byte offsets in the buffer at s
void utf8slice(char *s, ssize_t *start, ssize_t *end)
{
    char *p = utf8index(s, *start);
    *start = p ? p - s : -1;
    p = utf8index(s, *end);
    *end = p ? p - s : -1;
}

// appends the utf8 string at src to dest
char *utf8cat(char *dest, char *src)
{
    return strcat(dest, src);
}

// test program
int main(int argc, char **argv)
{
    // slurp all of stdin to p, with length len
    char *p = malloc(0);
    size_t len = 0;
    while (true) {
        p = realloc(p, len + 0x10000);
        ssize_t cnt = read(STDIN_FILENO, p + len, 0x10000);
        if (cnt == -1) {
            perror("read");
            abort();
        } else if (cnt == 0) {
            break;
        } else {
            len += cnt;
        }
    }

    // do some demo operations
    printf("utf8len=%zu\n", utf8len(p));
    ssize_t start = 2, end = 3;
    utf8slice(p, &start, &end);
    printf("utf8slice[2:3]=%.*s\n", (int)(end - start), p + start);
    start = 3; end = 4;
    utf8slice(p, &start, &end);
    printf("utf8slice[3:4]=%.*s\n", (int)(end - start), p + start);
    return 0;
}

Sample run:

matt@stanley:~/Desktop$ echo -n 你们好āa | ./utf8ops 
utf8len=5
utf8slice[2:3]=好
utf8slice[3:4]=ā

There is one error in your example: it should be theString[2] == "好", not theString[3].


Depending on your notion of "character", this question can get more or less involved.

First off, you should transform your byte string into a string of unicode codepoints. You can do this with iconv() or ICU, though if this is the only thing you do, iconv() is a lot easier, and it's part of POSIX.

Your string of unicode codepoints could be something like a null-terminated uint32_t[], or if you have C1x, an array of char32_t. The size of that array (i.e. its number of elements, not its size in bytes) is the number of codepoints (plus the terminator), and that should give you a very good start.
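As a hedged sketch of that first step (the function name and buffer handling here are my own; it assumes glibc's iconv(), which accepts "UTF-32LE" as an encoding name, and a little-endian host so the output bytes can be read directly as uint32_t values):

```c
#include <iconv.h>
#include <stdint.h>
#include <string.h>

/* Convert a NUL-terminated UTF-8 string into an array of code points.
   Returns the number of code points written, or (size_t)-1 on error. */
size_t to_codepoints(const char *utf8, uint32_t *out, size_t outcap)
{
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t)-1)
        return (size_t)-1;

    char *in = (char *)utf8;            /* iconv wants non-const */
    size_t inleft = strlen(utf8);
    char *outp = (char *)out;
    size_t outleft = outcap * sizeof(uint32_t);

    size_t rc = iconv(cd, &in, &inleft, &outp, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1)
        return (size_t)-1;
    return (size_t)(outp - (char *)out) / sizeof(uint32_t);
}
```

For "你āa" this yields three code points: U+4F60, U+0101, and U+0061.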

However, the notion of a "printable character" is fairly complex, and you may prefer to count graphemes rather than codepoints - for instance, an a with an accent ^ can be expressed as two unicode codepoints, or as a combined legacy codepoint â - both are valid, and both are required by the unicode standard to be treated equally. There is a process called "normalization" which turns your string into a definite version, but there are many graphemes which are not expressible as a single codepoint, and in general there is no way around a proper library that understands this and counts graphemes for you.

That said, it's up to you to decide how complex your scripts are and how thoroughly you want to treat them. Transforming into unicode codepoints is a must, everything beyond that is at your discretion.

Don't hesitate to ask questions about ICU if you decide that you need it, but feel free to explore the vastly simpler iconv() first.


In the real world, theString[3]=foo; is not a meaningful operation. Why would you ever want to replace a character at a particular position in the string with a different character? There's certainly no natural-language-text processing task for which this operation is meaningful.

Counting characters is also unlikely to be meaningful. How many characters (for your idea of "character") are there in "á"? How about "á"? Now how about "གི"? If you need this information for implementing some sort of text editing, you're going to have to deal with these hard questions, or just use an existing library/gui toolkit. I would recommend the latter unless you're an expert on world scripts and languages and think you can do better.

For all other purposes, strlen tells you exactly the piece of information that's actually useful: how much storage space a string takes. This is what's needed for combining and separating strings. If all you want to do is combine strings or separate them at a particular delimiter, snprintf (or strcat if you insist...) and strstr are all you need.
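A small sketch of that point: because UTF-8 is self-synchronizing, a plain byte-level strstr on valid UTF-8 always matches on character boundaries, so splitting at a delimiter needs no character counting (split_first is a hypothetical helper name, not a standard function):

```c
#include <stddef.h>
#include <string.h>

/* Split s in place at the first occurrence of delim.
   Returns a pointer to the text after the delimiter, or NULL. */
char *split_first(char *s, const char *delim)
{
    char *sep = strstr(s, delim);   /* byte search is safe on UTF-8 */
    if (sep == NULL)
        return NULL;
    *sep = '\0';
    return sep + strlen(delim);
}
```

Splitting "你好,world" at "," this way gives "你好" and "world" without ever looking at character boundaries.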

If you want to perform higher-level natural-language-text operations, like capitalization, line breaking, etc. or even higher-level operations like pluralization, tense changes, etc. then you'll need either a library like ICU or respectively something much higher-level and linguistically-capable (and specific to the language(s) you're working with).

Again, most programs do not have any use for this sort of thing and just need to assemble and parse text without any considerations to natural language.


size_t count_utf8_chars(const char *s)
{
    size_t i = 0, j = 0;
    while (s[i]) {
        if ((s[i] & 0xC0) != 0x80)
            j++;
        i++;
    }
    return j;
}

This will count characters in a UTF-8 String... (Found in this article: Even faster UTF-8 character counting)

However I'm still stumped on slicing and concatenating?!?


In general we should use a different data type for unicode characters.

For example, you can use the wide char data type

wchar_t theString[] = L"你们好āa";

Note the L modifier that tells that the string is composed of wide chars.

The length of that string can be calculated using the wcslen function, which behaves like strlen.


One thing that's not clear from the above answers is why it's not simple. Each character is encoded in one way or another - it doesn't have to be UTF-8, for example - and each character may have multiple encodings, with varying ways to handle combining of accents, etc. The rules are really complicated, and vary by encoding (e.g., utf-8 vs. utf-16).

This question has enormous security concerns, so it is imperative that this be done correctly. Use an OS-supplied library or a well-known third-party library to manipulate unicode strings; don't roll your own.


I did a similar implementation years back, but I do not have the code with me.

For each unicode character, the first byte describes the number of bytes that follow it to construct the character. Based on the first byte you can determine the length of each unicode character.
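That idea can be sketched as a small lookup on the lead byte (strictly, it is the number of leading 1 bits in the first byte that encodes the total sequence length):

```c
#include <stddef.h>

/* Return the total byte length of the UTF-8 sequence whose lead
   byte is b, or 0 if b is a continuation or invalid lead byte. */
size_t utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;   /* 0xxxxxxx: ASCII             */
    if (b < 0xC0) return 0;   /* 10xxxxxx: continuation byte */
    if (b < 0xE0) return 2;   /* 110xxxxx                    */
    if (b < 0xF0) return 3;   /* 1110xxxx                    */
    if (b < 0xF8) return 4;   /* 11110xxx                    */
    return 0;                 /* 0xF8+ invalid in UTF-8      */
}
```

For example, the lead byte of "你" (0xE4) gives 3, and the lead byte of "ā" (0xC4) gives 2.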

I think it's a good UTF8 library: enter link description here


A sequence of code points can constitute a single syllable / letter / character in many non-Western-European languages (e.g., all Indic languages).

So, when you are counting the length or finding a substring (there are definitely use cases for finding substrings - say, playing a hangman game), you need to advance syllable by syllable, not code point by code point.

So the definition of the character/syllable and where you actually break the string into "chunks of syllables" depends upon the nature of the language you are dealing with. For example, the pattern of the syllables in many Indic languages (Hindi, Telugu, Kannada, Malayalam, Nepali, Tamil, Punjabi, etc.) can be any of the following

V  (Vowel in their primary form appearing at the beginning of the word)
C (consonant)
C + V (consonant + vowel in their secondary form)
C + C + V
C + C + C + V

You need to parse the string and look for the above patterns to break the string and to find the substrings.

I do not think it is possible to have a general-purpose method which can magically break strings in the above fashion for any unicode string (or sequence of code points) - the pattern that works for one language may not be applicable for another.

I guess there may be some methods / libraries that can take some definition / configuration parameters as input to break unicode strings into such syllable chunks. Not sure though! I would appreciate it if someone could share how they solved this problem using any commercially available or open source methods.

Reference URL: https://stackoverflow.com/questions/7298059/how-to-count-characters-in-a-unicode-string-in-c
