
Why are C character literals ints instead of chars?

nasanasas 2020. 8. 18. 07:47



In C++, sizeof('a') == sizeof(char) == 1. This makes intuitive sense, since 'a' is a character literal and sizeof(char) == 1 is defined by the standard.

In C, however, sizeof('a') == sizeof(int). That is, it looks as if C character literals are actually integers. Does anyone know why? I can find plenty of mentions of this C quirk but no explanation of why it exists.
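
A minimal way to check this (the exact sizes are platform-dependent; 4 for int is just the typical desktop value):

#include <stdio.h>

int main(void)
{
  /* Compiled as C this typically prints "4 1", because 'a' has type int.
     The same file compiled as C++ prints "1 1", because there 'a' is a char. */
  printf("%zu %zu\n", sizeof 'a', sizeof(char));
  return 0;
}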


Discussion on the same topic:

"보다 구체적으로 통합 프로모션입니다. K & R C에서는 문자 값을 int로 먼저 승격하지 않고는 문자 값을 사용하는 것이 사실상 (?) 불가능했기 때문에 처음에 문자를 상수 int로 만들면 해당 단계가 제거되었습니다. 여전히 여러 문자가 있습니다. 'abcd'와 같은 상수 또는 많은 수가 int에 적합합니다. "


The original question is "why?"

The reason is that the definition of a character literal has evolved and changed over time, while keeping backwards compatibility with existing code.

In the dark days of early C there were no types at all. By the time I first learnt to program in C, types had been introduced, but functions didn't have prototypes to tell the caller what the argument types were. Instead it was standardised that everything passed as a parameter would either be the size of an int (this included all pointers) or it would be a double.

This meant that when you were writing a function, all the parameters that weren't doubles were stored on the stack as ints, no matter how you declared them, and the compiler put code into the function to handle this for you.

This made things somewhat inconsistent, so when K&R wrote their famous book, they put in the rule that a character literal would always be promoted to an int in any expression, not just as a function parameter.

When the ANSI committee first standardised C, they changed this rule so that a character literal would simply be an int, since that seemed a simpler way of achieving the same thing.

When C++ was being designed, all functions were required to have full prototypes (this still isn't required in C, although it is universally accepted as good practice). Because of this, it was decided that a character literal could be stored in a char. The benefit of this in C++ is that a function with a char parameter and a function with an int parameter have different signatures. This advantage does not exist in C.

This is why they are different. Evolution...


I don't know the specific reasons why a character literal in C has type int. But in C++ there is a good reason not to do it that way. Consider this:

void print(int);
void print(char);

print('a');

You would expect the call to print to select the second version, the one taking a char. Having character literals be ints would make that impossible. Note that in C++ a literal with more than one character still has type int, although its value is implementation-defined. So 'ab' has type int, while 'a' has type char.


Using gcc on my MacBook, I try:

#include <stdio.h>
#define test(A) do{printf(#A":\t%zu\n",sizeof(A));}while(0)
int main(void){
  test('a');
  test("a");
  test("");
  test(char);
  test(short);
  test(int);
  test(long);
  test((char)0x0);
  test((short)0x0);
  test((int)0x0);
  test((long)0x0);
  return 0;
}

Running it gives:

'a':    4
"a":    2
"":     1
char:   1
short:  2
int:    4
long:   4
(char)0x0:      1
(short)0x0:     2
(int)0x0:       4
(long)0x0:      4

This suggests that a char is 8 bits, as you suspected, but a character literal is an int.


Back when C was being written, the PDP-11's MACRO-11 assembly language had:

MOV #'A, R0      // 8-bit character encoding for 'A' into 16 bit register

This kind of thing's quite common in assembly language - the low 8 bits will hold the character code, other bits cleared to 0. PDP-11 even had:

MOV #"AB, R0     // 16-bit character encoding for 'A' (low byte) and 'B'

This provided a convenient way to load two characters into the low and high bytes of the 16 bit register. You might then write those elsewhere, updating some textual data or screen memory.
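
The same trick is easy to mimic in C; here is a small sketch of packing 'A' and 'B' into one 16-bit value, roughly what MOV #"AB achieved (which byte then sits first in memory depends on endianness):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
  /* Put 'B' in the high byte and 'A' in the low byte of a 16-bit value,
     mirroring the MACRO-11 idiom of loading two character codes at once. */
  uint16_t word = (uint16_t)(('B' << 8) | 'A');
  printf("packed word: 0x%04X\n", (unsigned)word);  /* 0x4241 on ASCII systems */

  /* How those two bytes are laid out in memory depends on endianness. */
  unsigned char *bytes = (unsigned char *)&word;
  printf("in memory: 0x%02X 0x%02X\n", (unsigned)bytes[0], (unsigned)bytes[1]);
  return 0;
}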

So, the idea of characters being promoted to register size is quite normal and desirable. But, let's say you need to get 'A' into a register not as part of the hard-coded opcode, but from somewhere in main memory containing:

address: value
20: 'X'
21: 'A'
22: 'A'
23: 'X'
24: 0
25: 'A'
26: 'A'
27: 0
28: 'A'

If you want to read just an 'A' from this main memory into a register, which one would you read?

  • Some CPUs may only directly support reading a 16 bit value into a 16 bit register, which would mean a read at 20 or 22 would then require the bits from 'X' be cleared out, and depending on the endianness of the CPU one or other would need shifting into the low order byte.

  • Some CPUs may require a memory-aligned read, which means that the lowest address involved must be a multiple of the data size: you might be able to read from addresses 24 and 25, but not 27 and 28.

So, a compiler generating code to get an 'A' into the register may prefer to waste a little extra memory and encode the value as 0 'A' or 'A' 0 - depending on endianness, and also ensuring it is aligned properly (i.e. not at an odd memory address).
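
A loose C sketch of that idea (purely illustrative; load_word is a hypothetical stand-in for an aligned 16-bit memory read):

#include <stdio.h>
#include <stdint.h>

/* Hypothetical helper standing in for an aligned 16-bit memory read. */
static uint16_t load_word(const uint16_t *aligned_addr)
{
  return *aligned_addr;
}

int main(void)
{
  /* 'A' stored padded out to a full, aligned 16-bit word, the way a compiler
     might place a character constant: no masking or shifting needed. */
  static const uint16_t padded_a = 'A';            /* 0x0041: high byte is 0 */
  uint16_t reg = load_word(&padded_a);

  /* A char packed next to other data needs a mask (and possibly a shift,
     depending on endianness) to isolate it. */
  uint16_t packed = (uint16_t)(('X' << 8) | 'A');  /* 'A' shares its word with 'X' */
  uint16_t extracted = packed & 0x00FFu;           /* clear the 'X' bits */

  printf("0x%04X 0x%04X\n", (unsigned)reg, (unsigned)extracted);
  return 0;
}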

My guess is that C simply carried this level of CPU-centric behaviour over, thinking of character constants as occupying register-sized chunks of memory, bearing out the common assessment of C as a "high-level assembler".

(See 6.3.3 on page 6-25 of http://www.dmv.net/dec/pdf/macro.pdf)


I remember reading K&R and seeing a code snippet that would read a character at a time until it hit EOF. Since all characters are valid characters to be in a file/input stream, this means that EOF cannot be any char value. What the code did was to put the read character into an int, then test for EOF, then convert to a char if it wasn't.

I realize this doesn't exactly answer your question, but it would make some sense for the rest of the character literals to be sizeof(int) if the EOF literal was.

int r;
char buffer[1024], *p; // don't use in production - buffer overflow likely
p = buffer;

while ((r = getc(file)) != EOF)
{
  *(p++) = (char) r;
}

I haven't seen a rationale for it (C char literals being int types), but here's something Stroustrup had to say about it (from Design and Evolution 11.2.1 - Fine-Grain Resolution):

In C, the type of a character literal such as 'a' is int. Surprisingly, giving 'a' type char in C++ doesn't cause any compatibility problems. Except for the pathological example sizeof('a'), every construct that can be expressed in both C and C++ gives the same result.

So for the most part, it should cause no problems.


This is the correct behavior, called "integral promotion". It can happen in other cases too (mainly binary operators, if I remember correctly).

EDIT: Just to be sure, I checked my copy of Expert C Programming: Deep Secrets, and I confirmed that a char literal does not start with a type int. It is initially of type char but when it is used in an expression, it is promoted to an int. The following is quoted from the book:

Character literals have type int and they get there by following the rules for promotion from type char. This is too briefly covered in K&R 1, on page 39 where it says:

Every char in an expression is converted into an int....Notice that all float's in an expression are converted to double....Since a function argument is an expression, type conversions also take place when arguments are passed to functions: in particular, char and short become int, float becomes double.
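
A small C illustration of those promotion rules (the sizes in the comments are the typical ones; they are platform-dependent):

#include <stdio.h>

int main(void)
{
  char  c = 'x';
  short s = 1;

  /* A char or short used in an arithmetic expression is promoted to int,
     so the results of c + c and s + s are int-sized. (In modern C, float is
     converted to double only when passed to a variadic or unprototyped
     function, not in every expression as K&R 1 describes.) */
  printf("sizeof c       = %zu\n", sizeof c);        /* typically 1 */
  printf("sizeof (c + c) = %zu\n", sizeof (c + c));  /* typically 4, i.e. sizeof(int) */
  printf("sizeof (s + s) = %zu\n", sizeof (s + s));  /* typically 4, i.e. sizeof(int) */
  return 0;
}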


The historical reason for this is that C, and its predecessor B, were originally developed on various models of DEC PDP minicomputers with various word sizes, which supported 8-bit ASCII but could only perform arithmetic on registers. (Not the PDP-11, however; that came later.) Early versions of C defined int to be the native word size of the machine, and any value smaller than an int needed to be widened to int in order to be passed to or from a function, or used in a bitwise, logical or arithmetic expression, because that was how the underlying hardware worked.

That is also why the integer promotion rules still say that any data type smaller than an int is promoted to int. C implementations are also allowed to use one's-complement math instead of two's-complement for similar historical reasons. The reason that octal character escapes and octal constants are first-class citizens compared to hex is likewise that those early DEC minicomputers had word sizes divisible into three-bit chunks but not four-bit nibbles.


I don't know, but I'm going to guess it was easier to implement it that way and it didn't really matter. It wasn't until C++ when the type could determine which function would get called that it needed to be fixed.


I didn't know this indeed. Before prototypes existed, anything narrower than an int was converted to an int when using it as a function argument. That may be part of the explanation.


This is only tangential to the language spec, but in hardware the CPU usually only has one register size -- 32 bits, let's say -- and so whenever it actually works on a char (by adding, subtracting, or comparing it) there is an implicit conversion to int when it is loaded into the register. The compiler takes care of properly masking and shifting the number after each operation so that if you add, say, 2 to (unsigned char) 254, it'll wrap around to 0 instead of 256, but inside the silicon it is really an int until you save it back to memory.
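
A quick way to see that from C (the masking described above effectively happens when the value is stored back into the char-sized object):

#include <stdio.h>

int main(void)
{
  unsigned char c = 254;

  /* In the expression c + 2, c is promoted to int, so the intermediate
     result really is 256; only when stored back into the 8-bit object is
     it reduced modulo 256, giving 0. */
  int wide = c + 2;
  c = (unsigned char)(c + 2);

  printf("as int: %d, stored back into unsigned char: %d\n", wide, c);
  return 0;
}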

It's sort of an academic point because the language could have specified an 8-bit literal type anyway, but in this case the language spec happens to reflect more closely what the CPU is really doing.

(x86 wonks may note that there is, e.g., a native addh op that adds the short-wide registers in one step, but inside the RISC core this translates to two steps: add the numbers, then extend the sign, like an add/extsh pair on the PowerPC.)

Reference URL: https://stackoverflow.com/questions/433895/why-are-c-character-literals-ints-instead-of-chars
