C++ FAQ Celebrating Twenty-One Years of the C++ FAQ!!!
(Click here for a personal note from Marshall Cline.)
Section 26:
[26.6] I'm sooooo confused. Would you please go over the rules about bytes, chars, and characters one more time?

Here are the rules:

  • The C++ language gives the programmer the impression that memory is laid out as a sequence of something C++ calls "bytes."
  • Each of these things that the C++ language calls a byte has at least 8 bits, but might have more than 8 bits.
  • The C++ language guarantees that a char* (char pointers) can address individual bytes.
  • The C++ language guarantees there are no bits between two bytes. This means every bit in memory is part of a byte. If you grind your way through memory via a char*, you will be able to see every bit.
  • The C++ language guarantees there are no bits that are part of two distinct bytes. This means a change to one byte will never cause a change to a different byte.
  • The C++ language gives you a way to find out how many bits are in a byte in your particular implementation: include the header <climits>, then the actual number of bits per byte will be given by the CHAR_BIT macro.

Let's work an example to illustrate these rules. The PDP-10 has 36-bit words with no hardware facility to address anything within one of those words. That means a pointer can point only at things on a 36-bit boundary: it is not possible for a pointer to point 8 bits to the right of where some other pointer points.

One way to abide by all the above rules is for a PDP-10 C++ compiler to define a "byte" as 36 bits. Another valid approach would be to define a "byte" as 9 bits, and simulate a char* by two words of memory: the first could point to the 36-bit word, the second could be a bit-offset within that word. In that case, the C++ compiler would need to add extra instructions when compiling code using char* pointers. For example, the code generated for *p = 'x' might read the word into a register, then use bit-masks and bit-shifts to change the appropriate 9-bit byte within that word. An int* could still be implemented as a single hardware pointer, since C++ allows sizeof(char*) != sizeof(int*).

Using the same logic, it would also be possible to define a PDP-10 C++ "byte" as 12-bits or 18-bits. However the above technique wouldn't allow us to define a PDP-10 C++ "byte" as 8-bits, since 8*4 is 32, meaning every 4th byte we would skip 4 bits. A more complicated approach could be used for those 4 bits, e.g., by packing nine bytes (of 8-bits each) into two adjacent 36-bit words. The important point here is that memcpy() has to be able to see every bit of memory: there can't be any bits between two adjacent bytes.

Note: one of the popular non-C/C++ approaches on the PDP-10 was to pack 5 bytes (of 7-bits each) into each 36-bit word. However this won't work in C or C++ since 5*7 = 35, meaning using char*s to walk through memory would "skip" a bit every fifth byte (and also because C++ requires bytes to have at least 8 bits).