Comments on the new C Standard

Jonathan Coxhead, 7 Jan 1998

I realise that at this stage substantive revisions are unlikely; however, this is the first time that public comment has been invited, so I feel that it is my duty to express my comments and concerns regarding the proposed C Standard, however irrelevant they may turn out to be.

2 things strike me as remarkable about this proposed international standard. I think they are sufficiently important that I urge the U K delegation to vote NO at the ballot for this Draft Standard.


Formal semantics

The first is the lack of any kind of formal presentation of semantics. There is a substantial body of work on the formal description of languages (including C itself), and formal methods have proved valuable in many practical applications. Yet there is no attempt to provide a formal description at any level. I understand that this is because the committee feels that they lack the expertise to provide such a description; but does this not really represent a misprioritisation of effort?

Those who believe that formal semantics are purely academic exercises are living in a world founded purely on optimism. In a world where computers are used to control ever-increasing aspects of our lives, and where C is used ever-more ubiquitously as a part (and a crucial part) of those systems, it seems close to a kind of institutionalised insanity (or at the very least, an abdication of responsibility by the only group able to take that responsibility) not to insist that the foundation of those systems should be specified to the utmost limit of our competence to do so, where by `us´ I mean `the professional software community´. I am not claiming that C systems should become radically different: the semantics as defined in the Standard should respect existing practice just as much as any other part of the Standard. But surely we should be able to understand much more precisely exactly what we mean by a `C system´.

Going the whole way down a road to formality would not necessarily be useful: I do not claim that enormous tracts of algebra would be helpful for interpreters of the Standard. However, an approach involving the introduction of axioms and specification of the rules of the system would be fairly painless. Euclid's Elements introduces all its concepts in this way, and in comparison, the C Standard falls woefully short. The semantics of the ALGOL 68 report go further in the same direction. (I do not speak of the 2-level grammar used to describe the syntax, which appears to provide more confusion than clarity, but of the paragraphs written in ordinary English entitled `Semantics´.) It seems to me that a modern programming language standard should at least be able to approach the clarity achieved by Euclid; and hopefully, to go beyond. In contrast, the Definitions and Conventions section of the Draft Standard is highly self-referential: rather than serving as a precise introduction to a series of concepts, it establishes relationships between them which accord with one's intuition, once it has been developed, without providing a firm basis for logical reasoning about the terms it defines. After 24 centuries of progress since Euclid, it is hard to accept that this is the best we can do.


long long type

The second remarkable feature is the introduction of long long as a new integer type. This is such a short-sighted solution to the problem that it is designed to address that already we can see situations where it fails at both ends of its intended range: firstly, in that there are already architectures where a long long long type would be needed (or at least usefully expressive), and secondly, in that many architectures where C is used have no need or use for a 64-bit type. It also breaks the most important single rule that should, in my view, guide the formulation of a standard for any language called C: it is not compatible with the existing Standard.

I think this area needs a radical reappraisal, more along the lines of the way that function prototypes were introduced in the existing Standard: a leap that will show implementors the way, rather than an attempt to codify a confusing and muddled set of ad hoc existing practices. The remainder of this section describes my view of a possible basis for such a reappraisal. It is purely a sketch, leaving many problems unaddressed; however, I believe that it could be carried forward to give a much smaller, simpler C Standard than the one under ballot.

The basic idea would be to specify a ``base´´ or ``implementation´´ layer of the language, and then a mechanism (possibly a pragma, a header file, or some completely new syntax) to allow a programmer to bind the implementation to a particular ``application´´ of the language within a translation unit. I would imagine this normally being done as a ``compiler option´´.

The implementation layer would provide a set of integer types: one of 8 bits or more, one of 16 bits or more, and one of 32 bits or more (which might all be the same). These would have names of the form int:n, where n is the number of bits in the type (an integral constant expression). A typical implementation might provide int:8, int:16 and int:32; another might provide int:9 and int:36; another might provide only int:64. A bit-addressable architecture might choose to provide int:1. These types, however many of them there are, completely enumerate those available to the programmer, and they may be used in this form if required. (Each member of the set may be provided in hardware or in software or in any mixture.) There would always be an unsigned type corresponding to each signed type.
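The set of types described above can be approximated in C as it exists today. The following is only a sketch under stated assumptions: the spelling int:n is not valid C, so the names int_8, int_16 and int_32 (and the STORAGE_BITS macro) are hypothetical stand-ins, built on the least-width types of <stdint.h>.

```c
#include <assert.h>   /* static_assert (C11) */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* int_leastN_t, uint_leastN_t */

/* Hypothetical spellings: the text writes these types as int:8, int:16 and
   int:32, which is not valid C syntax. */
typedef int_least8_t   int_8;
typedef int_least16_t  int_16;
typedef int_least32_t  int_32;

/* An unsigned type corresponding to each signed type. */
typedef uint_least8_t  uint_8;
typedef uint_least16_t uint_16;
typedef uint_least32_t uint_32;

/* Storage size of a type in bits; this is an upper bound on its width,
   since it also counts any padding bits. */
#define STORAGE_BITS(T) (sizeof(T) * CHAR_BIT)

/* The layer promises "n bits or more" for each member of the set. */
static_assert(STORAGE_BITS(int_8)  >= 8,  "int_8 must hold at least 8 bits");
static_assert(STORAGE_BITS(int_16) >= 16, "int_16 must hold at least 16 bits");
static_assert(STORAGE_BITS(int_32) >= 32, "int_32 must hold at least 32 bits");
```

On an implementation with 9-bit bytes, int_8 in this sketch would simply be a 9-bit type, which matches the "or more" wording of the proposal.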

Having exposed all the integer sizes in the implementation layer, it is then the job of the ``application layer´´ to conceal them all again, giving us our compatibility with the existing Standard, and our flexibility for the future.

All the rest of the language and library would have to be visited, and mechanisms defined to enable all the facilities to work orthogonally with such a set of base types. This would be a long job (but there is some prior art in the transition from FORTRAN 66 ``intrinsic functions´´ to FORTRAN 77 ``specific´´ and ``generic´´ intrinsics). Promotion behaviours for arithmetic operations would have to be defined, and most library functions would have to be extended: for example, there would have to be a version of strlen() returning each integer type, the obvious names being strlen8(), strlen16(), strlen32() etc. The facilities provided should be syntactically ``simple´´: rather than providing macros which expand to printf() format strings, for example (the <inttypes.h> approach, which it seems to me is too cumbersome ever to enjoy widespread use), there should be a much easier-to-use convention (e g, "%,nd" for an n-bit type). Whatever approach was to be taken, all the facilities of the language would be available in the pure implementation layer. Clearly there would be no argument promotions in this language, for example, since they would be impossible, or at least difficult, to specify. There is extensive prior art in this area, not least the polymorphic functions of C++.
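As a rough illustration of the sized library functions, the sketch below wraps the ordinary strlen(). Several things here are assumptions rather than part of the proposal: the unsigned return types, and the choice to treat a length that does not fit as a caller error caught by assert() (with NDEBUG defined, the value would be silently truncated). The helper format_i32() is a hypothetical name; it shows the existing <inttypes.h> macro style for comparison, since the "%,nd" convention is not something printf() accepts.

```c
#include <assert.h>
#include <inttypes.h>  /* PRIdLEAST32, plus the <stdint.h> types */
#include <stdio.h>
#include <string.h>

/* Assumption: a length that does not fit the return type is a caller error. */
uint_least8_t strlen8(const char *s)
{
    size_t n = strlen(s);
    assert(n <= UINT_LEAST8_MAX);
    return (uint_least8_t)n;
}

uint_least16_t strlen16(const char *s)
{
    size_t n = strlen(s);
    assert(n <= UINT_LEAST16_MAX);
    return (uint_least16_t)n;
}

uint_least32_t strlen32(const char *s)
{
    size_t n = strlen(s);
    assert(n <= UINT_LEAST32_MAX);
    return (uint_least32_t)n;
}

/* Hypothetical helper: formats a 32-bit-or-wider value using the
   <inttypes.h> macro, i.e. the style the text calls cumbersome.
   Returns the number of characters snprintf() would have written. */
int format_i32(char *buf, size_t size, int_least32_t value)
{
    return snprintf(buf, size, "%" PRIdLEAST32, value);
}
```

The format string "%" PRIdLEAST32 is the part that a convention such as "%,32d" would replace.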

The application layer would then provide a facility to ``bind´´ the implementation layer to a language identical to existing Standard C. As I described it above, the syntax for specifying which binding is required is not clear: if it were done by a pragma, for instance, we could imagine that

      #pragma I16LP32
or something roughly equivalent, would specify a binding where int was bound to int:16, long to int:32, strlen() to strlen32(), "%d" to "%,16d", "%ld" to "%,32d" etc. All the constraints of Standard C would apply for each binding: int would always be the fastest type, long the longest type used by any standard typedef, etc. An ``integer constant´´ (syntactic concept) would always have type int (semantic concept), in the absence of overflow. It is important that the binding is only relevant at the translation-unit level---a programme can happily use strlen32() and strlen16(), either by calling strlen() in the presence of different bindings, or just by calling them directly. int can represent a different type in a different translation unit, and int in one translation unit might be the same type as long in another. (There is one area which might be called an incompatibility with C, but is not: in a binding where int and long are both int:32, for example, it would not be a type error to use one where the other was intended (they are the same type). However, it is not really an incompatibility, as it has no effect for correct programmes.)
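The effect of a binding can be imitated, very loosely, with the preprocessor. In this sketch every name is hypothetical: the macro BIND_I16LP32 stands in for the pragma, and app_int and app_long stand in for int and long, since the real keywords cannot be rebound in existing C. The second branch (BIND_I32LP64) is an invented example of an alternative binding.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* size_t */
#include <stdint.h>

/* Hypothetical stand-in for "#pragma I16LP32"; in practice this would be
   chosen per translation unit, for example by a compiler option. */
#define BIND_I16LP32 1

#if defined(BIND_I16LP32)
typedef int_least16_t app_int;
typedef int_least32_t app_long;
#define APP_INT_MAX  INT_LEAST16_MAX
#define APP_LONG_MAX INT_LEAST32_MAX
#elif defined(BIND_I32LP64)
typedef int_least32_t app_int;
typedef int_least64_t app_long;
#define APP_INT_MAX  INT_LEAST32_MAX
#define APP_LONG_MAX INT_LEAST64_MAX
#else
#error "no binding selected"
#endif

/* A constraint that must hold under every binding. */
static_assert(APP_LONG_MAX >= APP_INT_MAX,
              "long must be at least as wide as int");

/* Application code is written once against the bound names and is
   unchanged when the binding above is switched. */
app_long app_sum(const app_int *values, size_t count)
{
    app_long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}
```

Switching the macro to the other binding widens both types without touching app_sum(), which is the kind of rebinding the proposal has in mind.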

A programme composed entirely of existing C-compatible syntax in translation units all translated with the same binding would under this scheme always be compatible with existing C. If the programmer mixes translation units with different bindings, some incompatibilities might arise (since a variable defined in one translation unit might be longer than long in another). However, it is easy to imagine situations where this would be a useful thing to do, and it could never happen by accident.

Having taken this huge leap with the integer types, the same job should be repeated for floating point and complex types: an implementation set of types (e g float:32, float:64 and float:80 or whatever) provided by the implementation, together with bindings to conceal the sizes. The <tgmath.h> facilities would be superseded and replaced by these ideas. (In fact, this whole scheme could be seen as a proposal to extend the ideas of <tgmath.h> to cover all integer, floating point and complex facilities uniformly.)

It would also be possible to offer a corresponding simplification in the area of character handling by using the fact that there are 2 representations for characters, plain and wide, for each of which corresponding facilities are provided. (There is also a third, multibyte, which is not treated uniformly with the others.) A binding facility would be introduced so that the various string handling functions were generic in the type of their argument: the generic call len (s) would invoke the correct one of strlen() or wcslen(). The scheme could be extended to include multibyte characters by introducing a type for them---let's say, mbchar_t---and making the character handling functions cope with any of the 3 different types. This would allow, e g, the familiar printf() function to be used for any of the 3 types.
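A generic len (s) of this kind can be sketched with the _Generic selection that was later added to the language in C11; it is not part of the draft under discussion, and it is only one possible mechanism. The macro name len follows the text. The multibyte type mbchar_t is not covered here.

```c
#include <assert.h>   /* static_assert (C11) */
#include <string.h>   /* strlen */
#include <wchar.h>    /* wcslen, wchar_t */

/* Assumption checked at compile time: a string literal used as the
   controlling expression decays to a pointer, as current compilers do. */
static_assert(_Generic("", char *: 1, default: 0),
              "string literals are expected to decay to char *");

/* Selects strlen() for plain strings and wcslen() for wide strings,
   based on the static type of the argument. */
#define len(s) _Generic((s),          \
        char *:          strlen,      \
        const char *:    strlen,      \
        wchar_t *:       wcslen,      \
        const wchar_t *: wcslen)(s)
```

With this definition, len("abc") calls strlen() and len(L"abc") calls wcslen(); an argument of any other type is rejected at compile time.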

The functions memmove()/memcpy() also differ in the types of their arguments, so could be polymorphically merged along the lines described above.

An approach like this would provide the following key features:

Small (32-bit) programmes could easily be converted into 64-bit programmes just by rebinding them to an application layer (in the sense above) with 64-bit longs; more-sophisticated programmes might choose to have i/o done at 64-bit bindings, but most calculation done at 32-bit, for example. I think it would also result in a language where correct programmes were more often written by beginners and casual programmers, who do comprise a substantial part of the C language's user-base, while also allowing experts and academics a sound basis for creating, understanding and modifying complex systems in C.

Conclusion

Because of these issues, I urge the U K delegation to vote NO at the ballot for this Draft Standard.


  /|  Jonathan Coxhead
 (_|/ Origin Technology in Business, Inc
  /|  2518 Mission College Blvd #101
 (_/  Santa Clara, CA 95054-1215