[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] Fix error: array subscript has type 'char'



On Thu, Jan 14, 2021 at 02:25:05PM +0100, Jan Beulich wrote:
> On 14.01.2021 13:29, Manuel Bouyer wrote:
> > On Thu, Jan 14, 2021 at 11:53:20AM +0100, Jan Beulich wrote:
> >> On 12.01.2021 19:12, Manuel Bouyer wrote:
> >>> From: Manuel Bouyer <bouyer@xxxxxxxxxx>
> >>>
> >>> Use unsigned char variable, or cast to (unsigned char), for
> >>> tolower()/islower() and friends. Fix compiler error
> >>> array subscript has type 'char' [-Werror=char-subscripts]
> >>
> >> Isn't this something that wants changing in your ctype.h instead?
> >> the functions (or macros), as per the C standard, ought to accept
> >> plain char aiui, and if they use the input as an array subscript,
> >> it should be their implementation suitably converting type first.
> > 
> > I asked for inputs from NetBSD developers familiar with this part.
> > 
> > Although the parameter is an int, only a subset of values is valid,
> > as stated in ISO C 2018 (Section 7.4 paragrah 1):
> >> In all cases the argument is an int, the value of which shall be
> >> representable as an unsigned char or shall equal the value of the
> >> macro EOF.  If the argument has any other value, the behavior is 
> >> undefined.                               
> 
> Which means you're shifting the undefined-ness from the implementation
> (using the value as array index) to the callers (truncating values, or
> converting value range). In particular I do not think that the
> intention behind the standard's wording is for every caller to need to
> cast to unsigned char, when e.g. character literals are of type char
> and string literals are of type const char[].

If you don't cast you fall into the undefined behavior case for non-ascii
characters. For example, "é" in iso-latin-1 is 0xe9. In a signed char, this is
-23 (decimal). without the cast, tolower() will see -23.
If it is casted to (unsigned char) first, tolower() will see 233, as expected.
The following test program illustrates this:
armandeche:/tmp>cat toto.c
#include <stdio.h>

int
main(int _c, const char *_v[])
{
        char c = 0xe9;
        printf("%d %d\n", (int)c, (int)(unsigned char)c);
}
armandeche:/tmp>./toto 
-23 233



> 
> > As stated by NetBSD's ctype(3) manual page, NetBSD and glibc took different
> > approach. NetBSD emits a compile-time warning if the input may lead to
> > undefined behavior. quoting the man page:
> >      Some implementations of libc, such as glibc as of 2018, attempt to 
> > avoid
> >      the worst of the undefined behavior by defining the functions to work 
> > for
> >      all integer inputs representable by either unsigned char or char, and
> >      suppress the warning.  However, this is not an excuse for avoiding
> >      conversion to unsigned char: if EOF coincides with any such value, as 
> > it
> >      does when it is -1 on platforms with signed char, programs that pass 
> > char
> >      will still necessarily confuse the classification and mapping of EOF 
> > with
> >      the classification and mapping of some non-EOF inputs.
> > 
> > 
> > So, although no warning is emmited on linux, it looks like to me that the
> > cast to unsigned char is needed anyway, and relying on glibc's behavior
> > is not portable.
> 
> I fully agree we shouldn't rely on glibc's behavior (I'm sure
> you've looked at xen/include/xen/ctype.h to see how we address
> this it Xen itself, but I will admit this is to a degree comparing
> apples and oranges, not the least because the lack of a need to
> consider EOF in Xen). At least in xen/tools/symbols.c I don't
> think we're at risk of running into "undefined" space. Casts are

as long as there's only ascii characters.

Anyway NetBSD won't change its ctype.h

-- 
Manuel Bouyer <bouyer@xxxxxxxxxxxxxxx>
     NetBSD: 26 ans d'experience feront toujours la difference
--



 


Rackspace

Lists.xenproject.org is hosted with RackSpace, monitoring our
servers 24x7x365 and backed by RackSpace's Fanatical Support®.