
BLU Discuss list archive



[Discuss] "M-" notation



On Fri, Jul 06, 2012 at 02:17:25PM -0400, Jerry Feldman wrote:
> >> Someone mentioned unicode. There are a number of unicode to ascii
> >> converters.
> > Can you elaborate?  I can't see how this would work unless the
> > "Unicode" file contained only a subset of Unicode which corresponds to
> > the 7-bit ASCII character set... 
[...]
> Normally, we think of Unicode as 16-bit (UTF-16). It can be UTF-7 or
> UTF-8. A true ASCII file is 7-bits. 

Normally?  Well, I don't.  On Unix, Unicode usually means UTF-8, the
first 128 code points of which == ASCII.  There's that other system
that usually uses UTF-16, but I only use that to play games (and this
is a Unix list after all)...  =8^)
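
To make the ASCII point concrete, here is a minimal sketch of a UTF-8
encoder.  The name utf8_encode is a hypothetical helper for
illustration, not something from wcat or any library.  For code points
below 128 the output is a single byte with the same value, which is
exactly the ASCII byte.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-8 into out (room for 4 bytes).
   Returns the number of bytes written, or 0 if cp cannot be encoded
   (a UTF-16 surrogate or a value above U+10FFFF). */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* ASCII: one identical byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* two-byte sequence */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)     /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                   /* three-byte sequence */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* four-byte sequence */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Anything at or above 128 becomes a multi-byte sequence in which every
byte has the high bit set, so a 7-bit-only tool can never mistake part
of it for an ASCII character.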

> It has been a while since I have played with encodings, but you
> certainly can express unicode in ASCII by encoding the exceptions as
> escape characters.

Challenge accepted!  On the off chance that Daniel finds this useful,
or anyone finds it remotely interesting, attached is a very short and
fairly crappy C program called wcat that does something closer to what
cat -v probably should do, at least when the locale is UTF-8, and I
think probably any time it's not "C".  Though, confirming that theory
involves more shuffling of my locale than I care to bother with.  I
didn't bother to hunt down the code, but I'm guessing that cat just
assumes Latin1, and maintains a table of (or just calculates) the
characters to replace based on that.
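
For comparison, the notation that cat -v prints can be computed one
byte at a time.  The sketch below follows the commonly documented
behavior of GNU cat -v (caret notation for control characters, "M-"
for bytes with the high bit set); it was not taken from the cat
source, and visible_byte is a hypothetical name.

```c
#include <stdio.h>

/* Write the cat -v style notation for byte c into buf, which must hold
   at least 5 bytes ("M-^X" plus the terminating NUL). */
void visible_byte(unsigned char c, char *buf)
{
    char *p = buf;

    if (c == '\t' || c == '\n') {   /* cat -v passes these through */
        *p++ = (char)c;
        *p = '\0';
        return;
    }
    if (c >= 128) {                 /* high bit set: "M-" + low 7 bits */
        *p++ = 'M';
        *p++ = '-';
        c -= 128;
    }
    if (c < 32) {                   /* control char: ^@ through ^_ */
        *p++ = '^';
        *p++ = (char)(c + 64);
    } else if (c == 127) {          /* DEL */
        *p++ = '^';
        *p++ = '?';
    } else {                        /* ordinary printable ASCII */
        *p++ = (char)c;
    }
    *p = '\0';
}
```

Under this mapping the two UTF-8 bytes of U+00E9 (0xC3 0xA9) come out
as "M-C" followed by "M-)", which is why UTF-8 text looks so odd
through cat -v.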

Now, there are lots of places where this could be more robust.  It
pretty much ignores most errors, and it could print caret notation
for control chars, etc.  Instead it just prints out the 32-bit hex
representation of the character code, whenever the character is not
printable and not white space.  A more ambitious version could map
every Unicode character to its symbolic name, and print that instead
when the character is deemed non-printable (though IIRC Unicode has
very few characters considered non-printable after the control
characters in basic ASCII).  I didn't want to spend more than an hour
on this. :)
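
A small step toward the symbolic-name idea, limited to the ASCII
control characters, might look like the sketch below.  The function
control_name and its table are hypothetical additions, not part of
wcat; the abbreviations are the standard ASCII ones.

```c
#include <stddef.h>
#include <wchar.h>

/* Standard abbreviations for the 32 C0 control characters. */
static const char *const c0_names[32] = {
    "NUL", "SOH", "STX", "ETX", "EOT", "ENQ", "ACK", "BEL",
    "BS",  "HT",  "LF",  "VT",  "FF",  "CR",  "SO",  "SI",
    "DLE", "DC1", "DC2", "DC3", "DC4", "NAK", "SYN", "ETB",
    "CAN", "EM",  "SUB", "ESC", "FS",  "GS",  "RS",  "US"
};

/* Return the symbolic name of an ASCII control character,
   or NULL if ch is not one. */
const char *control_name(wchar_t ch)
{
    unsigned long u = (unsigned long)ch;

    if (u < 32) return c0_names[u];
    if (u == 0x7F) return "DEL";
    return NULL;
}
```

In wcat's else branch, a non-NULL result could be printed in place of
the hex value, falling back to hex for everything else.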

I tested it by wcatting my unusual chars file, and a file I generated
containing the first 32 ASCII chars (also attached, in the extremely
off chance anyone wants to compare the output of this to cat -v).  I
was able to see all of the characters I expected to be able to see,
and got hex numbers for the control chars that don't translate to
white space, all as expected.  Of course, the files are UTF-8, so if
your environment isn't, it will still look like gobbledygook (prolly
lots of hex numbers with little reversed question marks, mostly),
though very different gobbledygook than what cat -v would produce.
Just like the cat program, you can use it to cat multiple files
together on the command line.
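
A test file like the control-character one can be regenerated with a
few lines of C.  This is only a sketch of one way to do it, not
necessarily how the attached file was made, and write_control_chars
is a hypothetical name.

```c
#include <stdio.h>

/* Write the bytes 0x00 through 0x1F to out.
   Returns the number of bytes successfully written. */
size_t write_control_chars(FILE *out)
{
    size_t n = 0;

    for (int c = 0; c < 32; c++) {
        if (fputc(c, out) == EOF)
            break;
        n++;
    }
    return n;
}
```

Calling it on a stream opened with fopen("control_chars.txt", "wb")
gives a 32-byte file to feed to both wcat and cat -v.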

You might need -std=c99 when compiling, i.e.:

$ gcc -std=c99 -o wcat wcat.c

$ ./wcat unusual_chars.txt control_chars.txt

-- 
Derek D. Martin    http://www.pizzashack.org/   GPG Key ID: 0xDFBEAD02
-=-=-=-=-
This message is posted from an invalid address.  Replying to it will result in
undeliverable mail due to spam prevention.  Sorry for the inconvenience.

-------------- next part --------------
#include <stdio.h>
#include <stdlib.h>
#include <wctype.h>
#include <wchar.h>
#include <locale.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
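    struct stat fdetails;
    wchar_t ch;     /* mbtowc() stores a wchar_t, not a wint_t */
    int fd;
    int n;
    char *m;
    char *s;
    char *end;

    /* adopt the user's locale so mbtowc() decodes its encoding */
    setlocale(LC_ALL, "");

    argv++;
    while (*argv){
        fd = open(*argv, O_RDONLY);
        argv++;
        if (fd == -1) continue;
        /* skip files we can't stat, and empty ones (mmap fails on 0 bytes) */
        if (fstat(fd, &fdetails) == -1 || fdetails.st_size == 0) {
            close(fd);
            continue;
        }

        m = (char *)mmap(NULL, fdetails.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (m == MAP_FAILED) {
            close(fd);
            continue;
        }
        s = m;
        end = m + fdetails.st_size;
        while (s < end) {
            /* ch is the wide char converted from the multi-byte char at s */
            n = mbtowc(&ch, s, (size_t)(end - s));
            if (n < 0) {
                /* invalid or truncated sequence: print the raw byte in hex,
                   reset the conversion state, and resync at the next byte */
                printf("\\0x%02x", (unsigned char)*s);
                mbtowc(NULL, NULL, 0);
                s = s + 1;
                continue;
            }
            /* if it's \0, mbtowc returns 0 bytes, but we need to move past */
            s = s + (n == 0 ? 1 : n);
            /* print if it's printable or white space (iswprint returns
               false for tab and newline, so test iswspace as well) */
            if (iswprint((wint_t)ch) || iswspace((wint_t)ch)){
                printf("%lc", (wint_t)ch);
            } else {
                /* otherwise, print the hex (wide chars are 32-bit ints) */
                printf("\\0x%08x", (unsigned int)ch);
            }
        }
        munmap(m, fdetails.st_size);
        close(fd);
    }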
    printf("\n");
    return 0;
}
-------------- next part --------------
[attachment: text file of non-ASCII characters; the archive replaced its contents with question marks]


-------------- next part --------------
	







