is_ascii() or is_binary() for files?
Brad  
Newsgroups: comp.lang.c++
From: Brad <b...@16systems.com>
Date: Sat, 05 Jul 2008 15:13:45 -0400
Local: Sat, Jul 5 2008 3:13 pm
Subject: is_ascii() or is_binary() for files?
Is there a way to determine whether a file is plain ascii text or not
using standard C++?

osmium  
Newsgroups: comp.lang.c++
From: "osmium" <r124c4u...@comcast.net>
Date: Sat, 5 Jul 2008 12:22:21 -0700
Local: Sat, Jul 5 2008 3:22 pm
Subject: Re: is_ascii() or is_binary() for files?

"Brad" wrote:
> Is there a way to determine whether a file is plain ascii text or not
> using standard C++?

No.  It's in the eye of the beholder.  You can make a very good guess by
counting control characters that are unlikely to appear in text.  But
a binary file might not contain any of them either.
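
The counting idea above could be sketched roughly as follows. This is only an illustration: the names suspicious_control_fraction and looks_like_text, the choice of which control characters count as "unlikely in text", and the default threshold are all assumptions, not anything the post specifies.

```cpp
#include <cstddef>
#include <string>

// Fraction of bytes in `data` that are control characters unlikely to
// appear in text. Assumption: tab, LF, VT, FF and CR (0x09..0x0D) are
// treated as normal; every other byte below 0x20, plus DEL (0x7F), is
// counted as suspicious. Returns 0.0 for empty input.
double suspicious_control_fraction(const std::string& data)
{
    if (data.empty()) {
        return 0.0;
    }
    std::size_t suspicious = 0;
    for (char c : data) {
        // Go through unsigned char so the comparisons behave the same
        // whether plain char is signed or unsigned.
        const unsigned char b = static_cast<unsigned char>(c);
        const bool usual_whitespace = (b >= 0x09 && b <= 0x0D);
        if ((b < 0x20 && !usual_whitespace) || b == 0x7F) {
            ++suspicious;
        }
    }
    return static_cast<double>(suspicious) / static_cast<double>(data.size());
}

// Guess "text" if at most `threshold` of the bytes are suspicious.
// The default of 0.0 (no suspicious bytes allowed) is an arbitrary choice.
bool looks_like_text(const std::string& data, double threshold = 0.0)
{
    return suspicious_control_fraction(data) <= threshold;
}
```

As the post says, this is a guess: a binary file with no such bytes would still be reported as text.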

Sherman Pendley  
Newsgroups: comp.lang.c++
From: Sherman Pendley <spamt...@dot-app.org>
Date: Sat, 05 Jul 2008 15:22:51 -0400
Local: Sat, Jul 5 2008 3:22 pm
Subject: Re: is_ascii() or is_binary() for files?

Brad <b...@16systems.com> writes:
> Is there a way to determine whether a file is plain ascii text or not
> using standard C++?

Sure, just read its contents and look for any byte that's > 127. If
you find one, the file's contents are not plain ASCII.

sherm--

--
My blog: http://shermspace.blogspot.com
Cocoa programming in Perl: http://camelbones.sourceforge.net
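
A minimal sketch of the "look for any byte > 127" check described above, using only standard C++. The function names is_plain_ascii and file_is_plain_ascii and the 4096-byte buffer size are hypothetical choices made for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// True if none of the `size` bytes at `data` is greater than 127.
bool is_plain_ascii(const char* data, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i) {
        // Plain char may be signed, in which case a byte above 127 shows
        // up as a negative value; convert to unsigned char before comparing.
        if (static_cast<unsigned char>(data[i]) > 127) {
            return false;
        }
    }
    return true;
}

// Scans a file chunk by chunk. Returns false if the file cannot be opened
// or if any byte is greater than 127. An empty file counts as ASCII.
bool file_is_plain_ascii(const std::string& path)
{
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        return false;
    }
    char buffer[4096];
    // read() reports failure on a short final chunk, so gcount() is
    // checked as well to make sure that last chunk is still examined.
    while (file.read(buffer, sizeof buffer) || file.gcount() > 0) {
        if (!is_plain_ascii(buffer, static_cast<std::size_t>(file.gcount()))) {
            return false;
        }
    }
    return true;
}
```

Opening the stream with std::ios::binary avoids newline translation, so the bytes examined are the bytes on disk.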


Medvedev  
Newsgroups: comp.lang.c++
From: Medvedev <3D.v.Wo...@gmail.com>
Date: Sat, 5 Jul 2008 12:45:55 -0700 (PDT)
Local: Sat, Jul 5 2008 3:45 pm
Subject: Re: is_ascii() or is_binary() for files?
On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:

> Brad <b...@16systems.com> writes:
> > Is there a way to determine whether a file is plain ascii text or not
> > using standard C++?

> Sure, just read its contents and look for any byte that's > 127. If
> you find one, the file's contents are not plain ASCII.

If he tests a text file that contains non-English text, the check
will fail, because non-English characters are > 127.

red floyd  
Newsgroups: comp.lang.c++
From: red floyd <no.spam.h...@example.com>
Date: Sat, 05 Jul 2008 12:58:09 -0700
Local: Sat, Jul 5 2008 3:58 pm
Subject: Re: is_ascii() or is_binary() for files?

Medvedev wrote:
> On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
>> Brad <b...@16systems.com> writes:
>>> Is there a way to determine whether a file is plain ascii text or not
>>> using standard C++?
>> Sure, just read its contents and look for any byte that's > 127. If
>> you find one, the file's contents are not plain ASCII.

> if he try to test in a text file which contain non-English text , he
> will fail!!
> because non-English char are > 127

OP specified ASCII, not non-English text.

Medvedev  
Newsgroups: comp.lang.c++
From: Medvedev <3D.v.Wo...@gmail.com>
Date: Sat, 5 Jul 2008 13:01:43 -0700 (PDT)
Local: Sat, Jul 5 2008 4:01 pm
Subject: Re: is_ascii() or is_binary() for files?
On Jul 5, 11:45 am, Medvedev <3D.v.Wo...@gmail.com> wrote:

> On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:

> > Brad <b...@16systems.com> writes:
> > > Is there a way to determine whether a file is plain ascii text or not
> > > using standard C++?

> > Sure, just read its contents and look for any byte that's > 127. If
> > you find one, the file's contents are not plain ASCII.

> if he try to test in a text file which contain non-English text , he
> will fail!!
> because non-English char are > 127

Sorry, you are right.
I found that non-English characters show up as negative values,
and a binary file is one whose bytes MAY BE > 127,
since a byte can hold any of 256 bit patterns.

source:
http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html


Sherman Pendley  
Newsgroups: comp.lang.c++
From: Sherman Pendley <spamt...@dot-app.org>
Date: Sat, 05 Jul 2008 16:04:05 -0400
Local: Sat, Jul 5 2008 4:04 pm
Subject: Re: is_ascii() or is_binary() for files?

Medvedev <3D.v.Wo...@gmail.com> writes:
> On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
>> Brad <b...@16systems.com> writes:
>> > Is there a way to determine whether a file is plain ascii text or not
>> > using standard C++?

>> Sure, just read its contents and look for any byte that's > 127. If
>> you find one, the file's contents are not plain ASCII.

> if he try to test in a text file which contain non-English text , he
> will fail!!

Exactly as it should.

> because non-English char are > 127

In other words, they're not plain ASCII. :-)

sherm--

--
My blog: http://shermspace.blogspot.com
Cocoa programming in Perl: http://camelbones.sourceforge.net


James Kanze  
Newsgroups: comp.lang.c++
From: James Kanze <james.ka...@gmail.com>
Date: Sat, 5 Jul 2008 15:28:31 -0700 (PDT)
Local: Sat, Jul 5 2008 6:28 pm
Subject: Re: is_ascii() or is_binary() for files?
On Jul 5, 9:45 pm, Medvedev <3D.v.Wo...@gmail.com> wrote:

> On Jul 5, 11:22 am, Sherman Pendley <spamt...@dot-app.org> wrote:
> > Brad <b...@16systems.com> writes:
> > > Is there a way to determine whether a file is plain ascii text or not
> > > using standard C++?
> > Sure, just read its contents and look for any byte that's > 127. If
> > you find one, the file's contents are not plain ASCII.
> if he try to test in a text file which contain non-English
> text , he will fail!!  because non-English char are > 127

ASCII is a seven bit code, so no characters are greater than
127 in it.

Of course, just because you don't find any characters greater
than 127 doesn't mean that it is ASCII.  It could still be ISO
8859-1, or UTF-8, in which, by chance, none of the characters
happen to be greater than 127.  (Or it could be that plain char
is signed on your machine, in which case, it can't contain a
value greater than 127, regardless of the encoding:-).)

--
James Kanze (GABI Software)             email:james.ka...@gmail.com
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34


Brad  
Newsgroups: comp.lang.c++
From: Brad <b...@16systems.com>
Date: Sat, 05 Jul 2008 20:48:59 -0400
Local: Sat, Jul 5 2008 8:48 pm
Subject: Re: is_ascii() or is_binary() for files?
Stefan Ram wrote:
> Brad <b...@16systems.com> writes:
>> Is there a way to determine whether a file is plain ascii text
>> or not using standard C++?

>   If someone can define in words when a file is deemed to be a
>   »a plain ascii text« without ambiguity and for each possible
>   file, I am sure that then this newsgroup will be able to
>   help to implement a test for it in C++.

 > ...

Thanks for all the responses. The program recurses through a directory
processing files. I do not know beforehand what type of files the
program may encounter. The processing is simply reading the file and
passing its content to a regular expression to search for certain strings.

Binary files cause problems, so I thought if I could just skip them and
only read ASCII and perhaps UTF-8 encoded files, things would be better.
That led to my initial question. Later I could learn how to deal with
binary files that I may want to search like PDF and MS Office documents.
Just curious if standard C++ had some built-in function that made this easy.

Thanks again,

Brad


Sam  
Newsgroups: comp.lang.c++
From: Sam <s...@email-scan.com>
Date: Sat, 05 Jul 2008 20:52:44 -0500
Local: Sat, Jul 5 2008 9:52 pm
Subject: Re: is_ascii() or is_binary() for files?

Brad writes:
> That lead to my initial question. Later I could learn how to deal with
> binary files that I may want to search like PDF and MS Office documents.
> Just curious if standard C++ had some built-in function that made this easy.

No. The only 'built-in' function of any kind is one to test if a single
character belongs in a given character class: isascii() and its equivalents.
It's up to you to scan the entire contents of the file, to classify it.

In POSIX, you might be able to get away with opening a file, stat()ing its
contents, to get the file's size, mmap-ing the file into memory, then using
std::find_if() to search for non-ascii bytes. Of course, if you hit a 4gb
file, that might cause ...problems.
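
The POSIX route described above might look something like this sketch. It is not portable standard C++. The names is_non_ascii and mapped_file_is_ascii and the three-way return convention are assumptions made for illustration, and fstat() on the open descriptor is used here in place of stat().

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Predicate for std::find_if: true for bytes outside 7-bit ASCII.
bool is_non_ascii(unsigned char b)
{
    return b > 127;
}

// Returns 1 if the file holds only ASCII bytes, 0 if a non-ASCII byte
// was found, and -1 on error (cannot open, stat, or map the file).
int mapped_file_is_ascii(const char* path)
{
    const int fd = open(path, O_RDONLY);
    if (fd == -1) {
        return -1;
    }

    struct stat info;
    if (fstat(fd, &info) == -1) {
        close(fd);
        return -1;
    }
    if (info.st_size == 0) {
        close(fd);
        return 1;  // nothing to scan, and mmap rejects a length of 0
    }
    // Refuse files whose size does not fit in size_t (the large-file
    // problem mentioned above on systems with a 32-bit address space).
    if (info.st_size < 0 ||
        static_cast<unsigned long long>(info.st_size) >
            std::numeric_limits<std::size_t>::max()) {
        close(fd);
        return -1;
    }

    const std::size_t size = static_cast<std::size_t>(info.st_size);
    void* mapping = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping remains valid after the descriptor is closed
    if (mapping == MAP_FAILED) {
        return -1;
    }

    const unsigned char* first = static_cast<const unsigned char*>(mapping);
    const unsigned char* last = first + size;
    const bool ascii = std::find_if(first, last, is_non_ascii) == last;

    munmap(mapping, size);
    return ascii ? 1 : 0;
}
```

Even when the size fits in size_t, mmap can still fail for a very large file if the address space is too small; in this sketch that case is reported as -1.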

Erik Wikström  
Newsgroups: comp.lang.c++
From: Erik Wikström <Erik-wikst...@telia.com>
Date: Sun, 06 Jul 2008 09:18:34 GMT
Local: Sun, Jul 6 2008 5:18 am
Subject: Re: is_ascii() or is_binary() for files?
On 2008-07-06 02:48, Brad wrote:

The simplest way to solve your problem is probably to impose some
additional constraints, such as requiring that text files have a name
ending with ".txt" or that you only guarantee correct operation if no
non-ASCII files are in the directory.

If you are running on a POSIX system you can also use the 'file' program
which tries to figure out what kind of contents a file has.

--
Erik Wikström


James Kanze  
Newsgroups: comp.lang.c++
From: James Kanze <james.ka...@gmail.com>
Date: Sun, 6 Jul 2008 03:29:34 -0700 (PDT)
Local: Sun, Jul 6 2008 6:29 am
Subject: Re: is_ascii() or is_binary() for files?
On Jul 6, 3:52 am, Sam <s...@email-scan.com> wrote:

> Brad writes:
> > That lead to my initial question. Later I could learn how to
> > deal with binary files that I may want to search like PDF
> > and MS Office documents.  Just curious if standard C++ had
> > some built-in function that made this easy.
> No. The only 'built-in' function of any kind is one to test if
> a single character belongs in a given character class:
> isascii() and its equivalents.  It's up to you to scan the
> entire contents of the file, to classify it.

There is no isascii function, and the other isxxx functions are
locale dependent (and don't really work for narrow characters
anyway).  There are heuristics for "guessing" the type of
contents of a file, but they're just that, heuristics, and none
are 100% certain.

Most systems have various conventions which may reveal the type,
but those are also just conventions, and individual files may
actually violate them: you can give a text file a name ending
with .exe under Windows, and there's nothing to prevent a binary
file from starting with something that looks like
"<!DOCTYPE..." on any system.

> In POSIX, you might be able to get away with opening a file,
> stat()ing its contents, to get the file's size, mmap-ing the
> file into memory, then using std::find_if() to search for
> non-ascii bytes. Of course, if you hit a 4gb file, that might
> cause ...problems.

Under most Unix systems, you'd probably read the first N bytes
(maybe 512, although that's a lot more than would typically be
necessary), and then exploit magic.  For that matter,
*generally*, reading the first 512 bytes, then looking for
characters outside the set 0x07-0x0D and 0x20-0x7E, is probably
a pretty good heuristic; the probability of your guessing wrong
is pretty slim (but of course, it will treat non-ascii text
files as binary).

--
James Kanze (GABI Software)             email:james.ka...@gmail.com
Conseils en informatique orientée objet/
                   Beratung in objektorientierter Datenverarbeitung
9 place Sémard, 78210 St.-Cyr-l'École, France, +33 (0)1 30 23 00 34

