AppletTalk.com Forum Index
Java discussions newsgroups
 

auto-detecting the character set encoding of a text file

 
Guest






Posted: Tue Mar 28, 2006 11:12 pm    Post subject: auto-detecting the character set encoding of a text file



Hi,

I just wanted to say that I'm new here, so I apologize in advance in
case I make any mistakes :)

My problem is that I have a bunch of text files with various
character-set encodings, and I would need a method for detecting what
encoding a certain file uses. (so that I can later open that file and
begin reading from it, using the correct encoding)

Is there some way I can do this? Some of the encodings I suspect I will
come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
rule out that others are present.

/Martin Gerner
Roedy Green
Guest





Posted: Wed Mar 29, 2006 2:12 am    Post subject: Re: auto-detecting the character set encoding of a text file



On 28 Mar 2006 14:35:13 -0800, martin.gerner (AT) gmail (DOT) com wrote, quoted
or indirectly quoted someone who said :

Quote:
Is there some way I can do this? Some of the encodings I suspect I will
come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
rule out that others are present.

Nothing simple like an encoding field. See
http://mindprod.com/projects/encodingidentification.html
for some approaches.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Martin Gerner
Guest





Posted: Wed Mar 29, 2006 1:13 pm    Post subject: Re: auto-detecting the character set encoding of a text file



Roedy Green <my_email_is_posted_on_my_website (AT) munged (DOT) invalid> wrote in
news:cjqj22lm5dkd0e013odl3vnd8rt9ao4cdb (AT) 4ax (DOT) com:

Quote:
On 28 Mar 2006 14:35:13 -0800, martin.gerner (AT) gmail (DOT) com wrote, quoted
or indirectly quoted someone who said :

Is there some way I can do this? Some of the encodings I suspect I will
come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
rule out that others are present.

Nothing simple like an encoding field. See
http://mindprod.com/projects/encodingidentification.html
for some approaches.

Unfortunately, this didn't help me much. So I take it that there is no
nifty little class I can download that will do this detection for me?

To clarify, the files I will be working with are _not_ HTML or XML files,
but rather standard-text log files from IM clients.

/Martin Gerner
Thomas Weidenfeller
Guest





Posted: Wed Mar 29, 2006 2:12 pm    Post subject: Re: auto-detecting the character set encoding of a text file

martin.gerner (AT) gmail (DOT) com wrote:
Quote:
My problem is that I have a bunch of text files with various
character-set encodings, and I would need a method for detecting what
encoding a certain file uses. (so that I can later open that file and
begin reading from it, using the correct encoding)

Is there some way I can do this? Some of the encodings I suspect I will
come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
rule out that others are present.

You can't do it in a general way. To be sure, you have to know the
encoding in advance. You can apply some heuristics to guess an encoding,
but it will remain a guess.
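
As an illustration of what such heuristics could look like for the three encodings named in the question, here is a minimal sketch. The class and method names (EncodingGuesser, guess, isValidUtf8) are made up for this example, and the fallback order is an assumption, not an established algorithm: check for a byte order mark, then try a strict UTF-8 decode, then treat bytes in the range 0x80-0x9F as a hint for windows-1252, since ISO-8859-15 only has control codes there. It also assumes the JVM supports both the windows-1252 and ISO-8859-15 charsets. The result is still only a guess.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuesser {

    /** Hypothetical helper: returns a best guess, never a certainty. */
    public static Charset guess(byte[] data) {
        // 1. A byte order mark, if present, is the strongest hint.
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) return StandardCharsets.UTF_16BE;
            if (b0 == 0xFF && b1 == 0xFE) return StandardCharsets.UTF_16LE;
        }
        // 2. Text in a single-byte encoding that contains non-ASCII
        //    characters rarely happens to be well-formed UTF-8.
        if (isValidUtf8(data)) {
            return StandardCharsets.UTF_8;
        }
        // 3. Assumption: bytes 0x80-0x9F are printable characters in
        //    windows-1252 but C1 control codes in ISO-8859-15.
        for (byte b : data) {
            int u = b & 0xFF;
            if (u >= 0x80 && u <= 0x9F) {
                return Charset.forName("windows-1252");
            }
        }
        return Charset.forName("ISO-8859-15");
    }

    /** Strict decode: malformed input raises an exception instead of being replaced. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII input comes back as UTF-8 here, which is harmless because ASCII decodes identically in all three encodings. A file that uses only characters above 0x9F cannot be told apart between windows-1252 and ISO-8859-15 this way, so the last branch is the weakest part of the guess.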

/Thomas
--
The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/
Roedy Green
Guest





Posted: Wed Mar 29, 2006 6:12 pm    Post subject: Re: auto-detecting the character set encoding of a text file

On Wed, 29 Mar 2006 13:13:40 +0000 (UTC), Martin Gerner
<martin.gerner (AT) nospam (DOT) com> wrote, quoted or indirectly quoted someone
who said :

Quote:
Unfortunately, this didn't help me much. So I take it that there is no
nifty little class I can download that will do this detection for me?

Exactly. It is a messy problem.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Roedy Green
Guest





Posted: Wed Mar 29, 2006 7:12 pm    Post subject: Re: auto-detecting the character set encoding of a text file

On Wed, 29 Mar 2006 13:13:40 +0000 (UTC), Martin Gerner
<martin.gerner (AT) nospam (DOT) com> wrote, quoted or indirectly quoted someone
who said :

Quote:
To clarify, the files I will be working with are _not_ HTML or XML files,
but rather standard-text log files from IM clients.

If you have control over how these files are created, you could put
the name of the encoding at the front of the file, followed by a \n.
That would make your job much easier. Or you could tell everyone to use
UTF-8, which would make the problem disappear.
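
A rough sketch of the header-line idea, for the case where you do control the writer. The class name TaggedTextFile and its two methods are invented for this example. It assumes the charset name on the first line is written in US-ASCII and that the rest of the file is in the named encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TaggedTextFile {

    /** Produces: charset name, '\n', then the text encoded in that charset. */
    public static byte[] write(String text, Charset cs) {
        byte[] header = (cs.name() + "\n").getBytes(StandardCharsets.US_ASCII);
        byte[] body = text.getBytes(cs);
        byte[] out = new byte[header.length + body.length];
        System.arraycopy(header, 0, out, 0, header.length);
        System.arraycopy(body, 0, out, header.length, body.length);
        return out;
    }

    /** Reads the charset name up to the first '\n' and decodes the remainder with it. */
    public static String read(byte[] all) {
        int nl = 0;
        while (nl < all.length && all[nl] != '\n') {
            nl++;
        }
        if (nl == all.length) {
            throw new IllegalArgumentException("missing encoding header line");
        }
        String name = new String(all, 0, nl, StandardCharsets.US_ASCII).trim();
        // Throws an unchecked exception if the name is illegal or unsupported.
        Charset cs = Charset.forName(name);
        return new String(all, nl + 1, all.length - nl - 1, cs);
    }
}
```

Files that lack the header line are rejected rather than guessed at, so existing untagged logs would still need one of the other approaches.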

You might also do it by tracking the source of the file. You figure
out manually which encoding each source uses over which date range.
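
The source-tracking approach could be kept as a small lookup table, sketched below. Everything here is hypothetical: the class name SourceEncodingTable, the idea of identifying a source by a plain string, and any entries you put in it, which have to come from your own manual investigation.

```java
import java.nio.charset.Charset;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class SourceEncodingTable {

    /** One manually determined fact: this source used this charset in this date range. */
    private static final class Rule {
        final String source;
        final LocalDate from;
        final LocalDate to;
        final Charset charset;

        Rule(String source, LocalDate from, LocalDate to, Charset charset) {
            this.source = source;
            this.from = from;
            this.to = to;
            this.charset = charset;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Registers a rule; both dates are inclusive. */
    public void add(String source, LocalDate from, LocalDate to, Charset charset) {
        rules.add(new Rule(source, from, to, charset));
    }

    /** Returns the charset of the first matching rule, or empty if nothing is known. */
    public Optional<Charset> lookup(String source, LocalDate date) {
        for (Rule r : rules) {
            if (r.source.equals(source)
                    && !date.isBefore(r.from)
                    && !date.isAfter(r.to)) {
                return Optional.of(r.charset);
            }
        }
        return Optional.empty();
    }
}
```

An empty result means the table has no entry for that source and date, and the caller has to fall back on something else.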

The habit of not recording the encoding goes way back. The idea was
that documents were local and all encoded the same way. You did not
exchange documents with others, or if you did, you exchanged a whole
tape full, all encoded the same, so again the problem of identification
did not come up.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.