AppletTalk.com Java discussions newsgroups
Guest
Posted: Tue Mar 28, 2006 11:12 pm Post subject: auto-detecting the character set encoding of a text file

Hi,
I'm new here, so I apologize in advance for any mistakes :)
My problem is that I have a bunch of text files with various
character-set encodings, and I need a method for detecting which
encoding a given file uses (so that I can later open that file and
begin reading from it using the correct encoding).
Is there some way I can do this? Some of the encodings I suspect I will
come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
rule out that others are present.
/Martin Gerner

Roedy Green (Guest)
Posted: Wed Mar 29, 2006 2:12 am Post subject: Re: auto-detecting the character set encoding of a text file

On 28 Mar 2006 14:35:13 -0800, martin.gerner (AT) gmail (DOT) com wrote, quoted
or indirectly quoted someone who said :
> Is there some way I can do this? Some of the encodings I suspect I will
> come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
> rule out that others are present.
Nothing simple like an encoding field. See
http://mindprod.com/projects/encodingidentification.html
for some approaches.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Martin Gerner (Guest)
Posted: Wed Mar 29, 2006 1:13 pm Post subject: Re: auto-detecting the character set encoding of a text file

Roedy Green <my_email_is_posted_on_my_website (AT) munged (DOT) invalid> wrote in
news:cjqj22lm5dkd0e013odl3vnd8rt9ao4cdb (AT) 4ax (DOT) com:
> On 28 Mar 2006 14:35:13 -0800, martin.gerner (AT) gmail (DOT) com wrote, quoted
> or indirectly quoted someone who said :
>> Is there some way I can do this? Some of the encodings I suspect I will
>> come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
>> rule out that others are present.
> Nothing simple like an encoding field. See
> http://mindprod.com/projects/encodingidentification.html
> for some approaches.
Unfortunately, this didn't help me much. So I take it that there is no
nifty little class I can download that will do this detection for me?
To clarify, the files I will be working with are _not_ HTML or XML files,
but rather standard-text log files from IM clients.
/Martin Gerner

Thomas Weidenfeller (Guest)
Posted: Wed Mar 29, 2006 2:12 pm Post subject: Re: auto-detecting the character set encoding of a text file

martin.gerner (AT) gmail (DOT) com wrote:
> My problem is that I have a bunch of text files with various
> character-set encodings, and I need a method for detecting which
> encoding a given file uses (so that I can later open that file and
> begin reading from it using the correct encoding).
> Is there some way I can do this? Some of the encodings I suspect I will
> come across are UTF-8, windows-1252 and ISO-8859-15, although I cannot
> rule out that others are present.
You can't in a general way. You have to know the encodings to be sure.
You can apply some heuristics to guess an encoding. But it will be a guess.
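
As a rough illustration of what such heuristics might look like for the three encodings named in this thread, here is a minimal Java sketch. The class name `EncodingGuesser`, the order of the checks, and the choice of fallback are assumptions made for illustration; nobody in the thread posted this code. It also assumes the JRE ships the `windows-1252` and `ISO-8859-15` charsets, which is common but not guaranteed by the platform specification. The result is only a guess.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: guesses among UTF-8, windows-1252 and ISO-8859-15. */
final class EncodingGuesser {

    /** Returns a best guess for the charset of the given bytes, never a certainty. */
    static Charset guess(byte[] data) {
        // 1. A UTF-8 byte order mark (EF BB BF) is a strong hint.
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        // 2. Non-ASCII text in a single-byte encoding rarely happens to be
        //    well-formed UTF-8, so a strict decode that succeeds is good evidence.
        //    Pure ASCII also lands here, which is harmless: all three
        //    candidate encodings agree on the ASCII range.
        if (isValidUtf8(data)) {
            return StandardCharsets.UTF_8;
        }
        // 3. Bytes 0x80-0x9F are mostly printable characters in windows-1252
        //    (curly quotes, dashes, the euro sign) but control codes in ISO-8859-15.
        for (byte b : data) {
            int v = b & 0xFF;
            if (v >= 0x80 && v <= 0x9F) {
                return Charset.forName("windows-1252");
            }
        }
        // 4. No byte-level evidence left. In the range 0xA0-0xFF the two charsets
        //    differ in only eight positions, so this fallback is an arbitrary choice.
        return Charset.forName("ISO-8859-15");
    }

    /** Strict check: malformed input is reported instead of being replaced. */
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A caller could read the file with `Files.readAllBytes(path)` and then pass the guessed charset to `new String(bytes, charset)`. Telling windows-1252 from ISO-8859-15 when no byte in 0x80-0x9F occurs would need language-level evidence, such as character frequencies, which this sketch does not attempt.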
/Thomas
--
The comp.lang.java.gui FAQ:
ftp://ftp.cs.uu.nl/pub/NEWS.ANSWERS/computer-lang/java/gui/faq
http://www.uni-giessen.de/faq/archiv/computer-lang.java.gui.faq/

Roedy Green (Guest)
Posted: Wed Mar 29, 2006 6:12 pm Post subject: Re: auto-detecting the character set encoding of a text file

On Wed, 29 Mar 2006 13:13:40 +0000 (UTC), Martin Gerner
<martin.gerner (AT) nospam (DOT) com> wrote, quoted or indirectly quoted someone
who said :
> Unfortunately, this didn't help me much. So I take it that there is no
> nifty little class I can download that will do this detection for me?
Exactly. It is a messy problem.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.

Roedy Green (Guest)
Posted: Wed Mar 29, 2006 7:12 pm Post subject: Re: auto-detecting the character set encoding of a text file

On Wed, 29 Mar 2006 13:13:40 +0000 (UTC), Martin Gerner
<martin.gerner (AT) nospam (DOT) com> wrote, quoted or indirectly quoted someone
who said :
> To clarify, the files I will be working with are _not_ HTML or XML files,
> but rather standard-text log files from IM clients.
If you have control over the creation of these files, you could put
the name of the encoding at the front of the file, followed by a \n.
That would make your job much easier. Or you could tell everyone to use
UTF-8, which would make the problem disappear.
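
The header-line idea could look roughly like the following Java sketch. The class name `TaggedTextFile` and the exact header format (the charset name in ASCII, then a single `\n`) are assumptions for illustration; the thread does not define a concrete format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical file format: first line names the encoding of everything after it. */
final class TaggedTextFile {

    /** Produces the charset name in ASCII, a '\n', then the text in that charset. */
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] header = (charset.name() + "\n").getBytes(StandardCharsets.US_ASCII);
        out.write(header, 0, header.length);
        byte[] body = text.getBytes(charset);
        out.write(body, 0, body.length);
        return out.toByteArray();
    }

    /** Reads the header line, then decodes the remaining bytes with the named charset. */
    static String decode(byte[] data) {
        // Find the first newline byte; everything before it is the ASCII header.
        int newline = 0;
        while (newline < data.length && data[newline] != '\n') {
            newline++;
        }
        if (newline == data.length) {
            throw new IllegalArgumentException("missing encoding header line");
        }
        String name = new String(data, 0, newline, StandardCharsets.US_ASCII).trim();
        // Charset.forName throws an unchecked exception for unknown or illegal names.
        Charset charset = Charset.forName(name);
        return new String(data, newline + 1, data.length - newline - 1, charset);
    }
}
```

Because the header is plain ASCII and ends at the first `\n` byte, the reader can locate it before it knows the encoding of the body. The bytes would typically come from `Files.readAllBytes` and go back out through `Files.write`.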
You might also do it by tracking the source of the file. You figure
out manually which encoding each source uses over which date range.
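
The source-tracking approach amounts to a small lookup table keyed by source and date range. Here is a minimal Java sketch; the class name `SourceEncodingTable`, the idea of identifying a source by a string, and the inclusive date bounds are all assumptions for illustration, and the entries themselves would still have to be worked out by hand.

```java
import java.nio.charset.Charset;
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Hypothetical table: which encoding did a given source use in a given date range? */
final class SourceEncodingTable {

    private static final class Rule {
        final String source;
        final LocalDate from; // inclusive
        final LocalDate to;   // inclusive
        final Charset charset;

        Rule(String source, LocalDate from, LocalDate to, Charset charset) {
            this.source = source;
            this.from = from;
            this.to = to;
            this.charset = charset;
        }

        boolean matches(String candidate, LocalDate date) {
            return source.equals(candidate) && !date.isBefore(from) && !date.isAfter(to);
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Registers a manually determined fact; returns this table to allow chaining. */
    SourceEncodingTable add(String source, LocalDate from, LocalDate to, Charset charset) {
        rules.add(new Rule(source, from, to, charset));
        return this;
    }

    /** Returns the first matching rule's charset, or empty if nothing is known. */
    Optional<Charset> lookup(String source, LocalDate fileDate) {
        for (Rule rule : rules) {
            if (rule.matches(source, fileDate)) {
                return Optional.of(rule.charset);
            }
        }
        return Optional.empty();
    }
}
```

An empty result signals that the file needs manual inspection or a fallback guess, so an unknown source does not silently get a default encoding.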
The habit of not recording the encoding goes way back. The idea was
that documents were local and all encoded the same way. You did not
exchange documents with others, or if you did, you exchanged a whole
tape full, all encoded the same way, so again the problem of
identification did not come up.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.