 |
AppletTalk.com Java discussions newsgroups
|
| View previous topic :: View next topic |
| Author |
Message |
kubek Guest
|
Posted: Wed Apr 05, 2006 10:12 pm Post subject: Loading big files into memory |
|
|
Hi.
I am dealing with a problem - in my software I need to check whether
written by user string or set of strings is spelled correctly. I have a
complete dictionary (Polish, text file of size of over 30mb), in which
I can check whether entered string is correct (if it exists in file).
Problem is that I can't load such a big file into memory (over 2,6
million lines). I have also modified the maximum heap size (java
-Xms16M -Xmx256M myAppl) and it worked, but...I have checked in
TaskManager and observed, that my application takes over 180mb of
memory - which in my opinion is quite much
Do You have any ideas how to change my text file (dictionary) - I don't
know, for example into binary file (if it makes sense), how to compress
or represent all 2,6 million strings in other way, that they all can be
loaded into memory in reasonable time and allocating reasonable amount
of memory?
If binary file is useful - how to "convert" my text file into it?
What is the best way to store it in memory - ArrayList, Collection or
something else?
Please help me, because it's urgent.
I need some ideas, key words, links where I can look for solution, or
any other help will be helpful
Thanks in advance
Chris |
|
| Back to top |
|
 |
Monique Y. Mudama Guest
|
Posted: Wed Apr 05, 2006 10:12 pm Post subject: Re: Loading big files into memory |
|
|
["Followup-To:" header set to comp.lang.java.help.] On 2006-04-05,
kubek penned:
| Quote: | Ok, let's say this is some solution. But still accessing the file
each time doesn't seem to be most optimal. What for example in case
of a Scrabble game, when not only checking the file content against
given string must be performed, but also returning all strings, that
can be created from word already placed on board and some letters
from letter stand? Can You see my point? Chris
|
Sure, but that's not the question you originally asked.
For every solution, there will be a situation in which that approach
is suboptimal.
--
monique
Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html |
|
| Back to top |
|
 |
kubek Guest
|
Posted: Wed Apr 05, 2006 10:12 pm Post subject: Re: Loading big files into memory |
|
|
Ok, let's say this is some solution.
But still accessing the file each time doesn't seem to be most optimal.
What for example in case of a Scrabble game, when not only checking the
file content against given string must be performed, but also returning
all strings, that can be created from word already placed on board and
some letters from letter stand?
Can You see my point?
Chris |
|
| Back to top |
|
 |
Oliver Wong Guest
|
Posted: Wed Apr 05, 2006 10:12 pm Post subject: Re: Loading big files into memory |
|
|
"kubek" <kubek1982 (AT) o2 (DOT) pl> wrote in message
news:1144272769.923853.5860 (AT) i40g2000cwc (DOT) googlegroups.com...
| Quote: | Hi.
I am dealing with a problem - in my software I need to check whether
written by user string or set of strings is spelled correctly. I have a
complete dictionary (Polish, text file of size of over 30mb), in which
I can check whether entered string is correct (if it exists in file).
Problem is that I can't load such a big file into memory (over 2,6
million lines). I have also modified the maximum heap size (java
-Xms16M -Xmx256M myAppl) and it worked, but...I have checked in
TaskManager and observed, that my application takes over 180mb of
memory - which in my opinion is quite much
Do You have any ideas how to change my text file (dictionary) - I don't
know, for example into binary file (if it makes sense), how to compress
or represent all 2,6 million strings in other way, that they all can be
loaded into memory in reasonable time and allocating reasonable amount
of memory?
If binary file is useful - how to "convert" my text file into it?
What is the best way to store it in memory - ArrayList, Collection or
something else?
Please help me, because it's urgent.
I need some ideas, key words, links where I can look for solution, or
any other help will be helpful
Thanks in advance
|
Sort your dictionary file. Instead of loading the entire dictionary
file, do a sort of binary search through the file to see if the word you're
checking against exists in there. Don't actually ever load the whole file in
memory at once. A clever OS should detect your read patterns and cache parts
of the file when possible.
- Oliver |
|
| Back to top |
|
 |
kubek Guest
|
Posted: Wed Apr 05, 2006 11:12 pm Post subject: Re: Loading big files into memory |
|
|
Ok - my fault. Maybe I asked the question incorrectly or not precisely
enough.
However, I am more interested in solving the problem mentioned later:
| Quote: | But still accessing the file
each time doesn't seem to be most optimal. What for example in case
of a Scrabble game, when not only checking the file content against
given string must be performed, but also returning all strings, that
can be created from word already placed on board and some letters
from letter stand? Can You see my point?
|
Summarizing:
What to do with my over 30mb text file, so it can be stored in memory,
so I can search through its content as often as I want?
Is there any way to "squeze" my text file to obtain smaller file
without increasing cost of search operation?
What is the best structure to keep it in memory?
Please, help.
Sorry for not being precise enough.
Chris |
|
| Back to top |
|
 |
kubek Guest
|
Posted: Wed Apr 05, 2006 11:12 pm Post subject: Re: Loading big files into memory |
|
|
Ok - my fault. Maybe I asked the question incorrectly or not precisely
enough.
However, I am more interested in solving the problem mentioned later:
| Quote: | But still accessing the file
each time doesn't seem to be most optimal. What for example in case
of a Scrabble game, when not only checking the file content against
given string must be performed, but also returning all strings, that
can be created from word already placed on board and some letters
from letter stand? Can You see my point?
|
Summarizing:
What to do with my over 30mb text file, so it can be stored in memory,
so I can search through its content as often as I want?
Is there any way to "squeze" my text file to obtain smaller file
without increasing cost of search operation?
What is the best structure to keep it in memory?
Please, help.
Sorry for not being precise enough.
Chris |
|
| Back to top |
|
 |
Eric Sosman Guest
|
Posted: Wed Apr 05, 2006 11:12 pm Post subject: Re: Loading big files into memory |
|
|
kubek wrote On 04/05/06 17:53,:
| Quote: | Ok, let's say this is some solution.
But still accessing the file each time doesn't seem to be most optimal.
What for example in case of a Scrabble game, when not only checking the
file content against given string must be performed, but also returning
all strings, that can be created from word already placed on board and
some letters from letter stand?
|
Funny you should mention Scrabble ... Look for
"The world's fastest Scrabble program," A.W. Appel
and G.J. Jacobson, Communications of the ACM Volume
31, Issue 5 (May 1988)
.... which you can find on-line if hard copy isn't easy to
come by. Their program used a data structure they called
a Directed Acyclic Word Graph, or DAWG, that allowed them
to store a large dictionary in a small space and search it
rapidly.
How "small" and how "rapidly?" I don't recall the values
from their article, but I decided to implement their method
for my own amusement and was able to produce a program that
played Scrabble better than I do. Its dictionary held about
twenty-five or thirty thousand words, amounting to perhaps a
quarter megabyte in raw form.
You may well be raising your eyebrows at numbers like
"thirty thousand" and "a quarter meg," and thinking that this
DAWG thing isn't impressive at all. But you haven't heard
the punch line yet: The machine I used had 64kB of memory
(with about 10kB taken by the O/S), an eight-bit CPU operating
at 4MHz, and two floppy drives accommodating 390kB each -- and
that's all. The DAWG was good enough to let me fit my quarter
megabyte into less than one fifth its raw size and still search
it quickly, so I think it qualifies as top DAWG.
--
Eric.Sosman (AT) sun (DOT) com |
|
| Back to top |
|
 |
trahan Guest
|
Posted: Wed Apr 05, 2006 11:12 pm Post subject: Re: Loading big files into memory |
|
|
| Look at B trees. |
|
| Back to top |
|
 |
Roedy Green Guest
|
Posted: Thu Apr 06, 2006 1:12 am Post subject: Re: Loading big files into memory |
|
|
On 5 Apr 2006 14:32:51 -0700, "kubek" <kubek1982 (AT) o2 (DOT) pl> wrote, quoted
or indirectly quoted someone who said :
| Quote: | Do You have any ideas how to change my text file (dictionary) - I don't
know, for example into binary file (if it makes sense), how to compress
or represent all 2,6 million strings in other way, that they all can be
loaded into memory in
|
There is another way to do spell checking that was popular back in the
DOS days before we had enough RAM to wash our feet in.
You sort your dictionary and your text file (of word/offset pairs)
alphabetically. Then merge them with a sequential pass through both.
You create a file of errors which are just the offset of the word in
the document. Then sort that by offset and merge back with the
original document.
You may need an external sort. See
http://mindprod.com/jgloss/sort.html
Other than the sort, you need only a few K of RAM for the two buffers.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching. |
|
| Back to top |
|
 |
Chris Uppal Guest
|
Posted: Thu Apr 06, 2006 10:12 am Post subject: Re: Loading big files into memory |
|
|
kubek wrote:
| Quote: | I am dealing with a problem - in my software I need to check whether
written by user string or set of strings is spelled correctly. I have a
complete dictionary (Polish, text file of size of over 30mb), in which
I can check whether entered string is correct (if it exists in file).
Problem is that I can't load such a big file into memory (over 2,6
million lines).
|
One simple first step would be to represent your text data as binary. The
chances are that you don't need full Unicode support, but can represent all
your character data with 1 byte per character (using the appropriate code page
(Windows 1250 or ISO 8859-2 perhaps).
Beyond that, you should consider the algorithm you are using. You don't state
what you are currently doing so it's not possible to suggest improvements, but
there are many clever algorithms available. You might want to look at Bentley
and Sedwick's paper "Fast algorithms for sorting and searching strings" (which
you'll find only at http://www.cs.princeton.edu/~rs/strings/) for one example.
-- chris |
|
| Back to top |
|
 |
Ian Wilson Guest
|
Posted: Thu Apr 06, 2006 3:12 pm Post subject: Re: Loading big files into memory |
|
|
kubek wrote:
| Quote: | Hi.
I am dealing with a problem - in my software I need to check whether
written by user string or set of strings is spelled correctly. I have a
complete dictionary (Polish, text file of size of over 30mb), in which
I can check whether entered string is correct (if it exists in file).
Problem is that I can't load such a big file into memory (over 2,6
million lines). I have also modified the maximum heap size (java
-Xms16M -Xmx256M myAppl) and it worked, but...I have checked in
TaskManager and observed, that my application takes over 180mb of
memory - which in my opinion is quite much
Do You have any ideas how to change my text file (dictionary) - I don't
know, for example into binary file (if it makes sense), how to compress
or represent all 2,6 million strings in other way, that they all can be
loaded into memory in reasonable time and allocating reasonable amount
of memory?
If binary file is useful - how to "convert" my text file into it?
What is the best way to store it in memory - ArrayList, Collection or
something else?
Please help me, because it's urgent.
I need some ideas, key words, links where I can look for solution, or
any other help will be helpful
|
30MB seems quite large for a word-list:
$ ls -l /usr/share/dict/linux.words
-rw-r--r-- 1 root root 409305 Jun 24 2002
/usr/share/dict/linux.words
$ wc -l /usr/share/dict/linux.words
45427 /usr/share/dict/linux.words
So 45427 English words fits into under half a MB.
Does your dictionary include definitions? If so, maybe you can discard
those out as they will not be needed for spell-checking. |
|
| Back to top |
|
 |
kubek Guest
|
Posted: Thu Apr 06, 2006 8:12 pm Post subject: Re: Loading big files into memory |
|
|
No, there are no definitions in the file.
The difference is that my dictionary contains over 2.5 million words,
so it is much more than 45427 words. The problem in Polish language is
that there is lots of exceptions and inflections - that is why there
are so many position in the file.
Maybe changing it into binary file instead of character file would be a
good idea, but I have no clue how to achieve this ;(
Can anyone help me with that?
Chris |
|
| Back to top |
|
 |
Oliver Wong Guest
|
Posted: Thu Apr 06, 2006 9:12 pm Post subject: Re: Loading big files into memory |
|
|
"kubek" <kubek1982 (AT) o2 (DOT) pl> wrote in message
news:1144353564.297702.278290 (AT) g10g2000cwb (DOT) googlegroups.com...
| Quote: | No, there are no definitions in the file.
The difference is that my dictionary contains over 2.5 million words,
so it is much more than 45427 words. The problem in Polish language is
that there is lots of exceptions and inflections - that is why there
are so many position in the file.
Maybe changing it into binary file instead of character file would be a
good idea, but I have no clue how to achieve this ;(
Can anyone help me with that?
|
Didn't you see Eric Sosman's post? What's wrong with the DAWG he
described in his post?
- Oliver |
|
| Back to top |
|
 |
Roedy Green Guest
|
Posted: Thu Apr 06, 2006 10:12 pm Post subject: Re: Loading big files into memory |
|
|
On 6 Apr 2006 12:59:24 -0700, "kubek" <kubek1982 (AT) o2 (DOT) pl> wrote, quoted
or indirectly quoted someone who said :
| Quote: | Maybe changing it into binary file instead of character file would be a
|
The very best you would get in a halving in size, if you flipped from
16-bit chars to 8-bit polish encoding ISO-8859-2
See http://mindprod.com/jgloss/encoding.html
and
http://mindprod.com/jgloss/fileio.html
Dictionaries traditionally use clever encodings so that two words
beginning with the root does not repeat the root.
--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching. |
|
| Back to top |
|
 |
Guest
|
Posted: Thu Apr 06, 2006 11:12 pm Post subject: Re: Loading big files into memory |
|
|
kubek wrote:
| Quote: | No, there are no definitions in the file.
The difference is that my dictionary contains over 2.5 million words,
so it is much more than 45427 words. The problem in Polish language is
that there is lots of exceptions and inflections - that is why there
are so many position in the file.
Maybe changing it into binary file instead of character file would be a
good idea...
|
Indeed...
I wrote a custom spellchecker in Java years ago. Using a very basic
Huffman encoding / binary tree I got down to 22 bits per word,
including very fast lookup tables (accesible by binary search). With
such an encoding you could fit your whole dictionary in memory.
Using such an encoding would probably get you below 7 Mb of
memory for your 2.5 million words (I'd guess there are much
more similarities in you 2.5 million words than in the
french/english dictionaries I used, so you'd probably need even
a little less than that).
That was for the efficient data structure (dwarfing by order of
magnitude anything based on String objects). Note that this is not
anywhere near a record... Much more sophisticated
algorithms can represent dictionary words using less than
10 bits per word.
For the spellchecking part I used several techniques (including, but
not limited to: double-metaphones, Levenhstein's edit distance, etc.).
I got a huge french dictionary entirely fitting in around 2 Mb... Way
smaller and faster than anything Jazzy (open source spellchecker for
Java) could ever achieve back in the days (didn't check up Jazzy since
then to see if it was still so bad/ memory hungry/ slow/ etc.).
| Quote: | Can anyone help me with that?
|
Eric Sosman proposed you a link for a paper describing an
algorithm very efficient memory-wise and performance-wise
for a scrabble-type application.
If you want simply a spellchecker (or a scrabble game based
on a simple spellchecker, which must be doable even if not
as efficient as the nice technique proposed by Eric), a basic
Huffman encoded binary tree + lookup tables will do too.
In any case you'll have to do some coding and some "low-level"
bit manipulation.
For such a purpose you ain't gonna achieve reasonable memory
usage / loading time / etc. using String objects / regular charset
encodings / etc.
So my "help" would be to recommend using basic algorithms and
data structures to solve your problem.
Dictionary encoding / representation is a common problem and
Google will help you find lots of information on the subject.
If you're really into developing a Scrabble game Eric already
gave you a link and I'm sure Google could also help locate
additional information.
Alex
P.S: if you're controlling the systems where your program is
going to be deployed (eg an online Scrabble game where the
logic would be done on the server-side on some Un*x system),
you could also "cheat" by simply wrapping calls to ispell/aspell
for the (fast/memory efficient) spellchecking functionalities of
your app. |
|
| Back to top |
|
 |
|
|
You cannot post new topics in this forum You cannot reply to topics in this forum You cannot edit your posts in this forum You cannot delete your posts in this forum You cannot vote in polls in this forum
|
|