AppletTalk.com Forum Index AppletTalk.com
Java discussions newsgroups
 
Archives   FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Loading big files into memory
Goto page 1, 2  Next
 
Post new topic   Reply to topic    AppletTalk.com Forum Index -> Java Help
View previous topic :: View next topic  
Author Message
kubek
Guest





PostPosted: Wed Apr 05, 2006 10:12 pm    Post subject: Loading big files into memory Reply with quote



Hi.
I am dealing with a problem - in my software I need to check whether
written by user string or set of strings is spelled correctly. I have a
complete dictionary (Polish, text file of size of over 30mb), in which
I can check whether entered string is correct (if it exists in file).
Problem is that I can't load such a big file into memory (over 2,6
million lines). I have also modified the maximum heap size (java
-Xms16M -Xmx256M myAppl) and it worked, but...I have checked in
TaskManager and observed, that my application takes over 180mb of
memory - which in my opinion is quite much Smile
Do You have any ideas how to change my text file (dictionary) - I don't
know, for example into binary file (if it makes sense), how to compress
or represent all 2,6 million strings in other way, that they all can be
loaded into memory in reasonable time and allocating reasonable amount
of memory?
If binary file is useful - how to "convert" my text file into it?
What is the best way to store it in memory - ArrayList, Collection or
something else?
Please help me, because it's urgent.
I need some ideas, key words, links where I can look for solution, or
any other help will be helpful Smile
Thanks in advance
Chris
Back to top
Monique Y. Mudama
Guest





PostPosted: Wed Apr 05, 2006 10:12 pm    Post subject: Re: Loading big files into memory Reply with quote



["Followup-To:" header set to comp.lang.java.help.] On 2006-04-05,
kubek penned:
Quote:
Ok, let's say this is some solution. But still accessing the file
each time doesn't seem to be most optimal. What for example in case
of a Scrabble game, when not only checking the file content against
given string must be performed, but also returning all strings, that
can be created from word already placed on board and some letters
from letter stand? Can You see my point? Chris

Sure, but that's not the question you originally asked.

For every solution, there will be a situation in which that approach
is suboptimal.

--
monique

Ask smart questions, get good answers:
http://www.catb.org/~esr/faqs/smart-questions.html
Back to top
kubek
Guest





PostPosted: Wed Apr 05, 2006 10:12 pm    Post subject: Re: Loading big files into memory Reply with quote



Ok, let's say this is some solution.
But still accessing the file each time doesn't seem to be most optimal.
What for example in case of a Scrabble game, when not only checking the
file content against given string must be performed, but also returning
all strings, that can be created from word already placed on board and
some letters from letter stand?
Can You see my point?
Chris
Back to top
Oliver Wong
Guest





PostPosted: Wed Apr 05, 2006 10:12 pm    Post subject: Re: Loading big files into memory Reply with quote

"kubek" <kubek1982 (AT) o2 (DOT) pl> wrote in message
news:1144272769.923853.5860 (AT) i40g2000cwc (DOT) googlegroups.com...
Quote:
Hi.
I am dealing with a problem - in my software I need to check whether
written by user string or set of strings is spelled correctly. I have a
complete dictionary (Polish, text file of size of over 30mb), in which
I can check whether entered string is correct (if it exists in file).
Problem is that I can't load such a big file into memory (over 2,6
million lines). I have also modified the maximum heap size (java
-Xms16M -Xmx256M myAppl) and it worked, but...I have checked in
TaskManager and observed, that my application takes over 180mb of
memory - which in my opinion is quite much Smile
Do You have any ideas how to change my text file (dictionary) - I don't
know, for example into binary file (if it makes sense), how to compress
or represent all 2,6 million strings in other way, that they all can be
loaded into memory in reasonable time and allocating reasonable amount
of memory?
If binary file is useful - how to "convert" my text file into it?
What is the best way to store it in memory - ArrayList, Collection or
something else?
Please help me, because it's urgent.
I need some ideas, key words, links where I can look for solution, or
any other help will be helpful Smile
Thanks in advance

Sort your dictionary file. Instead of loading the entire dictionary
file, do a sort of binary search through the file to see if the word you're
checking against exists in there. Don't actually ever load the whole file in
memory at once. A clever OS should detect your read patterns and cache parts
of the file when possible.

- Oliver
Back to top
kubek
Guest





PostPosted: Wed Apr 05, 2006 11:12 pm    Post subject: Re: Loading big files into memory Reply with quote

Ok - my fault. Maybe I asked the question incorrectly or not precisely
enough.
However, I am more interested in solving the problem mentioned later:

Quote:
But still accessing the file
each time doesn't seem to be most optimal. What for example in case
of a Scrabble game, when not only checking the file content against
given string must be performed, but also returning all strings, that
can be created from word already placed on board and some letters
from letter stand? Can You see my point?

Summarizing:
What to do with my over 30mb text file, so it can be stored in memory,
so I can search through its content as often as I want?
Is there any way to "squeze" my text file to obtain smaller file
without increasing cost of search operation?
What is the best structure to keep it in memory?

Please, help.
Sorry for not being precise enough.
Chris
Back to top
kubek
Guest





PostPosted: Wed Apr 05, 2006 11:12 pm    Post subject: Re: Loading big files into memory Reply with quote

Ok - my fault. Maybe I asked the question incorrectly or not precisely
enough.
However, I am more interested in solving the problem mentioned later:

Quote:
But still accessing the file
each time doesn't seem to be most optimal. What for example in case
of a Scrabble game, when not only checking the file content against
given string must be performed, but also returning all strings, that
can be created from word already placed on board and some letters
from letter stand? Can You see my point?

Summarizing:
What to do with my over 30mb text file, so it can be stored in memory,
so I can search through its content as often as I want?
Is there any way to "squeze" my text file to obtain smaller file
without increasing cost of search operation?
What is the best structure to keep it in memory?

Please, help.
Sorry for not being precise enough.
Chris
Back to top
Eric Sosman
Guest





PostPosted: Wed Apr 05, 2006 11:12 pm    Post subject: Re: Loading big files into memory Reply with quote

kubek wrote On 04/05/06 17:53,:
Quote:
Ok, let's say this is some solution.
But still accessing the file each time doesn't seem to be most optimal.
What for example in case of a Scrabble game, when not only checking the
file content against given string must be performed, but also returning
all strings, that can be created from word already placed on board and
some letters from letter stand?

Funny you should mention Scrabble ... Look for

"The world's fastest Scrabble program," A.W. Appel
and G.J. Jacobson, Communications of the ACM Volume
31, Issue 5 (May 1988)

.... which you can find on-line if hard copy isn't easy to
come by. Their program used a data structure they called
a Directed Acyclic Word Graph, or DAWG, that allowed them
to store a large dictionary in a small space and search it
rapidly.

How "small" and how "rapidly?" I don't recall the values
from their article, but I decided to implement their method
for my own amusement and was able to produce a program that
played Scrabble better than I do. Its dictionary held about
twenty-five or thirty thousand words, amounting to perhaps a
quarter megabyte in raw form.

You may well be raising your eyebrows at numbers like
"thirty thousand" and "a quarter meg," and thinking that this
DAWG thing isn't impressive at all. But you haven't heard
the punch line yet: The machine I used had 64kB of memory
(with about 10kB taken by the O/S), an eight-bit CPU operating
at 4MHz, and two floppy drives accommodating 390kB each -- and
that's all. The DAWG was good enough to let me fit my quarter
megabyte into less than one fifth its raw size and still search
it quickly, so I think it qualifies as top DAWG.

--
Eric.Sosman (AT) sun (DOT) com
Back to top
trahan
Guest





PostPosted: Wed Apr 05, 2006 11:12 pm    Post subject: Re: Loading big files into memory Reply with quote

Look at B trees.
Back to top
Roedy Green
Guest





PostPosted: Thu Apr 06, 2006 1:12 am    Post subject: Re: Loading big files into memory Reply with quote

On 5 Apr 2006 14:32:51 -0700, "kubek" <kubek1982 (AT) o2 (DOT) pl> wrote, quoted
or indirectly quoted someone who said :

Quote:
Do You have any ideas how to change my text file (dictionary) - I don't
know, for example into binary file (if it makes sense), how to compress
or represent all 2,6 million strings in other way, that they all can be
loaded into memory in

There is another way to do spell checking that was popular back in the
DOS days before we had enough RAM to wash our feet in.

You sort your dictionary and your text file (of word/offset pairs)
alphabetically. Then merge them with a sequential pass through both.
You create a file of errors which are just the offset of the word in
the document. Then sort that by offset and merge back with the
original document.

You may need an external sort. See
http://mindprod.com/jgloss/sort.html


Other than the sort, you need only a few K of RAM for the two buffers.


--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Back to top
Chris Uppal
Guest





PostPosted: Thu Apr 06, 2006 10:12 am    Post subject: Re: Loading big files into memory Reply with quote

kubek wrote:

Quote:
I am dealing with a problem - in my software I need to check whether
written by user string or set of strings is spelled correctly. I have a
complete dictionary (Polish, text file of size of over 30mb), in which
I can check whether entered string is correct (if it exists in file).
Problem is that I can't load such a big file into memory (over 2,6
million lines).

One simple first step would be to represent your text data as binary. The
chances are that you don't need full Unicode support, but can represent all
your character data with 1 byte per character (using the appropriate code page
(Windows 1250 or ISO 8859-2 perhaps).

Beyond that, you should consider the algorithm you are using. You don't state
what you are currently doing so it's not possible to suggest improvements, but
there are many clever algorithms available. You might want to look at Bentley
and Sedwick's paper "Fast algorithms for sorting and searching strings" (which
you'll find only at http://www.cs.princeton.edu/~rs/strings/) for one example.

-- chris
Back to top
Ian Wilson
Guest





PostPosted: Thu Apr 06, 2006 3:12 pm    Post subject: Re: Loading big files into memory Reply with quote

kubek wrote:
Quote:
Hi.
I am dealing with a problem - in my software I need to check whether
written by user string or set of strings is spelled correctly. I have a
complete dictionary (Polish, text file of size of over 30mb), in which
I can check whether entered string is correct (if it exists in file).
Problem is that I can't load such a big file into memory (over 2,6
million lines). I have also modified the maximum heap size (java
-Xms16M -Xmx256M myAppl) and it worked, but...I have checked in
TaskManager and observed, that my application takes over 180mb of
memory - which in my opinion is quite much Smile
Do You have any ideas how to change my text file (dictionary) - I don't
know, for example into binary file (if it makes sense), how to compress
or represent all 2,6 million strings in other way, that they all can be
loaded into memory in reasonable time and allocating reasonable amount
of memory?
If binary file is useful - how to "convert" my text file into it?
What is the best way to store it in memory - ArrayList, Collection or
something else?
Please help me, because it's urgent.
I need some ideas, key words, links where I can look for solution, or
any other help will be helpful Smile

30MB seems quite large for a word-list:

$ ls -l /usr/share/dict/linux.words
-rw-r--r-- 1 root root 409305 Jun 24 2002
/usr/share/dict/linux.words

$ wc -l /usr/share/dict/linux.words
45427 /usr/share/dict/linux.words

So 45427 English words fits into under half a MB.

Does your dictionary include definitions? If so, maybe you can discard
those out as they will not be needed for spell-checking.
Back to top
kubek
Guest





PostPosted: Thu Apr 06, 2006 8:12 pm    Post subject: Re: Loading big files into memory Reply with quote

No, there are no definitions in the file.
The difference is that my dictionary contains over 2.5 million words,
so it is much more than 45427 words. The problem in Polish language is
that there is lots of exceptions and inflections - that is why there
are so many position in the file.
Maybe changing it into binary file instead of character file would be a
good idea, but I have no clue how to achieve this ;(
Can anyone help me with that?
Chris
Back to top
Oliver Wong
Guest





PostPosted: Thu Apr 06, 2006 9:12 pm    Post subject: Re: Loading big files into memory Reply with quote

"kubek" <kubek1982 (AT) o2 (DOT) pl> wrote in message
news:1144353564.297702.278290 (AT) g10g2000cwb (DOT) googlegroups.com...
Quote:
No, there are no definitions in the file.
The difference is that my dictionary contains over 2.5 million words,
so it is much more than 45427 words. The problem in Polish language is
that there is lots of exceptions and inflections - that is why there
are so many position in the file.
Maybe changing it into binary file instead of character file would be a
good idea, but I have no clue how to achieve this ;(
Can anyone help me with that?

Didn't you see Eric Sosman's post? What's wrong with the DAWG he
described in his post?

- Oliver
Back to top
Roedy Green
Guest





PostPosted: Thu Apr 06, 2006 10:12 pm    Post subject: Re: Loading big files into memory Reply with quote

On 6 Apr 2006 12:59:24 -0700, "kubek" <kubek1982 (AT) o2 (DOT) pl> wrote, quoted
or indirectly quoted someone who said :

Quote:
Maybe changing it into binary file instead of character file would be a

The very best you would get in a halving in size, if you flipped from
16-bit chars to 8-bit polish encoding ISO-8859-2

See http://mindprod.com/jgloss/encoding.html
and
http://mindprod.com/jgloss/fileio.html

Dictionaries traditionally use clever encodings so that two words
beginning with the root does not repeat the root.


--
Canadian Mind Products, Roedy Green.
http://mindprod.com Java custom programming, consulting and coaching.
Back to top
Guest






PostPosted: Thu Apr 06, 2006 11:12 pm    Post subject: Re: Loading big files into memory Reply with quote

kubek wrote:
Quote:
No, there are no definitions in the file.
The difference is that my dictionary contains over 2.5 million words,
so it is much more than 45427 words. The problem in Polish language is
that there is lots of exceptions and inflections - that is why there
are so many position in the file.
Maybe changing it into binary file instead of character file would be a
good idea...

Indeed...

I wrote a custom spellchecker in Java years ago. Using a very basic
Huffman encoding / binary tree I got down to 22 bits per word,
including very fast lookup tables (accesible by binary search). With
such an encoding you could fit your whole dictionary in memory.

Using such an encoding would probably get you below 7 Mb of
memory for your 2.5 million words (I'd guess there are much
more similarities in you 2.5 million words than in the
french/english dictionaries I used, so you'd probably need even
a little less than that).

That was for the efficient data structure (dwarfing by order of
magnitude anything based on String objects). Note that this is not
anywhere near a record... Much more sophisticated
algorithms can represent dictionary words using less than
10 bits per word.

For the spellchecking part I used several techniques (including, but
not limited to: double-metaphones, Levenhstein's edit distance, etc.).

I got a huge french dictionary entirely fitting in around 2 Mb... Way
smaller and faster than anything Jazzy (open source spellchecker for
Java) could ever achieve back in the days (didn't check up Jazzy since
then to see if it was still so bad/ memory hungry/ slow/ etc.).


Quote:
Can anyone help me with that?

Eric Sosman proposed you a link for a paper describing an
algorithm very efficient memory-wise and performance-wise
for a scrabble-type application.

If you want simply a spellchecker (or a scrabble game based
on a simple spellchecker, which must be doable even if not
as efficient as the nice technique proposed by Eric), a basic
Huffman encoded binary tree + lookup tables will do too.

In any case you'll have to do some coding and some "low-level"
bit manipulation.

For such a purpose you ain't gonna achieve reasonable memory
usage / loading time / etc. using String objects / regular charset
encodings / etc.

So my "help" would be to recommend using basic algorithms and
data structures to solve your problem.

Dictionary encoding / representation is a common problem and
Google will help you find lots of information on the subject.

If you're really into developing a Scrabble game Eric already
gave you a link and I'm sure Google could also help locate
additional information.

Alex


P.S: if you're controlling the systems where your program is
going to be deployed (eg an online Scrabble game where the
logic would be done on the server-side on some Un*x system),
you could also "cheat" by simply wrapping calls to ispell/aspell
for the (fast/memory efficient) spellchecking functionalities of
your app.
Back to top
Display posts from previous:   
Post new topic   Reply to topic    AppletTalk.com Forum Index -> Java Help All times are GMT
Goto page 1, 2  Next
Page 1 of 2

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2006 phpBB Group
SEO toolkit © 2004-2006 webmedic.