AppletTalk.com Forum Index AppletTalk.com
Java discussions newsgroups
 

Keyword extractor's source code....where I can find it???

 
giugy
Guest





Posted: Thu Jan 11, 2007 9:59 pm    Post subject: Keyword extractor's source code....where I can find it???



Hi, sorry for my English; I don't speak it very well.

Does anyone know where I can find the source code of a keyword
extractor written in Java? I mean a program that analyzes a text and
extracts its keywords, that is, the most frequent words in the text
(for example, the word "hello" occurs forty times, the word "thanks"
occurs thirty times...).

I need to see the Java source code of such a program in order to
understand how it works.

Thanks, bye
glen herrmannsfeldt
Guest





Posted: Fri Jan 12, 2007 7:57 am    Post subject: Re: Keyword extractor's source code....where I can find it??



giugy wrote:

Quote:
Does anyone know where I can find the source code of a keyword
extractor written in Java? I mean a program that analyzes a text and
extracts its keywords, that is, the most frequent words in the text
(for example, the word "hello" occurs forty times, the word "thanks"
occurs thirty times...).

I need to see the Java source code of such a program in order to
understand how it works.

It is very easy to write in Java.

First read a line and extract words using StringTokenizer. Then
use a Hashtable to find out if you have seen that word before.
If so, increment a counter. If not, add it to the Hashtable with
a count of 1. I store a long[] in the hashtable for convenience
in incrementing, but others will do something different.

One trick, though. After you extract a word with StringTokenizer and
find it is not in the table, create a new String and store that
reference in the hash table. If you don't, it will take up too much
memory, as the whole line of characters is kept for each word.

After you finish reading the file, go through the Hashtable,
extract words and counts, and print them out.

It should not take long at all to write.
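
The steps described above can be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names (WordFrequency, countWords) are made up, the text is passed in as an array of lines instead of being read from a file, words are lowercased, and a HashMap stands in for the Hashtable. The new String(...) copy follows the advice above; on recent JVMs it is probably no longer needed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper class, not taken from any library.
class WordFrequency {
    // Counts how often each whitespace-separated word occurs in the lines.
    // The count lives in a one-element long[] so it can be incremented in place.
    static Map<String, long[]> countWords(String[] lines) {
        Map<String, long[]> counts = new HashMap<String, long[]>();
        for (String line : lines) {
            StringTokenizer tokens = new StringTokenizer(line);
            while (tokens.hasMoreTokens()) {
                String word = tokens.nextToken().toLowerCase();
                long[] counter = counts.get(word);
                if (counter == null) {
                    // First sighting: store a fresh copy of the word with a count of 1.
                    counts.put(new String(word), new long[] { 1 });
                } else {
                    // Seen before: increment the stored counter.
                    counter[0]++;
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = { "java function java", "library function java" };
        Map<String, long[]> counts = countWords(lines);
        // Print "count word" pairs; HashMap iteration order is unspecified.
        for (Map.Entry<String, long[]> e : counts.entrySet()) {
            System.out.println(e.getValue()[0] + " " + e.getKey());
        }
    }
}
```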

-- glen
giugy
Guest





Posted: Tue Jan 16, 2007 11:20 pm    Post subject: Re: Keyword extractor's source code....where I can find it??



Yes, I have found some code like this:

import java.io.*;
import java.util.*;

class Counter implements Comparable {
    private String word;
    private int count;
    public Counter(String word) {
        this.word = word;
        count = 1;
    }
    public void increment() { count++; }
    public String toString() {
        return "\n" + word + " [" + count + "]";
    }
    public boolean equals(Object obj) {
        return obj instanceof Counter &&
            ((Counter)obj).word.equals(word);
    }
    public int hashCode() {
        return word.hashCode();
    }
    public int compareTo(Object o) {
        return word.compareTo(((Counter)o).word);
    }
}

class CounterSet extends AbstractSet {
    private Map set = new TreeMap();
    public void addOrIncrement(String s) {
        Counter c = new Counter(s);
        if (set.containsKey(c))
            ((Counter)set.get(c)).increment();
        else
            set.put(c, c);
    }
    public Iterator iterator() {
        return set.keySet().iterator();
    }
    public int size() {
        return set.size();
    }
    public String toString() {
        return set.keySet().toString();
    }
}

class WordCount {
    private FileReader file;
    private StreamTokenizer st;

    private CounterSet counts = new CounterSet();
    WordCount(String filename)
            throws FileNotFoundException {
        try {
            file = new FileReader(filename);
            st = new StreamTokenizer(
                new BufferedReader(file));
            st.ordinaryChar('.');
            st.ordinaryChar('-');
            st.lowerCaseMode(true);

        } catch(FileNotFoundException e) {
            System.err.println(
                "Could not open " + filename);
            throw e;
        }
    }
    void cleanup() {
        try {
            file.close();
        } catch(IOException e) {
            System.err.println(
                "file.close() unsuccessful");
        }
    }
    void countWords() {
        try {
            while(st.nextToken() !=
                    StreamTokenizer.TT_EOF) {
                String s = "a";
                switch(st.ttype) {
                    case StreamTokenizer.TT_EOL:
                        s = new String("EOL");
                        break;

                    case StreamTokenizer.TT_NUMBER:
                        // s = Double.toString(st.nval);
                        break;

                    case StreamTokenizer.TT_WORD:
                        s = st.sval;
                        break;
                    default: // single character in ttype
                        s = String.valueOf((char)st.ttype);
                }

                if(s.length() > 3)
                    counts.addOrIncrement(s);
            }
        } catch(IOException e) {
            System.err.println(
                "st.nextToken() unsuccessful");
        }
    }
    public Iterator iterator() {
        return counts.iterator();
    }
    public String toString() {
        return counts.toString();
    }
}

public class KeyWordExtractor {
    public static void main(String[] args)
            throws FileNotFoundException {
        for(int i = 0; i < args.length; i++){
            WordCount wc = new WordCount(args[i]);
            wc.countWords();
            System.out.println("WORD = " + wc);
            wc.cleanup();
        }
    }
}


It gives me the number of occurrences of every word in the text. For
example, if I give it an input text like (a silly example) "java
function java library function java", the output is WORD =
[function[2] , java[3] , library[1]], which are the occurrences of
each word in the text. My problem is that I don't need all the words
of the text in the output, only the words that appear many times in
the text. In this case that is java, which is the keyword of the
text: WORD = [java]

I know there is only a little code left to write, but I don't know
Java well and I haven't managed to write it.
Please help me. Thanks!

glen herrmannsfeldt
Guest





Posted: Wed Jan 17, 2007 8:11 am    Post subject: Re: Keyword extractor's source code....where I can find it??

giugy wrote:
Quote:
Yes, I have found some code like this:

import java.io.*;
import java.util.*;

class Counter implements Comparable {
    private String word;
    private int count;
    public Counter(String word) {
        this.word = word;
        count = 1;
    }
    public void increment() { count++; }
    public String toString() {
        return "\n" + word + " [" + count + "]";

Change this to:

return count=" "+word;

Then the output will be a list of counts, each followed by its word,
and can be used as input to the unix command

sort -rn unsortedfile > sortedfile

which will output the list with the most common word first.
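
The same ordering can also be produced inside Java instead of piping the output through sort. The following is only a sketch under assumptions: the class and method names (TopWords, byFrequency) are hypothetical, and the counts are assumed to be already available in a Map from word to count rather than in the Counter objects from the posted code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of the posted program.
class TopWords {
    // Returns "count word" lines ordered by descending count,
    // with ties broken alphabetically by word.
    static List<String> byFrequency(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> a,
                               Map.Entry<String, Integer> b) {
                int byCount = b.getValue().compareTo(a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        List<String> lines = new ArrayList<String>();
        for (Map.Entry<String, Integer> e : entries) {
            lines.add(e.getValue() + " " + e.getKey());
        }
        return lines;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        counts.put("function", 2);
        counts.put("java", 3);
        counts.put("library", 1);
        // Prints "3 java", "2 function", "1 library", one per line.
        for (String line : byFrequency(counts)) {
            System.out.println(line);
        }
    }
}
```

With the list ordered this way, the first entry is the most frequent word, and keeping only the first few entries gives a short list of keyword candidates.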



(snip)

-- glen
giugy
Guest





Posted: Wed Jan 17, 2007 3:15 pm    Post subject: Re: Keyword extractor's source code....where I can find it??

Sorry, maybe I am making a silly mistake. If I replace
return "\n" + word + " [" + count + "]";
with
return count=" "+word;

I get an error like "found: java.lang.String required: int",
because count is an int and word is a String, and the function is
required to return a String. What can I do?



glen herrmannsfeldt
Guest





Posted: Wed Jan 17, 2007 3:25 pm    Post subject: Re: Keyword extractor's source code....where I can find it??

giugy wrote:

Quote:
Sorry, maybe I am making a silly mistake. If I replace
return "\n" + word + " [" + count + "]";
with
return count=" "+word;

I get an error like "found: java.lang.String required: int",

Sorry, it was supposed to say return count+" "+word;

In both the original and this one, the int is converted to String.
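
A small illustration of that conversion, with made-up names (CounterToString, format) that are not part of the posted program:

```java
// Hypothetical demo class showing int-to-String conversion in concatenation.
class CounterToString {
    static String format(int count, String word) {
        // With '+', a String operand makes the int convert to its decimal text.
        // Writing count = " " + word instead would be an assignment of a String
        // to an int variable, which the compiler rejects.
        return count + " " + word;
    }

    public static void main(String[] args) {
        System.out.println(format(3, "java")); // prints "3 java"
    }
}
```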

By the way, you don't need to post three times for us to read it.

-- glen