We usually try to give some kind of interesting background story to motivate projects: You know, animals, the environment, superheroes, or anything else more interesting than just computer science. This project essentially defies our abilities to come up with a clever story. Owing to the language oriented tie-in, we'll let you come up with the clever story for us, Mad Lib style. By the way, that's Mad Lib and not Madlib.
Evil Dr. last name is cooking up a plot which will destroy all plural noun on Earth. Deep inside his underground laboratory in European country , he is forming a/an adjective plan to attack Earth with demon plural noun from the number th dimension. Someone must talk some sense into this adjective man! But, it must be done carefully. Dr. same last name is adjective about proper spelling. Because of his mental disorder , he is likely to crack completely and destroy the Earth without a second thought if he even glances at a message with misspelled words in it. What are we to do?
The obvious solution is to have you brave CS190M students come up with a spell-checker. We're going to break it into 4 parts:
Before you can parse the words out of the files, you need to know what files you want to open, and then open them. For this project, the names of the files will be passed in as command line arguments. For example:
java SpellChecker document.txt dictionary.txt 4
In this example, "document.txt" is the name of the document we want to check, and "dictionary.txt" is the name of the dictionary we want to use to do the checking. Finally, the value "4" is the number of threads to be used. A String holding each value is passed as an element of the args array parameter of the main() method. Do not hardcode any of these values! If the args array is not exactly 3 elements long, your program should print a usage message and simply exit. If either the document or the dictionary file cannot be opened (perhaps it doesn't exist), your program should print an appropriate error message to standard output before it exits.
Clearly, your driver code will be located in a class called SpellChecker.java.
You need to parse the document file into individual words. You will need to keep track of the number of times each word is present in the input file and store these words alphabetically in your HashList structure which we will cover in detail in the Storing the Words section. You should ignore capitalization differences between words and store each word in lower case. For the document file, you should consider a word to be any contiguous sequence of letters (A-Z and a-z). A word cannot contain any non-letter characters. Examples of how some words should be parsed are shown below.
A larger input file for you to test with can be found here.
You may need to examine input character by character in order to parse it properly.
You do not need to do any special parsing for the dictionary file. You can assume that a dictionary file consists of a list of unique words, where each individual word is on a line by itself.
So, you need to store these words somehow. We require you to store them in a kind of a hash table called HashList which you will implement as an array of 26 WordList objects. A hash table is a way of organizing information to make a "look up", i.e., to see if a value already exists, efficient. Usually a hash table involves an array. Adding a piece of data to a hash table involves using a hash function to decide which element of the array to put the piece of data into.
In our case, we are going to keep things simple. Inside a HashList object, each of the 26 elements of the array of WordList objects corresponds to a letter of the alphabet. Every legal word must start with some letter between 'a' and 'z'. Somehow we want to store a word in the element of the array which corresponds to the letter it starts with. How will we do that? Each WordList object contains a linked list. Below is a graphic illustrating what the data structure might look like:
Although WordList is the class containing the linked list, you also must implement another class WordNode to hold the data for each node in each list.
Class HashList must implement the following methods:
void add(String word)String to the WordList located at the index corresponding to the first character of word by calling its add() method in turn.boolean find( String word )true if word can be found in the WordList corresponding to the first character of word and false otherwise.WordNode get( int index )WordNode at location index counting all the entries in the entire hash table structure. It does so by determining which WordList object the given index should be inside of and then indexes into the appropriate space inside of it.
int getSize()
Class WordList must implement the following methods:
void add(String word)String to the WordList. If word is already contained in the WordList, this method should increment the WordNode containing it. Otherwise, this method should insert a WordNode into the linked list in sorted order, thereby keeping the linked list sorted.boolean find( String word )true if there is a WordNode in the linked list whose word matches the word argument and false otherwise.WordNode get( int index )WordNode at location index in the linked list. If index is out of range, you should throw a IndexOutOfBoundsException. Note that, to hide the linked list structure from the outside world, each call to this method should create a new WordNode with a null next reference and return that instead of returning the actual WordNode from the linked list.
int getSize()
Class WordNode must implement the following methods:
WordNode getNext()WordNode. Unlike the getNext() call in WordList, this method does not return a clone.
void setNext(WordNode node)WordNode to be node.
String getWord()WordNode.int getCount()void changeCount(int value)value to the count of the number of occurrences of a given word.
First, you must produce a listing of all the words in the document. You must open the document file specified by the command line argument, parse all the words out of it, add them all to a HashList object, and then create a new file called "words.txt" which contains an alphabetical list of every word in the file, one word per line, a colon and a space, followed by the number of times the word occurred in the file. The file words.txt is sample output for the input file document.txt.
Next, in order to find misspelled words, create a HashList for the dictionary file as well and read all the words into it. Then, iterate through every word in the HashList for the document and see if it can be found in the HashList for the dictionary.
You should create an output file called "misspelled.txt", and, in alphabetical order, output each word which cannot be found in the dictionary on a single line. The file misspelled.txt is sample output for the input files document.txt and dictionary.txt.
To be a truly useful spell checker, it is necessary to suggest the word the individual might have intended. To do this, it is necessary to somehow measure how different a misspelled word is from a correctly spelled word. To do this, we use something called the Levenshtein distance, which measures the smallest number of operations (inserting a letter, deleting a letter, or changing a letter) it takes to transform one word into another.
We don't expect you to understand how the formula to compute the Levenshtein distance works, but we do expect you to be able to replicate it. The process used to find it is called dynamic programming, which is an advanced topic you will learn about more in an algorithms course. The basic idea behind dynamic programming is to compute solutions to many smaller problems and use all these stored solutions to find a solution to a larger problem.
Enough theory, let's put it into practice. We want to find the edit distance between a String which is m characters long and a String which is n characters long. Of course, m is allowed to equal n. We make a table which has m + 1 rows and n + 1 columns. Every row corresponds to one letter of the first String (plus an extra row for the empty String), and every column corresponds to one letter of the second String (plus an extra column for the empty String). An entry (i, j) in the table gives the smallest number of edits needed to change the first i characters of the first word into the first j characters of the second word. So, if we can fill out the table correctly, the (m, n) entry in the table will tell us the smallest number of edits needed to change the entire first word into the entire second word.
How do we do that? Well, row 0 gives the edits needed to turn an empty first String into various prefixes of the second String. To turn an empty String into another String, the only way to do it is to add characters. So, row 0 is initialized with values starting at 0 (the cost to change an empty String into another empty String) all the way up to n (the cost to change an empty String into the second String). Column 0 is similar. Its values range from 0 (again, the empty String to empty String cost) up to m (the cost to change an empty String into the first String).
With row and column 0 completely filled out, we can begin to fill out the rest of the table. For any given entry (i, j) in the table, its value is the minimum of three possible values: entry (i - 1, j) + 1 (corresponds to deletion), entry (i, j - 1) + 1 (corresponds to insertion), entry (i - 1, j - 1) if the (i - 1)th character of the first String is the same as the (j - 1)th character of the second String or 1 greater than that entry otherwise (corresponds to substitution). Below is a table taken from the Wikipedia article on Levenshtein distance which shows an entire table filled out to change the String "kitten" into the String "sitting". Note that the distinction between "row" and "column" is somewhat arbitrary, since an insertion is the opposite of a deletion and a substitution could go in either direction. Thus, the numbers of edits needed to change "kitten" to "sitting" is the same as the number needed to change "sitting" to "kitten".
| k | i | t | t | e | n | ||
|---|---|---|---|---|---|---|---|
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | |
| s | 1 | 1 | 2 | 3 | 4 | 5 | 6 |
| i | 2 | 2 | 1 | 2 | 3 | 4 | 5 |
| t | 3 | 3 | 2 | 1 | 2 | 3 | 4 |
| t | 4 | 4 | 3 | 2 | 1 | 2 | 3 |
| i | 5 | 5 | 4 | 3 | 2 | 2 | 3 |
| n | 6 | 6 | 5 | 4 | 3 | 3 | 2 |
| g | 7 | 7 | 6 | 5 | 4 | 4 | 3 |
Now that you know how to compute the Levenshtein distance, you know how to see how close two words are to each other. Real, industrial-strength spell checkers use heuristics more complicated than Levenshtein distance to make suggestions. (Issues like similarity of sound, keyboard distance, and other physical and psychological factors are taken into account.) For us, Levenshtein distance is plenty.
For each misspelled word, you must traverse the entire dictionary WordList and find the word with the smallest Levenshtein distance. Search the entire dictionary in order and keep the first lowest distance word you find. That is, if there is a tie between two words which both have the same Levenshtein distance, pick the one that comes first in the alphabet. Store these suggestions once you find them.
Once you have a list of all the suggestions, open a new file called suggestions.txt. Print out each misspelled word a single time, in alphabetical order, followed by a space, a hyphen, and another space, and the closest suggestion you found. The file suggestions.txt is sample output for the input files document.txt and dictionary.txt.
Oh, yes. This is a multicore course. Use threads to find the spelling suggestions. This should be a simple domain decomposition in which you divide up the list of misspelled words among 2 or more threads. Design your code to work with any reasonable number of threads specified at the command line.
You need to do 4 things:
WordList object and output the unique words from the document (and the number of times each appears) into a file called "words.txt"."misspelled.txt"containing each unique, misspelled word in alphabetical order."suggestions.txt". Use threads to find them faster.SpellChecker, HashList, WordList, and WordNode) you have been assigned. We have specified some methods as necessary, but you are free to write other methods as needed. Some thought to constructors is definitely warranted since we have given no guidelines.if statements, you are doing something wrong. Most problems we give have fairly straight-forward implementations in code. If you think your approach is too convoluted, get a sanity check from a TA. 1 minute of design == 10 minutes of coding == 100 minutes of debugging.words.txt, misspelled.txt, and suggestions.txt. Those are the files that will be diffed.
When you have tested your code and are satisfied with the results, you will need to turn in all 4 (or more) of your Java source files: SpellChecker.java, HashList.java, WordList.java, and WordNode.java. Make sure you run and test your code from the command line.
To turn in your project, navigate to the directory containing your source files. Then type the following command:
turnin -v -c cs190m -p project3 *.java
You can submit your code as many times as you want. Submit early and often.
Your project must compile with the command javac *.java and run with the command as specified above.
We will be using a Unix command called diff to grade this project. The diff command compares two files and outputs the differences to the screen. Even minor differences such as extra blank lines, a capital letter vs. a lowercase letter, or even an extra space at the end of a line are shown. Therefore, it is important that you follow the specifications given for the format of the output files. You may check your files against the given sample output files simply by running the following command in your terminal:
diff myOutputFile.txt sampleOutputFile.txt
When diff is run on two identical files, no output is printed to the screen.
Grading will be done based on the following guidelines:
30 points will be awarded based on parsing words, counting their occurrences, and printing them, correctly formatted, in alphabetical order.
You can earn up to 20 points if you correctly identify the misspelled words and list them, correctly formatted, in alphabetical order.
Another 30 points can be yours if you suggest the appropriate spellings for all words.
You can earn up to 10 points for correctly using threads and getting speedup in the Levenshtein distance computation.
As always, you should use good programming style and commenting in order to receive full credit. Every Java file should have a comment header at the top giving your name, the course, the project, the date, and a short explanation of the class(es) in the file.
Note: Submissions which do not compile will automatically score 0 points.
As always, be sure to uphold the high standards of academic honesty required by this course. Failure to do so can lead to consequences as dire as failing the course and expulsion from the school. If you have any questions about whether or not a given action is allowed, please ask your instructor or TA.