CS240 Spring 2011:LAB07


File I/O

Goal

To parse an input text file, assign a unique id to each unique word, and output the number of occurrences of each word in the file. The program also encodes the input file using the unique ids of the words.

The files for the lab can be downloaded from here. The files can be extracted by running :

tar -xvf lab07.tar
The reference program translate.org can be executed to see how your program is expected to behave for various test cases. Add your code to the file translate.c and compile it by running "make" (the Makefile is already provided).

Input

The input file contains words separated by spaces and/or newlines. The words consist only of alphanumeric characters. You can assume that there are no more than 1000 unique words in the input file. The input file name is passed as the first command line argument. Please execute the reference implementation (translate.org) to see what your program must do when the file name is not given or the given file cannot be opened.
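The input handling described above could be sketched as follows. This is only an illustration, not the reference implementation: the helper names open_input and count_tokens and the 63-character word limit are assumptions, since the handout gives no maximum word length. The messages to print on error are deliberately left out; they should match what translate.org prints.

```c
#include <stdio.h>

/* Assumption: no word is longer than 63 characters. The handout does
 * not state a limit, so adjust this if your tests need more. */
#define MAX_WORD_LEN 63

/* Hypothetical helper: returns the opened input file, or NULL when no
 * file name was given or the file cannot be opened. What to print in
 * those cases must be copied from the reference program. */
FILE *open_input(int argc, char *argv[])
{
    if (argc < 2)
        return NULL;
    return fopen(argv[1], "r");
}

/* Hypothetical helper: reads whitespace-separated words and returns
 * how many were read. "%63s" skips spaces and newlines and stores at
 * most MAX_WORD_LEN characters plus the terminating '\0'. */
int count_tokens(FILE *in)
{
    char word[MAX_WORD_LEN + 1];
    int n = 0;

    while (fscanf(in, "%63s", word) == 1)
        n++;
    return n;
}
```

In translate.c the body of the loop would process each word instead of only counting it.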

Assigning Ids to words

Each word in the input file is assigned a unique id in monotonically increasing order, with the first word in the input file getting an id of "1".
Note : Repeated occurrences of the same word do not get a new id; an id is assigned to a word only on its first occurrence (look at the example below to better understand the assignment of ids).
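One possible way to implement this rule is a small table searched linearly, which is adequate for at most 1000 unique words. The sketch below is an assumption-laden illustration, not the reference solution: the struct word_table, the function record_word, and the 63-character word limit are all hypothetical names and limits. It also compares words case-sensitively, which the handout does not specify; check translate.org to confirm.

```c
#include <string.h>

#define MAX_WORDS 1000     /* from the handout: at most 1000 unique words */
#define MAX_WORD_LEN 63    /* assumption: not stated in the handout */

/* Hypothetical table: the word stored at index i has id i + 1. */
struct word_table {
    char words[MAX_WORDS][MAX_WORD_LEN + 1];
    int counts[MAX_WORDS];
    int used;              /* number of unique words seen so far */
};

/* Records one occurrence of word and returns its id (1-based).
 * A word seen before keeps its id and only its count increases.
 * Returns 0 if the table is full. */
int record_word(struct word_table *t, const char *word)
{
    int i;

    for (i = 0; i < t->used; i++) {
        if (strcmp(t->words[i], word) == 0) {
            t->counts[i]++;
            return i + 1;
        }
    }
    if (t->used == MAX_WORDS)
        return 0;

    /* First occurrence: store the word and give it the next id. */
    strncpy(t->words[t->used], word, MAX_WORD_LEN);
    t->words[t->used][MAX_WORD_LEN] = '\0';
    t->counts[t->used] = 1;
    t->used++;
    return t->used;
}
```

The table must start zeroed (for example, declared static or initialized with {0}) so that used begins at 0.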

Output

The program needs to output two things :

  1. Create a file "output.txt". For each word in the input file, write to "output.txt" the id for that word enclosed in angle brackets ("<id>"). Once completely written, "output.txt" will contain just the ids for the words in the input file, each enclosed in angle brackets.
  2. Print to the screen words seen in the input file in the following format :

    <id> word word_count

    where word_count is the number of times word has occurred in the input file. Print one word per line.
Please look at the example below to understand better.
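The two outputs above could be produced with fprintf, roughly as in the following sketch. The helper names write_id and print_counts are hypothetical, and the sketch assumes the words and their counts are already stored in id order. Whether "output.txt" should end with a newline is not stated here, so compare against the file produced by translate.org.

```c
#include <stdio.h>

/* Hypothetical helper: appends one encoded word such as "<3>" to out,
 * with no separator and no newline. */
void write_id(FILE *out, int id)
{
    fprintf(out, "<%d>", id);
}

/* Hypothetical helper: prints one "<id> word word_count" line per
 * unique word, in id order. words[i] and counts[i] describe the word
 * whose id is i + 1. */
void print_counts(FILE *out, const char *const words[],
                  const int counts[], int n)
{
    int i;

    for (i = 0; i < n; i++)
        fprintf(out, "<%d> %s %d\n", i + 1, words[i], counts[i]);
}
```

For the lab, write_id would be called with a file opened via fopen("output.txt", "w"), and print_counts with stdout.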

Example

Consider an input file "input.txt" which contains the following text. (Note : The input file contains only alphanumeric characters and the words are separated by spaces or newlines)

CS240 is interesting CS240 is the C programming
course C programming is interesting
The program can be run with this file as the input as follows :
$ ./translate input.txt
<1> CS240 2
<2> is 3
<3> interesting 2
<4> the 1
<5> C 2
<6> programming 2
<7> course 1

Here CS240 is the first word in the input file and hence gets an id of "1", and the subsequent new words get increasing ids. Observe that the second occurrence of CS240 just increments the number of occurrences of the word and does not affect the id already assigned to CS240.

The program would also create a file "output.txt" which will contain the following :
<1><2><3><1><2><4><5><6><7><5><6><2><3>
Please make sure that the file "output.txt" does not contain any stray characters apart from the ids of the words enclosed in angle brackets.

Submit

Before you submit make sure to test your implementation on LORE using the Makefile provided.
Your code must compile using the provided Makefile and run on LORE for you to earn points for this lab.

From inside lab07, type cd .. to change the working directory to the parent directory of lab07.

In the parent directory of lab07, type turnin -v -c cs240=XXX -p lab07 lab07 to turn in your work. Replace XXX with your section number.

9:30 am - 11:20 am FF930
11:30 am - 1:20 pm FF1130
1:30 pm - 3:20 pm FF130
3:30 pm - 5:20 pm FF330
9:30 am - 11:20 am RR930
11:30 am - 1:20 pm RR1130
3:30 pm - 5:20 pm RR330
11:30 am - 1:20 pm TT1130

You may then use the command turnin -c cs240=XXX -p lab07 -v to verify your submission.

This lab is due on Monday, April 11 by 11:59 pm