CS240 Project 4

Message Compression and Encryption

Motivation:

Efficiency and Security are two major issues in data communications. In order to improve performance in data transmission and also save band-width it helps to compress each message we transmit. In addition to message compression, it is possible to use cryptographic methods to further guarantee the safety of data by preventing the unauthorized interception and decoding of message information.

In information theory, there is a important concept called entropy. It is defined as

H = - p(c1)log2(p(c1)) - p(c2)log2(p(c2)) - ... - p(cn)log2(p(cn))

where p(ci) denotes the rate of character i in the message, n denotes the number of different characters and log2() denotes the logrithm function with base 2. For example, if the message is "aabc", then the number of character is 3, the rate of a is 2/4=0.5, the rate of b is 1/4=0.25 and the rate of c is 1/4=0.25. So the entropy of the message is

H = - 0.5log2(0.5) - 0.25log2(0.25) - 0.25log(0.25) = 1.5

The entropy of a message is important, because it is the lower bound (minimum) of the average code length to transmit the message without losing information. In C programming, one character occupies one char (8 bits). The average code length to transmit any message is therefore 8. In the example message "aabc", it is much bigger than the entropy 1.5. So it is obviously a waste. We may seek some methods to have lower average code length (get close to the lower bound) in order to compress the message. One of the method is Huffman coding. In Huffman coding, different characters have different lengths. The idea is as follows. If some character (say 'e' ) occurs frequently in the message, it makes sense to use fewer bits to represent it. Conversely, if some character (say '&') occurs only once or twice in the entire message text, we can afford to use more bits to represent it. Thus, the basic idea of Huffman encoding is to choose various bit-lengths for different characters depending on their frequency of occurrence in the text. Let's come back to the example message: by using Huffman coding, "a" will have the code "0", "b" will have the code "01" and "c" will have the code "11". So the average code length is

0.5*1 + 0.25*2 + 0.25*2 = 1.5

It is exactly the same as the entropy! So by using Huffman coding, we can achieve the lower bound of the average code length in some cases, though we cannot in general. And we do compress the message.

Once this encoding is done, a simple algorithm called RC4 can be used to encrypt the message. The basic principle of RC4 is: If A XOR B = C, then B XOR C = A. (XOR is the bit-wise exclusive-or operation, i.e., 1 XOR 1 = 0, 0 XOR 0 = 0, 1 XOR 0 = 1, 0 XOR 1 = 1). So if the user provides a secret key called K and the plaintext M, we can use some function to generate K' using K so that K' has the same length as M. Then we compute S = K' XOR M. S is the cipher text, meaning that it has been protected by this cryptographic encoding. To decrypt the message, we must first get K' using the key provided by the user, and then compute M = S XOR K'. In this way we recover M which is the original plaintext. The algorithm RC4 uses certain mechanisms to generate K' for every character, and so it actually does the XOR operation character by character.

In this project, you will write a program to implement Huffman coding and RC4. You will also explore the relationship between the entropy and the average code length obtained by Huffman coding.

Basic Knowledge:

How to build a Huffman coding table?

1. Count how many times each character appears in the input string. Because there are at most 256 possibilities for a char, so we can use an int[256] to record the frequency time for each character.

For example, if the input string is qzzzzqqdaaaqqbcdzdqqdabcazqqczcbddzqqq, then the character counts would be a:5,b:3,c:4,d:6,q:12,z:8 and zero for all other characters. Now, these six characters can be seen as the six tree nodes with their occurrence times in the nodes:

2. Select two nodes with the smallest frequency (b and c in this case) and combine them in a new binary tree. The frequency of the root of the binary tree is the sum of the two selected nodes. Now the entire tree can be seen as a new node with the frequency of 7.

Some ambiguities exist here. If there are 3 characters of 'a' , 'b', 'c' that all have the frequency of 4, which two of them should we select? If we select a certain two, which one of these two should be the left child of the newly created binary tree? To get around such problems we impose the following rules. The rules are meant to ensure that the final Huffman coding tree is unique.

Rule1: A node a is lighter than node b if node a has smaller frequency than node b. If they have the same frequency, then the node which has a smaller ASCII value is lighter. So 'a':4 is lighter than 'b':4, and 'z':2 is lighter than 'a':4. In this step, we always select the two lightest nodes.

Rule2: When we combine two nodes into a binary tree, the frequency of the root node of the tree is the sum of the frequencies of its two children. The ASCII value of the root node is the smaller of the ASCII value of the two children.

Rule3: When constructing the binary tree, we always place the lighter node on the left branch and the heavier node on the right branch. By applying these rules, nodes will look like this:

Next, we select the two lightest nodes, i.e., a:5 and d:6, and combine them again:

We repeat this step until there is only one tree left:

         

The final Huffman tree is:

      

The six leaves of the Huffman tree are the six original nodes in step 1, while all other nodes are intermediate nodes.

After the Huffman tree has been built, we can write the bitcode for a given character by following the path from the root of the tree to the leaf containing the character. The bit on each left branch is 0 while the bit on each right branch is 1.

So in the example given above, the Huffman code is as follow:

a:100  b:000  c:001  d:101  q:11  z:01 

Huffman Decoding

Huffman decoding is much simpler than encoding. Once you have built the tree, you read one bit at a time from the encoded message, and use these bits to walk down the binary tree starting at the root. Every time you read a "0" bit, move down to the left, and every time you read a "1" bit, move down to the right. When you reach a leaf, you know that you have just read the bitcode for the symbol corresponding to that leaf . Then, start back up at the root of the tree, and begin reading in bits again to decode the next symbol. Repeat this process until you have decoded all the symbols you are supposed to receive. If you run out of bits in the encoded file before you decode all the symbols you are expecting, then something is wrong! Also, if the bit you read tells you to go left/right but the left/right pointer of the current node is NULL, then something is wrong!

For example, given a bit stream of 1001101001000000111110010100011 and the Huffman tree shown above, we can decode it to characters:

1. The first three bits 100 tell us to go from the root to leaf a, so first character is a.

2. Starting from the root again, the next two bits tell us to reach the leaf of q, so the second character is q.

...

At the end we arrive at the string "aqzcbbqqadbq". All the bits in the bit stream are used up. There'll be no ambiguity in decoding because every character appears as a leaf in the Huffman tree.

RC4 Encryption:

You can find the details of RC4 at http://en.wikipedia.org/wiki/RC4

The implementation of RC4 is very simple. It has two phases:

At first phase, an integer array S[] is initialized according to the key the user inputs

for i from 0 to 255
    S[i] := i
endfor
j := 0
for i from 0 to 255
    j := (j + S[i] + key[i mod keylength]) mod 256
    swap(S[i],S[j])
endfor

At second phase, each input character is XORed with one element in S[].
i := 0
j := 0
k := 0
while k<input_length:
    i := (i + 1) mod 256
    j := (j + S[i]) mod 256
    swap(S[i],S[j])
    output S[(S[i] + S[j]) mod 256] ^ input[k++]
endwhile

The Output is the encrypted message.

You can use the Test Vectors from http://en.wikipedia.org/wiki/RC4 to test the correctness of your own RC4.

RC4 Decryption:

Decryption is exactly the same as Encryption due to the special feature of the XOR operation. By encrypting the encrypted message, we can get the decrypted message.

Sample Run:

The program should display a menu after being executed.

Please Selection an Option:

1. Compress and Encrypt Message

2. Decrypt and Uncompress Message

3. Exit

Compress and Encrypt:

If the user selects 1, then he is required to input the message to be compressed and encrypted, as well as the key for encryption:

Please Input the Message:
It will not return to me empty, but will accomplish what I desire and achieve the purpose for which I sent it.

Please Input the Key:
God

Then your program should build a Huffman tree according to the frequencies of characters in the message. The Huffman code for each character should be printed on the screen by ASCII value order:

HUFFMAN Code Table:
the character   has freq 21 and code is 00
the character , has freq 1 and code is 1000100
the character . has freq 1 and code is 1000101
the character I has freq 3 and code is 01100
the character a has freq 4 and code is 10100
the character b has freq 1 and code is 1000110
the character c has freq 4 and code is 10101
the character d has freq 2 and code is 101100
the character e has freq 10 and code is 1110
the character f has freq 1 and code is 1000111
the character h has freq 6 and code is 0111
the character i has freq 7 and code is 1001
the character l has freq 5 and code is 11011
the character m has freq 3 and code is 01101
the character n has freq 4 and code is 10111
the character o has freq 5 and code is 11110
the character p has freq 4 and code is 11000
the character r has freq 5 and code is 11111
the character s has freq 4 and code is 11001
the character t has freq 10 and code is 010
the character u has freq 3 and code is 10000
the character v has freq 1 and code is 1011010
the character w has freq 4 and code is 11010
the character y has freq 1 and code is 1011011

With the aid of the bitcodes in the Huffman code table you can encode the original message. We can't guarantee that the bit length of the encoded message is divisible by 8. If it is not, we have to use padding (i.e., add 0’s) to the end of the encoded message. For example, if the length of the encoded message is 52, we should add an extra 4 0’s to the end of the message so that it becomes a 56-bit message. After this padding is done print out the encoded message in hexadecimal mode.

The HUFFMAN encoded string is:
62353BD97F23FCA1FB8BC378E6E15B884682353BD94AD7CDC6E72E69E886167667FC52F614ABCF5AE13F1887F1ECF11FDF34F3571867AE895140

You also have to output the number of padding bits, the entropy of the message, the average code length of the message after Huffman coding and the compression ratio. The compression ratio is computed as (Length of the encoded message)/(length of the original message). To calculate the entropy, you may need to use the logrithm function. You can include math.h in your c file and add -lm at the end of the command line to link it when you compile your c file.

Padding Bits: 6

Entropy:4.118341

Average code length:4.163636

Compression Ratio:0.527273

Then RC4 algorithm is used to further encrypt the encoding result.


RC4 Encryption Result:
991E6DE06AEE87B462B06703AD76960079F404B01464D42D497238B36BFAA95B40EE7B0003A9F52DB19AEC81AE17B25619A81C25F346323EA2AC

Pay attention to the fact that the RC4 encryption operation does not change the length of the plaintext.

Decrypt and Uncompress:

If the user selects 2, he is first required to input the encrypted message in hexadecimal mode.

Please Input the Encrypted Message:
7ADA64A22F85A610610C68F98F3620857F9093D53566A6A17203404976B85AFF76ACCEAEDE8874ADBF9EF114016AD76CCC1DFB0E924DD4A132D105AD26A964EE8A3FF5AB651C2F77FE977BAAA57F085F59EA28824FB5FE117FDB010BF82191D0B10A62A54025A3E49AEDF450BFB0FB3352348E2E0896850231294D82EBFCF848D1B8F3E73072211DE37E500EC5A99D5F051919D126EAD60C79811EA27DC06A99114E167E7212D96C400461E1C650132A8414F6E96743AEF9ACC4AB95AE55D1DEB3435F62E9F68F23C32237ED39ACB16EA134

And then he has to input the key that is used in the encryption:

Please Input the Key:
Edward Field

Following this, the user should input the frequency table: the characters appear in the message and their frequencies.

Please Input the Frequency Table:
  70 " 2 ' 1 , 9 . 1 G 1 I 2 T 1 a 24 b 8 c 12 d 12 e 39 f 4 g 10 h 13 i 21 k 1 l 16 m 10 n 16 o 25 p 8 r 16 s 13 t 27 u 4 v 3 w 6 y 15

The frequency table should be input in the format of "[c1][space][number][space][c2][space][number][space][c3][space][number]..."   (c1 is the first character and c2 is the second... space is the only separator between various elements.)Notice that there are two spaces before "70" at the beginning of the frequency table in the example above. The first space is the character and the second is the separator.

Finally, the user should input the number of padding bits so that the program can figure out the length of the actual encoded message.

Please Input the Padding Number:
4

After getting everything we need, the program outputs the result of RC4 decryption and Huffman decoding.

RC4 Decryption Result:
C3D0E7D5F3EE87B40F5DB7BF49DC7D67DD089EB3A651F448C0C7DCFC4ED8E03EE855F1CF567BC2FCE8F63A6B6C09FA99D76515B3AEAC09F3A6FC876B47708F07BD22FBDE7B5F4BEB5F3A90EE11EB11F4
48C0C7DCF8388E7C8784CEF79F8E7AB3DE17E747C4B027B2B5237A4717C76515DB44F22617BCF60B9F743D77B23D72E2C09EACF83EB6EC184AFA1DA8DA177C0C09F73EE87D69227AED74EFEAF1F5536C
A16F0D331C727AB3E2E3B604D51D3331C727C4B8F90285B7AFBA1ED03D44DD7BA5D83CE023D59F42F1EBB5D3BF86DA0B5850

Original Text:
They say the ice will hold so there I go, forced to believe them by my act of trusting people,stepping out on it, and naturally it gaps open and I, forced to carry on coolly by my act of being imperturbable,slide erectly into the water wearing my captain's helmet,waving to the shore with a sad smile,"Goodbye my darlings, goodbye dear one,"as the ice meets again over my head with a click.

Test Cases:

Here (Unix) and here (Windows) are sample programs that you can run to have a general understanding of the project.

Here (Unix) and here (Windows) are input files of the two examples above that you can use to test your program.

You can download more test cases Here (Unix). If your program can pass these test cases, you are guaranteed to have at least 40% of the points on this project

You can also generate more test cases by the sample program.

Tips and Hints:

1. You can use a data structure called a Heap to build the Huffman tree. More information about Binary Heap can be found at http://en.wikipedia.org/wiki/Binary_heap.

A Binary Heap is actually a complete binary tree. We can use a min heap (a heap in which the value in each node is smaller than the values in its two children, so that the root node has the smallest value) to generate the Huffman tree. First, insert all the characters with their frequencies into the heap. Second, extract two characters from the heap --- you’ll automatically get the two lightest nodes. Create a new node by combining both these characters as children of the node and then insert the newly created node back into the heap. If there are n characters, do step 2 at total of n-1 times so that there will only be one element left in the heap --- the root node of the Huffman tree. Because the heap is a complete binary tree, we can use an array to represent the tree structure. Also, because there are at most 256 characters, the size of the array can be fixed at 256.

2. In the program, you will use malloc() a lot. Do not forget to free memory that you no longer need.

3. If you use printf("%x") to output the hexadecimal form, pay attention when the ASCII value of the character is 0 or less than 16.(In these two cases, printf("x") can miss a 0)

4. The Steps of compressing and encrypting the message should be the following:

   (a) build a frequency table

   (b) build a Huffman tree

   (c) generate the Huffman code for each character using the Huffman tree

   (d) replace the original character with the Huffman code.

   (e) pad the encoding result with 0's if necessary

   (f) encrypt the result of step (e) using RC4

5. When you read the input from the user, you may need to enlarge the space storing the user input dynammically. This is because the input may be huge.

6. Build your project by several distinct modules, one for the Heap operation, one for the Huffman tree structure, one for the RC4 implementation, etc. Write and test these separately before you combine them together. This is the proper way to develop large projects.

7. You can assume the input is always correct. You do not have to deal with invalid input. Do not use fflush(stdin) which can cause trouble when redirecting input to a file.

Grading and Requirements

You're free to use any advanced functions you like in C. However, make sure that you really understand their usage before invoking them.

We will be using the make command to build your project. Once the grader types in make and enter, your program should be built and ready to execute. A requirement to use any other method such as command line gcc due to unavailability of correct Makefile will result in a deduction of points from your grade. The output of make should be an executable named project4.out.

When we grade, we will redirect (recall the '<' '>' operator under Unix command window) your program to read a series of expressions from a file (one on each line), and redirect your output to another file. We will then compare this output file with our results file. You can also use this technique when doing your own testing.

Modularizing your program as described earlier not only helps you to solve the problem easily but also helps you to get partial credit. When we grade we will have different test cases for the aspects specified in the specification. So if your program passes some test cases but not all you will get the points assigned to those test cases. But it is important to remember that your program should compile without errors to get these partial credits.

Proper use of comments and documentation are an important part of the software development process -- for your own understanding, and to enable someone else to understand your code. This also aids in debugging and maintenance. Try to develop good programming habits early on, and this will bear fruit later when you work with large projects. Finally, care with coding, comments and documentation etc. may help you get partial credit.

This is an individual project, and we will use software to detect similarities between submitted source codes. If any student is caught copying source codes from others, the student(s) will score zero mark in this project, and proper actions will be carried out.

Turnin

When you are satisfied that your program works correctly (or you run out of time), please do the following.

  1. Create a directory named project4

  2. Copy all program files ( *.c and *.h ) and your Makefile to the directory created in step 1.

  3. Goto the parent directory of project4 and type the following command into the terminal:

    turnin -c cs240=XXXX -p project4 project4

    where XXXX represents your lab section number.

    The turnin section should be as follows:

    Section Time TA
    0201 Thursday 15:30-17:20 Dan Zhang
    0301 Friday 09:30-11:20 Suli Xi
    0401 Friday 13:30-15:20 Youhan Fang
    0501 Thursday 09:30-11:20 J.C.Chin

    Make sure turnin reports that your project was submitted for grading. You can check the files you have submitted by running the following command:

    turnin -c cs240=XXXX -v -p project4

  4. Submit often to make sure that you did not try to submit after turnin has been turned off.

Grading Breakdown

5 points Makefile created and submitted
5 points Program can quit successfully
10 points Interface
80 points 8 test cases

 


Youhan Fang 2009-10-29