
The Huffman coding compression algorithm (Huffman coding on a string)

 一路心行 2012-05-26

A couple of days ago, after posting the piece on the rsync algorithm, I wanted to look at data compression algorithms, and the classic one is Huffman coding. You have probably heard of David Huffman and his compression scheme, the Huffman code: a compression method built from character frequencies, a priority queue, and a binary tree. That binary tree is called a Huffman tree, a tree with weighted nodes. Having left school long ago, I had forgotten the algorithm, and when I searched online I could not find a Chinese article that explains it really clearly, especially the construction of the tree. Then I came across the article "A Simple Example of Huffman Code on a String", whose example is simple and easy to follow, so I am reposting it here. Note that this is not a complete translation of the original.

Let's jump straight into the example. Suppose we need to compress the following string:

 “beep boop beer!” 

First, we count how many times each character occurs, which gives us the following table:

Character   Frequency
'b'         3
'e'         4
'p'         2
' '         2
'o'         2
'r'         1
'!'         1
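
If you want to reproduce this counting step yourself, a minimal sketch (my own, not part of the article's sources) looks like this:

/* A minimal sketch (not the article's code): count how many times each
   byte occurs in the sample string and print the non-zero counts. */
#include <stdio.h>

int main(void)
{
    const char *s = "beep boop beer!";
    int freq[256] = {0};

    for (int i = 0; s[i] != '\0'; i++)
        freq[(unsigned char)s[i]]++;

    for (int c = 0; c < 256; c++)
        if (freq[c] != 0)
            printf("'%c' : %d\n", (char)c, freq[c]);

    return 0;
}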

Next, we put these entries into a priority queue, using the number of occurrences as the priority. A priority queue keeps its elements ordered by priority; when two priorities are equal, the order of appearance decides. This is the priority queue we get:

[Figure: the character nodes laid out in the priority queue, lowest frequency first]

Now comes the algorithm itself: turning this priority queue into a binary tree. We repeatedly take the two elements at the head of the queue and make them the children of a new node (the first element becomes the left child, the second the right child), add their two priorities together, and put the new node back into the priority queue (again, the priority here is simply the character count). After the first step we get the following:

[Figure: the two lowest-frequency nodes, 'r' and '!', merged under a new node with priority 1 + 1 = 2, which goes back into the queue]

In the same way, we take the next two elements from the head of the queue, form a node with priority 2 + 2 = 4, and put it back into the priority queue:

[Figure: the two lowest-priority nodes merged into a node with priority 4, which is returned to the queue]
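
If you would rather see the whole merge loop in code, here is a small self-contained sketch (again my own, not the article's pQueue-based implementation). It uses a plain array instead of a priority queue, and ties may be broken differently than in the figures, so the exact tree shape can differ even though the code lengths stay optimal:

/* A minimal sketch of the bottom-up merge loop. Each pass removes the
   two lowest-frequency nodes, joins them under a new parent whose
   frequency is their sum, and puts the parent back until only the
   root remains. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    char symbol;                 /* only meaningful for leaves */
    int freq;
    struct Node *left, *right;
} Node;

static Node *makeNode(char symbol, int freq, Node *l, Node *r)
{
    Node *n = malloc(sizeof *n);
    n->symbol = symbol; n->freq = freq; n->left = l; n->right = r;
    return n;
}

int main(void)
{
    /* leaves for "beep boop beer!" with their frequencies */
    Node *nodes[8];
    int n = 0;
    nodes[n++] = makeNode('b', 3, NULL, NULL);
    nodes[n++] = makeNode('e', 4, NULL, NULL);
    nodes[n++] = makeNode('p', 2, NULL, NULL);
    nodes[n++] = makeNode(' ', 2, NULL, NULL);
    nodes[n++] = makeNode('o', 2, NULL, NULL);
    nodes[n++] = makeNode('r', 1, NULL, NULL);
    nodes[n++] = makeNode('!', 1, NULL, NULL);

    while (n > 1) {
        /* find the indices of the two smallest frequencies */
        int a = 0, b = 1;
        if (nodes[b]->freq < nodes[a]->freq) { int t = a; a = b; b = t; }
        for (int i = 2; i < n; i++) {
            if (nodes[i]->freq < nodes[a]->freq)      { b = a; a = i; }
            else if (nodes[i]->freq < nodes[b]->freq) { b = i; }
        }
        /* merge them under a new parent (first one left, second one right) */
        Node *parent = makeNode(0, nodes[a]->freq + nodes[b]->freq,
                                nodes[a], nodes[b]);
        /* remove the two merged nodes and append the parent */
        int hi = a > b ? a : b, lo = a < b ? a : b;
        nodes[hi] = nodes[--n];
        nodes[lo] = nodes[--n];
        nodes[n++] = parent;
    }
    printf("root frequency: %d\n", nodes[0]->freq);   /* prints 15 */
    return 0;
}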

We keep running the algorithm (as you can see, this builds the tree from the bottom up):

[Figures: the remaining merge steps, each one combining the two lowest-priority nodes and putting the new node back into the queue]

In the end we obtain the following binary tree:

[Figure: the final Huffman tree for "beep boop beer!"]

Now we label every left branch with 0 and every right branch with 1. Walking from the root down to a leaf then spells out the code of that character: for example, 'b' is 00, 'p' is 101, and 'r' is 1000. Notice that the more often a character occurs, the higher it sits in the tree and the shorter its code; the rarer it is, the deeper it sits and the longer its code.

[Figure: the Huffman tree with 0 written on each left branch and 1 on each right branch]
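
The code-assignment rule is easy to express as a recursive traversal. The sketch below is my own simplification of the idea (the article's traverseTree, shown later, does the same thing but builds a lookup table instead of printing); htNode comes from the article's huffman.h:

/* A minimal sketch (not the article's traverseTree): walk the Huffman
   tree, appending '0' for each left step and '1' for each right step,
   and print the accumulated code whenever a leaf is reached. */
#include <stdio.h>
#include "huffman.h"

void printCodes(htNode *node, char *code, int depth)
{
    if (node == NULL)
        return;
    if (node->left == NULL && node->right == NULL) {   /* leaf: emit its code */
        code[depth] = '\0';
        printf("'%c' : %s\n", node->symbol, code);
        return;
    }
    code[depth] = '0';                                 /* left branch adds a 0 */
    printCodes(node->left, code, depth + 1);
    code[depth] = '1';                                 /* right branch adds a 1 */
    printCodes(node->right, code, depth + 1);
}

/* usage: char code[256]; printCodes(tree->root, code, 0); */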

This gives us the following code table:

Character   Code
'b'         00
'e'         11
'p'         101
' '         011
'o'         010
'r'         1000
'!'         1001

One thing to note: encoding is done bit by bit, and decoding works the same way. For example, given the bit string "1011110111", decoding it against the tree gives "pepe". So from this binary tree we build the lookup structures used for Huffman encoding and decoding.
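
If you want to verify the "pepe" example with the article's own code (listed further below), a small driver like the following should do it, assuming the tree built from "beep boop beer!" comes out with the codes shown above:

/* A quick check of the "pepe" example using the article's own
   buildTree() and decode() (this mirrors the main.cpp further below). */
#include "huffman.h"

int main(void)
{
    htTree *codeTree = buildTree("beep boop beer!");
    decode(codeTree, "1011110111");   /* should print: pepe */
    return 0;
}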

Another thing to note: the Huffman codes of different characters never conflict with each other, that is, no code is a prefix of another character's code. Otherwise we would be in real trouble, because the encoded bit stream contains no separators between codes.

So, for our original string beep boop beer!

its corresponding binary (plain ASCII) is: 0110 0010 0110 0101 0110 0101 0111 0000 0010 0000 0110 0010 0110 1111 0110 1111 0111 0000 0010 0000 0110 0010 0110 0101 0110 0101 0111 0010 0010 0001

and our Huffman encoding is: 0011 1110 1011 0001 0010 1010 1100 1111 1000 1001

As you can see from this example, the amount of compression is quite substantial.
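
Where does the saving come from? Adding up the code lengths from the table above (my own arithmetic, not from the article) gives exactly 40 bits versus the 15 * 8 = 120 bits of plain ASCII:

/* Adding up the Huffman-coded length of "beep boop beer!" from the
   code table above, versus the 15 * 8 = 120 bits of plain ASCII. */
#include <stdio.h>

int main(void)
{
    int bits = 3 * 2    /* 'b' -> 00   */
             + 4 * 2    /* 'e' -> 11   */
             + 2 * 3    /* 'p' -> 101  */
             + 2 * 3    /* ' ' -> 011  */
             + 2 * 3    /* 'o' -> 010  */
             + 1 * 4    /* 'r' -> 1000 */
             + 1 * 4;   /* '!' -> 1001 */
    printf("Huffman: %d bits, ASCII: %d bits\n", bits, 15 * 8);  /* 40 vs 120 */
    return 0;
}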

The author also provides the source code (built against the C99 standard): Download the source files
Source code:
huffman.cpp:
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "huffman.h"
#include "pQueue.h"
void traverseTree(htNode *treeNode, hlTable **table, int k, char code[256])
{
 //If we reach the end we introduce the code in the table
 if(treeNode->left == NULL && treeNode->right == NULL)
 {
  code[k] = '\0';
  hlNode *aux = (hlNode *)malloc(sizeof(hlNode));
  aux->code = (char *)malloc(sizeof(char)*(strlen(code)+1)); //+1 for the terminating '\0'
  strcpy(aux->code,code);
  aux->symbol = treeNode->symbol;
  aux->next = NULL;
  if((*table)->first == NULL)
  {
   (*table)->first = aux;
   (*table)->last = aux;
  }
  else
  {
   (*table)->last->next = aux;
   (*table)->last = aux;
  }
  
 }
 
 //We concatenate a 0 for each step to the left
 if(treeNode->left!=NULL)
 {
  code[k]='0';
  traverseTree(treeNode->left,table,k+1,code);
  
 }
 //We concatenate a 1 for each step to the right
 if(treeNode->right!=NULL)
 {
  code[k]='1';
  traverseTree(treeNode->right,table,k+1,code);
  
 }
}
hlTable * buildTable(htTree * huffmanTree)
{
 //We initialize the table
 hlTable *table = (hlTable *)malloc(sizeof(hlTable));
 table->first = NULL;
 table->last = NULL;
 
 //Auxiliary variables
 char code[256];
 //k keeps track of the depth (level) the traversal is at
 int k=0;
 //We traverse the tree and calculate the codes
 traverseTree(huffmanTree->root,&table,k,code);
 return table;
}
htTree * buildTree(char *inputString)
{
 //The array in which we calculate the frequency of the symbols
 //Knowing that there are only 256 possibilities of combining 8 bits
 //(256 ASCII characters)
 int * probability = (int *)malloc(sizeof(int)*256);
 
 //We initialize the array
 for(int i=0; i<256; i++)
  probability[i]=0;
 //We consider the symbol as an array index and we calculate how many times each symbol appears
 for(int i=0; inputString[i]!='\0'; i++)
  probability[(unsigned char) inputString[i]]++;
 //The queue which will hold the tree nodes
 pQueue * huffmanQueue;
 initPQueue(&huffmanQueue);
 //We create nodes for each symbol in the string
 for(int i=0; i<256; i++)
  if(probability[i]!=0)
  {
   htNode *aux = (htNode *)malloc(sizeof(htNode));
   aux->left = NULL;
   aux->right = NULL;
   aux->symbol = (char) i;
   
   addPQueue(&huffmanQueue,aux,probability[i]);
  }
 //We free the array because we don't need it anymore
 free(probability);
 //We apply the steps described in the article to build the tree
 while(huffmanQueue->size!=1)
 {
  int priority = huffmanQueue->first->priority;
  priority+=huffmanQueue->first->next->priority;
  htNode *left = getPQueue(&huffmanQueue);
  htNode *right = getPQueue(&huffmanQueue);
  htNode *newNode = (htNode *)malloc(sizeof(htNode));
  newNode->left = left;
  newNode->right = right;
  addPQueue(&huffmanQueue,newNode,priority);
 }
 //We create the tree
 htTree *tree = (htTree *) malloc(sizeof(htTree));
 tree->root = getPQueue(&huffmanQueue);
 
 return tree;
}
void encode(hlTable *table, char *stringToEncode)
{
 hlNode *traversal;
 
 printf("\nEncoding\nInput string : %s\nEncoded string : \n",stringToEncode);
 //For each element of the string traverse the table
 //and once we find the symbol we output the code for it
 for(int i=0; stringToEncode[i]!='\0'; i++)
 {
  traversal = table->first;
  while(traversal->symbol != stringToEncode[i])
   traversal = traversal->next;
  printf("%s",traversal->code);
 }
 printf("\n");
}
void decode(htTree *tree, char *stringToDecode)
{
 htNode *traversal = tree->root;
 printf("\nDecoding\nInput string : %s\nDecoded string : \n",stringToDecode);
 //For each "bit" of the string to decode
 //we take a step to the left for 0
 //or one to the right for 1
 for(int i=0; stringToDecode[i]!='\0'; i++)
 {
  if(traversal->left == NULL && traversal->right == NULL)
  {
   printf("%c",traversal->symbol);
   traversal = tree->root;
  }
  
  if(stringToDecode[i] == '0')
   traversal = traversal->left;
  if(stringToDecode[i] == '1')
   traversal = traversal->right;
  if(stringToDecode[i]!='0'&&stringToDecode[i]!='1')
  {
   printf("The input string is not coded correctly!\n");
   return;
  }
 }
 if(traversal->left == NULL && traversal->right == NULL)
 {
  printf("%c",traversal->symbol);
  traversal = tree->root;
 }
 printf("\n");
}
 
huffman.h:
#pragma once
#ifndef _HUFFMAN_H
#define _HUFFMAN_H
//The Huffman tree node definition
typedef struct _htNode {
 char symbol;
 struct _htNode *left, *right;
}htNode;
/*
 We "encapsulate" the entire tree in a structure
 because in the future we might add fields like "size"
 if we need to. This way we don't have to modify the entire
 source code.
*/
typedef struct _htTree {
 htNode *root;
}htTree;
//The Huffman table nodes (linked list implementation)
typedef struct _hlNode {
 char symbol;
 char *code;
 struct _hlNode *next;
}hlNode;
//Encapsulation of the list
typedef struct _hlTable {
 hlNode *first;
 hlNode *last;
}hlTable;
htTree * buildTree(char *inputString);
hlTable * buildTable(htTree *huffmanTree);
void encode(hlTable *table, char *stringToEncode);
void decode(htTree *tree, char *stringToDecode);
#endif
 
main.cpp:
#include <stdio.h>
#include <stdlib.h>
#include "huffman.h"
int main(void)
{
 //We build the tree depending on the string
 htTree *codeTree = buildTree("beep boop beer!");
 //We build the table depending on the Huffman tree
 hlTable *codeTable = buildTable(codeTree);
 //We encode using the Huffman table
 encode(codeTable,"beep boop beer!");
 //We decode using the Huffman tree
 //We can only decode strings that use symbols from the initial string
 decode(codeTree,"0011111000111");
 //Output : 0011 1110 1011 0001 0010 1010 1100 1111 1000 1001
 return 0;
}
 
pQueue.cpp:
#include "pQueue.h"
#include <stdlib.h>
#include <stdio.h>
void initPQueue(pQueue **queue)
{
 //We allocate memory for the priority queue type
 //and we initialize the values of the fields
 (*queue) = (pQueue *) malloc(sizeof(pQueue));
 (*queue)->first = NULL;
 (*queue)->size = 0;
 return;
}
void addPQueue(pQueue **queue, TYPE val, unsigned int priority)
{
 //If the queue is full we don't have to add the specified value.
 //We output an error message to the console and return.
 if((*queue)->size == MAX_SZ)
 {
  printf("\nQueue is full.\n");
  return;
 }
 pQueueNode *aux = (pQueueNode *)malloc(sizeof(pQueueNode));
 aux->priority = priority;
 aux->val = val;
 //If the queue is empty we add the first value.
 //We validate twice in case the structure was modified abnormally at runtime (rarely happens).
 if((*queue)->size == 0 || (*queue)->first == NULL)
 {
  aux->next = NULL;
  (*queue)->first = aux;
  (*queue)->size = 1;
  return;
 }
 else
 {
  //If there are already elements in the queue and the priority of the element
  //that we want to add is greater than the priority of the first element,
  //we'll add it in front of the first element.
  //Be careful, here we need the priorities to be in descending order
  if(priority<=(*queue)->first->priority)
  {
   aux->next = (*queue)->first;
   (*queue)->first = aux;
   (*queue)->size++;
   return;
  }
  else
  {
    //We're looking for a place to fit the element depending on its priority
   pQueueNode * iterator = (*queue)->first;
   while(iterator->next!=NULL)
   {
     //Same as before, descending, we place the element at the beginning of the
    //sequence with the same priority for efficiency even if
    //it defeats the idea of a queue.
    if(priority<=iterator->next->priority)
    {
     aux->next = iterator->next;
     iterator->next = aux;
     (*queue)->size++;
     return;
    }
    iterator = iterator->next;
   }
   //If we reached the end and we haven't added the element,
   //we'll add it at the end of the queue.
   if(iterator->next == NULL)
   {
     aux->next = NULL;
     iterator->next = aux;
     (*queue)->size++;
     return;
   }
  }
 }
}
TYPE getPQueue(pQueue **queue)
{
 TYPE returnValue;
 //We get elements from the queue as long as it isn't empty
 if((*queue)->size>0)
 {
  returnValue = (*queue)->first->val;
  (*queue)->first = (*queue)->first->next;
  (*queue)->size--;
 }
 else
 {
  //If the queue is empty we show an error message.
  //The function will return whatever is in the memory at that time as returnValue.
   //Or you can define an error value depending on what you choose to store in the queue.
  printf("\nQueue is empty.\n");
 }
 return returnValue;
}
 
pQueue.h:
#pragma once
#ifndef _PQUEUE_H
#define _PQUEUE_H
#include "huffman.h"
//We modify the data type to hold pointers to Huffman tree nodes
#define TYPE htNode *
#define MAX_SZ 256
typedef struct _pQueueNode {
 TYPE val;
 unsigned int priority;
 struct _pQueueNode *next;
}pQueueNode;
typedef struct _pQueue {
 unsigned int size;
 pQueueNode *first;
}pQueue;
void initPQueue(pQueue **queue);
void addPQueue(pQueue **queue, TYPE val, unsigned int priority);
TYPE getPQueue(pQueue **queue);
#endif
 
 
The original article

A simple example of Huffman coding on a string

You’ve probably heard about David Huffman and his popular compression algorithm. If you didn’t, you’ll find that info on the Internet. I will not bore you with history or math lessons in this article. I’m going to try to show you a practical example of this algorithm applied to a character string. This application will only generate console output representing the code values for the symbols inputted and generate the original symbols from a given code.

The source code attached to this article will show you how Huffman Coding works so you will get a basic understanding of it. This is for the people who have difficulty understanding the mathematics of it. In a future article (I hope) we’ll be talking about how to apply this to any files to produce their compressed format. (A simple file archiver like WinZip or WinRAR.)

The idea behind Huffman coding is based upon the frequency of a symbol in a sequence. The symbol that is the most frequent in that sequence gets a very short code, and the least frequent symbol gets a very long code, so that when we translate the input we want to encode, the most frequent symbols take less space than they used to and the least frequent symbols take more space; but because they are less frequent it doesn't matter that much. For this application I chose the symbol to be 8 bits long, so that a symbol is a character (char).

We could just as easily have chosen the symbol to be 16 bits long, so we could have grouped 2 characters together as a symbol, or 10 bits, or 20, etc. Depending on the input we expect to have, we choose the size of the symbol and the way we use it. For example, if I expect to encode raw video files, I'd choose the symbol to be the size of a pixel. Keep in mind that increasing or decreasing the size of the symbol affects the size of the code for each symbol, because the bigger the size, the more symbols you can have of that size: there are fewer ways to write the ones and zeroes on 8 bits than there are on 16 bits. You'll want to adjust the size of the symbol depending on how the ones and zeroes are likely to repeat themselves in a sequence.

For this algorithm you need to have a basic understanding of binary tree data structure and the priority queue data structure. In the source code we’ll actually use the priority queue code available in a previous article.

Let's say we have the string "beep boop beer!" which, in its actual form, occupies 1 byte of memory for each character. That means that in total it occupies 15*8 = 120 bits of memory. Through encoding, the string will occupy only 40 bits. (Theoretically, in this application we output to the console a string of 40 char elements of 0 and 1 representing the encoded version of the string in bits. For this to actually occupy 40 bits we would need to convert that string directly into bits using logical bit operations, which we'll not discuss now.)

To better understand the algorithm, we're going to apply it to an example. The string "beep boop beer!" is a very good one to illustrate it. In order to obtain the code for each element depending on its frequency, we'll need to build a binary tree such that each leaf of the tree contains a symbol (a character from the string). The tree will be built from the leaves to the root, meaning that the elements of least frequency will be farther from the root than the elements that are more frequent. You'll see soon why we chose to do this.

To build the tree this way we'll use a priority queue with a slight modification: the element with the lowest priority is the most important, meaning that the least frequent elements are the first ones we get from the queue. We need to do this so we can build the tree from the leaves to the root.

Firstly we calculate the frequency of each character :

Character   Frequency
'b'         3
'e'         4
'p'         2
' '         2
'o'         2
'r'         1
'!'         1

After calculating the frequencies, we’ll create binary tree nodes for each character and we’ll introduce them in the priority queue with the frequency as priority :

We now get the first two elements from the queue and create a link between them by creating a new binary tree node that has them both as successors, so that the characters become siblings, and we add their priorities. After that we add the new node to the queue, with the sum of the priorities of its successors as its priority. (The numbers represent the priority, i.e. their frequency.)

We repeat the same steps and we get the following :

Now after we link the last two elements we’ll get the final tree :

Now, to obtain the code for each symbol we just traverse the tree until we reach that symbol, adding a 0 to the code for each step we take to the left and a 1 for each step to the right.

If we do this, we’ll get the following codes :

Character   Code
'b'         00
'e'         11
'p'         101
' '         011
'o'         010
'r'         1000
'!'         1001

To decode a string of bits we just traverse the tree for each bit: if the bit is 0 we take a step to the left and if the bit is 1 we take a step to the right, until we hit a leaf (which holds the symbol we are looking for). For example, if we have the string "101 11 101 11" and our tree, decoding it will give us the string "pepe".

It's very important to observe that no code is a prefix of another symbol's code. In our example, if 00 is the code for 'b', 000 cannot be a code for any other symbol, because there would be a conflict: after taking the steps for the first two bits we'd already reach 'b', so we'd never get to the symbol for 000.

A practical aspect of implementing this algorithm is to consider building a Huffman table as soon as we have the tree. The table is basically a linked list or an array that contains each symbol with its code, because it makes encoding more efficient: it's hard to look for a symbol by traversing the tree and calculating its code at the same time, since we don't know where exactly in the tree that symbol is located. As a principle, we use a Huffman table for encoding and a Huffman tree for decoding.

The input string : beep boop beer!

The input string in binary : 0110 0010 0110 0101 0110 0101 0111 0000 0010 0000 0110 0010 0110 1111 0110 1111 0111 0000 0010 0000 0110 0010 0110 0101 0110 0101 0111 0010 0010 0001

The encoded string : 0011 1110 1011 0001 0010 1010 1100 1111 1000 1001

As you can see, there is a major difference between the ASCII version of the string and the Huffman coded version.

The source code behaves as described above. You’ll find more details in the comments in the code.

All the sources have been compiled and verified using the C99 standard. Happy programming.

Download the source files

To be clearer, since there have been comments on this: this article only illustrates how the algorithm works. To use this in a real scenario you need to attach the Huffman tree you created to the end or the start of your encoded string, and the receiver has to know how to interpret it in order to decode the message. A good way to do this is to build a string of bits by traversing the tree in any order you want (I prefer post-order) and concatenate a 0 for each node that is not a leaf and a 1 for each node that is a leaf, followed by the bits representing the original symbol (in our case, 8 bits representing the ASCII char value). Placing this string of bits at the start of your encoded message is ideal. Once the receiver has built the tree, they will know how to decode the message and obtain the original one.
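
As a rough illustration of the scheme described above (my own sketch, not part of the downloadable sources), a post-order serialization could look like this, writing the bits out as '0'/'1' characters the same way the rest of the article does; htNode comes from huffman.h:

/* A minimal sketch of the tree serialization described above:
   post-order traversal, '0' for an internal node, '1' followed by the
   8 bits of the symbol for a leaf. */
#include <stdio.h>
#include "huffman.h"

void serializeTree(htNode *node, FILE *out)
{
    if (node == NULL)
        return;
    serializeTree(node->left, out);            /* children first (post-order) */
    serializeTree(node->right, out);
    if (node->left == NULL && node->right == NULL) {
        fputc('1', out);                       /* leaf marker */
        for (int bit = 7; bit >= 0; bit--)     /* then the symbol, MSB first */
            fputc(((node->symbol >> bit) & 1) ? '1' : '0', out);
    } else {
        fputc('0', out);                       /* internal node marker */
    }
}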

To read about a more advanced version of this algorithm, you can read our article on Adaptive Huffman Coding.

 
