戴维·霍夫曼,英文全名David Albert Huffman(August 9, 1925 – October 7, 1999) 。
对于学过数据结构的人来说,Huffman这个名字应该不会陌生吧。David Huffman教授1999年10月7日逝世。在他的一生中,他对于有限状态自动机,开关电路,异步过程和信号设计有杰出的贡献。但是我们只是通过数据结构中的Huffman编码才了解这位杰出的科学家的。
他发明的Huffman编码能够使我们通常的数据传输数量减少到最小。这个编码的发明和这个算法一样十分引人入胜。1950年,Huffman在MIT的信息理论与编码研究生班学习。Robert Fano教授让学生们自己决定是参加期未考试还是做一个大作业。而Huffman选择了后者,原因很简单,因为解决一个大作业可能比期未考试更容易通过。这个大作业促使了Huffman以后算法的诞生。
离开MIT后,Huffman来到University of California的计算机系任教,并为此系的学术做出了许多杰出的工作。而他的算法也广泛应用于传真机,图象压缩和计算机安全领域。但是Huffman却从未为此算法申请过专利或其它相关能够为他带来经济利益的东西,他将他全部的精力放在教学上,以他自己的话来说,“我所要带来的就是我的学生。”
个人简介编辑本段回目录
Throughout his life, Huffman made significant contributions to the study of finite state machines, switching circuits, synthesis procedures, and signal designs. However, David Huffman is best known for his legendary Huffman code, a compression scheme for lossless variable length encoding. It was the result of a term paper he wrote while a graduate student at the Massachusetts Institute of Technology (MIT), where he earned a D.Sc. degree on a thesis named The Synthesis of Sequential Switching Circuits, advised by Samuel H. Caldwell (1953).[1]
"Huffman Codes" are used in nearly every application that involves the compression and transmission of digital data, such as fax machines, modems, computer networks, and high-definition television (HDTV), to name a few.
A native of Ohio, Huffman earned his B.S. in electrical engineering from Ohio State University at the age of 18 in 1944. He then served in the U.S. Navy as a radar maintenance officer on a destroyer that helped to clear mines in Japanese and Chinese waters after World War II. He subsequently earned his M.S. degree from Ohio State in 1949 and his Ph.D. from MIT in 1953, also in electrical engineering.
Huffman joined the faculty at MIT in 1953. In 1967, he went to University of California, Santa Cruz as the founding faculty member of the Computer Science Department. He played a major role in the development of the department's academic programs and the hiring of its faculty, and served as chair from 1970 to 1973. He retired in 1994, but remained active as an emeritus professor, teaching information theory and signal analysis courses.
Huffman made important contributions in many other areas, including information theory and coding, signal designs for radar and communications applications, and design procedures for asynchronous logical circuits. As an outgrowth of his work on the mathematical properties of "zero curvature Gaussian" surfaces, Huffman developed his own techniques for folding paper into unusual sculptured shapes (which gave rise to the field of computational origami).
Huffman's accomplishments earned him numerous awards and honors. Most recently, he received the 1999 Richard Hamming Medal from the Institute of Electrical and Electronics Engineers (IEEE) in recognition of his exceptional contributions to information sciences. He also received the Louis E. Levy Medal of the Franklin Institute for his doctoral thesis on sequential switching circuits, a Distinguished Alumnus Award from Ohio State University, and the W. Wallace McDowell Award. He was a charter recipient of the Computer Pioneer Award from the IEEE Computer Society, and he received a Golden Jubilee Award for Technological Innovation from the IEEE Information Theory Society in 1998.
David Huffman died in 1999 after a 10-month battle with cancer. He was survived by his wife, Marilyn Huffman, of Santa Cruz; his former wife, Jane Ayres Huffman; their three children, Elise, Linda, and Stephen Huffman, all of Santa Cruz; a son-in-law, Jeff Grubb, of Santa Cruz; a stepdaughter, Marti Homer Kehlet, of Sacramento, her husband, Daret, and their daughter, Karsen; a stepson, Darin Homer of Prunedale, his wife, Jane, and their son, Ryan; and a brother, Donald Huffman, of Westerville, Ohio, his wife, Jean, and their family.
Huffman never tried to patent an invention from his work. Instead, he concentrated his efforts on education. In Huffman's words, "My products are my students."
While a graduate student at MIT, he was a student in Claude Shannon's class. Huffman was not a very disciplined student and was not really into studies. During the end of the semester, Shannon gave an option to students that anyone who comes up with the best possible codes will get an A+ in the course. Huffman realized that this was the only way to get an A+ in the course. He set about working to discover the most optimal code rather than studying. He did discover what we know as Huffman codes and got an A+ in the course.
他发明了著名的“霍夫曼编码”编辑本段回目录
著名的“霍夫曼编码”的发明人戴维·霍夫曼(David Albert Huffman)已于1999年10月17日因癌症去世,享年74岁。但他作为信息论的先驱,对计算机科学、通信等学科所作出的巨大贡献将永远为世人所铭记。
霍夫曼1925年生于俄亥俄州,从小聪慧好学,他在俄亥俄州立大学毕业时只有17岁。然后他进入MIT一边工作,一边深造,霍夫曼编码就是他在1952年做博士论文时发明的。这是一种根据字母的使用频率而设计的变长码,能大大提高信息的传输效率,至今仍有广泛的应用。为了说明霍夫曼编码的原理,我们以仅有6个字符的信息传输为例。在等长编码情况下,每个字符需要3个二进制位。在字符出现概率不同的情况下,我们可以设计出不等长的码,如图所示,从而大大降低发送一个讯息所需要的总字符数。因为如图所示的编码,虽然每个字符所需的二进制位的算术平均数为3.33,大于等长编码,但按出现概率的加权平均值,则为2.05,通信开销减少将近1/3。
变长码的关键问题是如何为每个字符设计编码,使得接收方在收到报文后能正确地译码,也就是能正确判断现在收到的一个二进制位是前一字符的末位还是一个新的字符的首位。在上面这个6个字符的简单例子中,判断当然是很容易的:每出现一个“0”,或连续出现的第5个“1”,就是一个字符的终止位。对于实际系统,字符集相当大,编码设计就没有那么容易而变得十分复杂了。但是霍夫曼成功地解决了这个难题,使变长码得以被实际采用。
霍夫曼编码方法的具体过程是:首先把信源的各个输出符号序列按概率递降的顺序排列起来,求其中概率最小的两个序列的概率之和,并把这个概率之和看做是一个符号序列的概率,再与其他序列依概率递降顺序排列(参与求概率之和的这两个序列不再出现在新的排列之中)。然后,对参与概率求和的两个符号序列分别赋予二进制数字0和1。继续这样的操作,直到剩下一个以1为概率的符号厅列。最后,按照与编码过程相反的顺序读出各个符号序列所对应的二进制数字组,就可分别得到各该符号序列的码字。
例如,某个离散无记忆信源的输出符号序列及其对应的概率分布为。对这些输出符号序列进行霍夫曼编码的具体步骤和结果如图所示。
除了霍夫曼编码外,霍夫曼在其他方面也还有不少创造,比如他设计的二叉最优搜索树算法就被认为是同类算法中效率最高的,因而被命名为霍夫曼算法,是动态规划(dynamic programming)的一个范例。
霍夫曼在MIT一直工作到1967年。之后他转入加州大学的Santa Cruz分校,是该校计算机科学系的创始人,1970—1973年任系主任。1994年霍夫曼退休。
霍夫曼除了获得计算机先驱奖以外,还在1973年获得IEEE的McDowell奖。1998年IEEE下属的信息论分会为纪念信息论创立50周年,授予他Golden Jubilee奖。霍夫曼去世前不久的1999年6月,他还荣获以哈明码发明人命名的哈明奖章(Hamming Medal)。
关于Huffman 压缩编辑本段回目录
0.原理
Huffman编码是一种可变长编码方式,是由美国数学家David Huffman创立的,是二叉树的一种特殊转化形式。编码的原理是:将使用次数多的代码转换成长度较短的代码,而使用次数少的可以使用较长的编码,并且保持编码的唯一可解性。Huffman算法的最根本的原则是:累计的(字符的统计数字*字符的编码长度)为最小,也就是权值(字符的统计数字*字符的编码长度)的和最小。
1.Huffman树
Huffman树是二叉树的一种特殊转化形式。以下是构件Huffman树的例子:
比如有以下数据, ABFACGCAHGBBAACECDFGFAAEABBB
先进行统计A(8) B(6) C(4) D(1) E(2) F(3) G(3) H(1) 括号里面的是统计次数
生成Huffman树:每次取最小的那两个节点(node)合并成一个节点(node),并且将累计数值相加作为新的接点的累计数值,最顶层的是根节点(root) 注:列表中最小节点的是指包括合并了的节点在内的所有节点,已经合并的节点不在列表中
运算的过程如下:
1:D+H(2)
2:DE+H(4)
3:F+G(6)
4:C+DEH(8)
5:B+FG(12)
6:A+CDEH(16)
7:ACDEH+BFG(28)
那么转化为Huffman树就是
Huffman树 层数
Root
┌┴┐
ACDEH BFG 1
┌┴┐┌┴┐
CDEH A B FG 2
┌┴┐ ┌┴┐
DEH C F G 3
┌┴┐
DH E 4
┌┴┐
D H 5
取左面是1 右面是0 则有。
注:层数就是位数或者说是代码长度,权值=代码长度*该字的统计次数。
代码 位数 权值
A = 10 2 16
B = 01 2 12
C = 110 3 12
D = 11111 5 5
E = 1110 4 8
F = 001 3 9
G = 000 2 6
H = 11110 5 5
可以看出Huffman代码是唯一可解的(uniquely decodable),如果你读到110就一定是C ,不会有任何一个编码是以110开始的。
如果不使用Huffman算法,而使用普通的编码,结果是什么呢?
Huffman树 层数
Root
┌┴┐
ABCD EFGH 1
┌┴┐ ┌┴┐
AB CD EF GH 2
┌┴┐┌┴┐┌┴┐ ┌┴┐
A B C D E F G H 3
取左面是1 右面是0 则有
代码 位数 权值
A = 111 3 24
B = 110 3 18
C = 101 3 12
D = 100 3 3
E = 011 3 6
F = 010 3 9
G = 001 3 9
H = 000 3 3
利用Huffman编码得到的权值累计是 73,如果利用普通定长编码的话,则要用84字符长度。从这个比较,你可以看出,Huffman是怎么进行压缩的。
2.编码和解码
编码:将ABCDEFGH用Huffman树产生的编码对应着写到文件中,并且保留原始的Huffman树,主要是编码段的信息。一般要编码256个元素的话需要511个单位来储存Huffman树,每个Huffman树都必须有以下的结构:code,char,left,right,probability(出现次数),通常情况是利用一个数组结构。因为在解码的时候只需要用到code,所以只需要记录每个元素的编码就可以了。
解码:利用文件中保存的Huffman编码,一一对应,解读编码,把可变长编码转换为定长编码。
3.发展
由于Huffman编码需要扫描两次,第一次是统计数字,第二次是编码写文件,大大影响了速度,因此有人发明了enhanced Huffman aglorithm。这种算法只扫描一遍文件,动态产生Huffman树,即每读n个字节就重新编码一次Huffman树,以达到提高速度的目的。在解码的过程中使用动态还原技术。