Copyright Cambridge University Press 2003. On-screen viewing permitted. Printing not permitted. http://www.cambridge.org/0521642981 You can buy this book for 30 pounds or $50. See http://www.inference.phy.cam.ac.uk/mackay/itila/ for links.

Information Theory, Inference, and Learning Algorithms

David J.C. MacKay

Information Theory, Inference, and Learning Algorithms

David J.C. MacKay
[email protected]

© 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005
© Cambridge University Press 2003

Version 7.2 (fourth printing), March 28, 2005.

Please send feedback on this book via http://www.inference.phy.cam.ac.uk/mackay/itila/

Version 6.0 of this book was published by C.U.P. in September 2003. It will remain viewable on-screen on the above website, in postscript, djvu, and pdf formats. In the second printing (version 6.6) minor typos were corrected, and the book design was slightly altered to modify the placement of section numbers. In the third printing (version 7.0) minor typos were corrected, and chapter 8 was renamed ‘Dependent random variables’ (instead of ‘Correlated’). In the fourth printing (version 7.2) minor typos were corrected. (C.U.P. replace this page with their own page ii.)


Contents

      Preface                                                           v
   1  Introduction to Information Theory                                3
   2  Probability, Entropy, and Inference                              22
   3  More about Inference                                             48

I     Data Compression                                                 65
   4  The Source Coding Theorem                                        67
   5  Symbol Codes                                                     91
   6  Stream Codes                                                    110
   7  Codes for Integers                                              132

II    Noisy-Channel Coding                                            137
   8  Dependent Random Variables                                      138
   9  Communication over a Noisy Channel                              146
  10  The Noisy-Channel Coding Theorem                                162
  11  Error-Correcting Codes and Real Channels                        177

III   Further Topics in Information Theory                            191
  12  Hash Codes: Codes for Efficient Information Retrieval           193
  13  Binary Codes                                                    206
  14  Very Good Linear Codes Exist                                    229
  15  Further Exercises on Information Theory                         233
  16  Message Passing                                                 241
  17  Communication over Constrained Noiseless Channels               248
  18  Crosswords and Codebreaking                                     260
  19  Why have Sex? Information Acquisition and Evolution             269

IV    Probabilities and Inference                                     281
  20  An Example Inference Task: Clustering                           284
  21  Exact Inference by Complete Enumeration                         293
  22  Maximum Likelihood and Clustering                               300
  23  Useful Probability Distributions                                311
  24  Exact Marginalization                                           319
  25  Exact Marginalization in Trellises                              324
  26  Exact Marginalization in Graphs                                 334
  27  Laplace’s Method                                                341
  28  Model Comparison and Occam’s Razor                              343
  29  Monte Carlo Methods                                             357
  30  Efficient Monte Carlo Methods                                   387
  31  Ising Models                                                    400
  32  Exact Monte Carlo Sampling                                      413
  33  Variational Methods                                             422
  34  Independent Component Analysis and Latent Variable Modelling    437
  35  Random Inference Topics                                         445
  36  Decision Theory                                                 451
  37  Bayesian Inference and Sampling Theory                          457

V     Neural networks                                                 467
  38  Introduction to Neural Networks                                 468
  39  The Single Neuron as a Classifier                               471
  40  Capacity of a Single Neuron                                     483
  41  Learning as Inference                                           492
  42  Hopfield Networks                                               505
  43  Boltzmann Machines                                              522
  44  Supervised Learning in Multilayer Networks                      527
  45  Gaussian Processes                                              535
  46  Deconvolution                                                   549

VI    Sparse Graph Codes                                              555
  47  Low-Density Parity-Check Codes                                  557
  48  Convolutional Codes and Turbo Codes                             574
  49  Repeat–Accumulate Codes                                         582
  50  Digital Fountain Codes                                          589

VII   Appendices                                                      597
   A  Notation                                                        598
   B  Some Physics                                                    601
   C  Some Mathematics                                                605

      Bibliography                                                    613
      Index                                                           620


Preface

This book is aimed at senior undergraduates and graduate students in Engineering, Science, Mathematics, and Computing. It expects familiarity with calculus, probability theory, and linear algebra as taught in a first- or second-year undergraduate course on mathematics for scientists and engineers.

Conventional courses on information theory cover not only the beautiful theoretical ideas of Shannon, but also practical solutions to communication problems. This book goes further, bringing in Bayesian data modelling, Monte Carlo methods, variational methods, clustering algorithms, and neural networks.

Why unify information theory and machine learning? Because they are two sides of the same coin. In the 1960s, a single field, cybernetics, was populated by information theorists, computer scientists, and neuroscientists, all studying common problems. Information theory and machine learning still belong together. Brains are the ultimate compression and communication systems. And the state-of-the-art algorithms for both data compression and error-correcting codes use the same tools as machine learning.

How to use this book

The essential dependencies between chapters are indicated in the figure on the next page. An arrow from one chapter to another indicates that the second chapter requires some of the first.

Within Parts I, II, IV, and V of this book, chapters on advanced or optional topics are towards the end. All chapters of Part III are optional on a first reading, except perhaps for Chapter 16 (Message Passing).

The same system sometimes applies within a chapter: the final sections often deal with advanced topics that can be skipped on a first reading. For example in two key chapters – Chapter 4 (The Source Coding Theorem) and Chapter 10 (The Noisy-Channel Coding Theorem) – the first-time reader should detour at section 4.5 and section 10.4 respectively.

Pages vii–x show a few ways to use this book. First, I give the roadmap for a course that I teach in Cambridge: ‘Information theory, pattern recognition, and neural networks’. The book is also intended as a textbook for traditional courses in information theory. The second roadmap shows the chapters for an introductory information theory course and the third for a course aimed at an understanding of state-of-the-art error-correcting codes. The fourth roadmap shows how to use the text in a conventional course on machine learning.

[Figure: ‘Dependencies’ – the chapter dependency diagram for Chapters 1–50 (Parts I–VI), as described above.]

[Roadmap: ‘My Cambridge Course on Information Theory, Pattern Recognition, and Neural Networks’.]

[Roadmap: ‘Short Course on Information Theory’.]

[Roadmap: ‘Advanced Course on Information Theory and Coding’.]

[Roadmap: ‘A Course on Bayesian Inference and Machine Learning’.]

About the exercises

You can understand a subject only by creating it for yourself. The exercises play an essential role in this book. For guidance, each has a rating (similar to that used by Knuth (1968)) from 1 to 5 to indicate its difficulty.

In addition, exercises that are especially recommended are marked by a marginal encouraging rat. Some exercises that require the use of a computer are marked with a C.

Answers to many exercises are provided. Use them wisely. Where a solution is provided, this is indicated by including its page number alongside the difficulty rating.

Solutions to many of the other exercises will be supplied to instructors using this book in their teaching; please email [email protected]

Summary of codes for exercises

  (marginal rat)  Especially recommended
  .               Recommended
  C               Parts require a computer
  [p. 42]         Solution provided on page 42
  [1]             Simple (one minute)
  [2]             Medium (quarter hour)
  [3]             Moderately hard
  [4]             Hard
  [5]             Research project

Internet resources

The website http://www.inference.phy.cam.ac.uk/mackay/itila contains several resources:

1. Software. Teaching software that I use in lectures, interactive software, and research software, written in perl, octave, tcl, C, and gnuplot. Also some animations.
2. Corrections to the book. Thank you in advance for emailing these!
3. This book. The book is provided in postscript, pdf, and djvu formats for on-screen viewing. The same copyright restrictions apply as to a normal book.

About this edition

This is the fourth printing of the first edition. In the second printing, the design of the book was altered slightly. Page-numbering generally remained unchanged, except in chapters 1, 6, and 28, where a few paragraphs, figures, and equations moved around. All equation, section, and exercise numbers were unchanged. In the third printing, chapter 8 was renamed ‘Dependent Random Variables’, instead of ‘Correlated’, which was sloppy.


Acknowledgments

I am most grateful to the organizations who have supported me while this book gestated: the Royal Society and Darwin College who gave me a fantastic research fellowship in the early years; the University of Cambridge; the Keck Centre at the University of California in San Francisco, where I spent a productive sabbatical; and the Gatsby Charitable Foundation, whose support gave me the freedom to break out of the Escher staircase that book-writing had become.

My work has depended on the generosity of free software authors. I wrote the book in LaTeX 2ε. Three cheers for Donald Knuth and Leslie Lamport! Our computers run the GNU/Linux operating system. I use emacs, perl, and gnuplot every day. Thank you Richard Stallman, thank you Linus Torvalds, thank you everyone.

Many readers, too numerous to name here, have given feedback on the book, and to them all I extend my sincere acknowledgments. I especially wish to thank all the students and colleagues at Cambridge University who have attended my lectures on information theory and machine learning over the last nine years.

The members of the Inference research group have given immense support, and I thank them all for their generosity and patience over the last ten years: Mark Gibbs, Michelle Povinelli, Simon Wilson, Coryn Bailer-Jones, Matthew Davey, Katriona Macphee, James Miskin, David Ward, Edward Ratzer, Seb Wills, John Barry, John Winn, Phil Cowans, Hanna Wallach, Matthew Garrett, and especially Sanjoy Mahajan. Thank you too to Graeme Mitchison, Mike Cates, and Davin Yap.

Finally I would like to express my debt to my personal heroes, the mentors from whom I have learned so much: Yaser Abu-Mostafa, Andrew Blake, John Bridle, Peter Cheeseman, Steve Gull, Geoff Hinton, John Hopfield, Steve Luttrell, Robert MacKay, Bob McEliece, Radford Neal, Roger Sewell, and John Skilling.

Dedication

This book is dedicated to the campaign against the arms trade.
www.caat.org.uk

Peace cannot be kept by force. It can only be achieved through understanding. – Albert Einstein


About Chapter 1

In the first chapter, you will need to be familiar with the binomial distribution. And to solve the exercises in the text – which I urge you to do – you will need to know Stirling’s approximation for the factorial function, \( x! \simeq x^x e^{-x} \), and be able to apply it to \( \binom{N}{r} = \frac{N!}{(N-r)!\,r!} \). These topics are reviewed below.

Unfamiliar notation? See Appendix A, p.598.

The binomial distribution

Example 1.1. A bent coin has probability f of coming up heads. The coin is tossed N times. What is the probability distribution of the number of heads, r? What are the mean and variance of r?

Solution. The number of heads has a binomial distribution.

\[ P(r \mid f, N) = \binom{N}{r} f^r (1-f)^{N-r}. \]   (1.1)

The mean, E[r], and variance, var[r], of this distribution are defined by

\[ E[r] \equiv \sum_{r=0}^{N} P(r \mid f, N)\, r \]   (1.2)

\[ \mathrm{var}[r] \equiv E\!\left[ (r - E[r])^2 \right] \]   (1.3)

\[ \phantom{\mathrm{var}[r]} = E[r^2] - (E[r])^2 = \sum_{r=0}^{N} P(r \mid f, N)\, r^2 - (E[r])^2. \]   (1.4)

Rather than evaluating the sums over r in (1.2) and (1.4) directly, it is easiest to obtain the mean and variance by noting that r is the sum of N independent random variables, namely, the number of heads in the first toss (which is either zero or one), the number of heads in the second toss, and so forth. In general,

\[ E[x+y] = E[x] + E[y] \quad \text{for any random variables } x \text{ and } y; \]
\[ \mathrm{var}[x+y] = \mathrm{var}[x] + \mathrm{var}[y] \quad \text{if } x \text{ and } y \text{ are independent.} \]   (1.5)

So the mean of r is the sum of the means of those random variables, and the variance of r is the sum of their variances. The mean number of heads in a single toss is f × 1 + (1 − f) × 0 = f, and the variance of the number of heads in a single toss is

\[ \left[ f \times 1^2 + (1-f) \times 0^2 \right] - f^2 = f - f^2 = f(1-f), \]   (1.6)

so the mean and variance of r are:

\[ E[r] = Nf \quad \text{and} \quad \mathrm{var}[r] = Nf(1-f). \]   (1.7)

Figure 1.1. The binomial distribution P(r | f = 0.3, N = 10).
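As a quick check of (1.7), here is a short sketch of my own (not part of the book) that evaluates the sums (1.2) and (1.4) directly and compares them with Nf and Nf(1−f):

```python
from math import comb

def binomial_pmf(r, f, N):
    """P(r | f, N) of equation (1.1)."""
    return comb(N, r) * f**r * (1 - f)**(N - r)

def mean_and_variance(f, N):
    """Evaluate the sums (1.2) and (1.4) directly."""
    mean = sum(binomial_pmf(r, f, N) * r for r in range(N + 1))
    second_moment = sum(binomial_pmf(r, f, N) * r**2 for r in range(N + 1))
    return mean, second_moment - mean**2

f, N = 0.3, 10
print(mean_and_variance(f, N))    # (3.0, 2.1), up to floating-point rounding
print(N * f, N * f * (1 - f))     # 3.0 and 2.1, as in (1.7)
```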


Approximating x! and \( \binom{N}{r} \)

Let’s derive Stirling’s approximation by an unconventional route. We start from the Poisson distribution with mean λ,

\[ P(r \mid \lambda) = e^{-\lambda} \frac{\lambda^r}{r!} \qquad r \in \{0, 1, 2, \ldots\}. \]   (1.8)

Figure 1.2. The Poisson distribution P(r | λ = 15).

For large λ, this distribution is well approximated – at least in the vicinity of r ≃ λ – by a Gaussian distribution with mean λ and variance λ:

\[ e^{-\lambda} \frac{\lambda^r}{r!} \simeq \frac{1}{\sqrt{2\pi\lambda}}\, e^{-\frac{(r-\lambda)^2}{2\lambda}}. \]   (1.9)

Let’s plug r = λ into this formula, then rearrange it:

\[ e^{-\lambda} \frac{\lambda^\lambda}{\lambda!} \simeq \frac{1}{\sqrt{2\pi\lambda}} \]   (1.10)

\[ \Rightarrow \quad \lambda! \simeq \lambda^\lambda e^{-\lambda} \sqrt{2\pi\lambda}. \]   (1.11)

This is Stirling’s approximation for the factorial function:

\[ x! \simeq x^x e^{-x} \sqrt{2\pi x} \quad \Leftrightarrow \quad \ln x! \simeq x \ln x - x + \tfrac{1}{2} \ln 2\pi x. \]   (1.12)

We have derived not only the leading order behaviour, x! ≃ x^x e^{−x}, but also, at no cost, the next-order correction term \( \sqrt{2\pi x} \). We now apply Stirling’s approximation to \( \ln \binom{N}{r} \):

\[ \ln \binom{N}{r} \equiv \ln \frac{N!}{(N-r)!\,r!} \simeq (N-r) \ln \frac{N}{N-r} + r \ln \frac{N}{r}. \]   (1.13)

Since all the terms in this equation are logarithms, this result can be rewritten in any base. We will denote natural logarithms (log_e) by ‘ln’, and logarithms to base 2 (log_2) by ‘log’. (Recall that \( \log_2 x = \frac{\log_e x}{\log_e 2} \); note that \( \frac{\partial \log_2 x}{\partial x} = \frac{1}{\log_e 2}\,\frac{1}{x} \).)

If we introduce the binary entropy function,

\[ H_2(x) \equiv x \log \frac{1}{x} + (1-x) \log \frac{1}{(1-x)}, \]   (1.14)

then we can rewrite the approximation (1.13) as

\[ \log \binom{N}{r} \simeq N H_2(r/N), \]   (1.15)

or, equivalently,

\[ \binom{N}{r} \simeq 2^{N H_2(r/N)}. \]   (1.16)

Figure 1.3. The binary entropy function H_2(x).

If we need a more accurate approximation, we can include terms of the next order from Stirling’s approximation (1.12):

\[ \log \binom{N}{r} \simeq N H_2(r/N) - \tfrac{1}{2} \log \left[ 2\pi N\, \frac{N-r}{N}\, \frac{r}{N} \right]. \]   (1.17)
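A quick numerical check of (1.12) and (1.15)–(1.17), a sketch of my own rather than the book's, for N = 100 and r = 30:

```python
from math import comb, e, factorial, log2, pi, sqrt

def H2(x):
    """Binary entropy function, equation (1.14), in bits."""
    return x * log2(1 / x) + (1 - x) * log2(1 / (1 - x))

N, r = 100, 30

# Stirling's approximation (1.12): the ratio to the exact factorial is close to 1.
stirling = N**N * e**(-N) * sqrt(2 * pi * N)
print(factorial(N) / stirling)                        # about 1.0008

# Leading-order approximation (1.15)/(1.16): log2 C(N, r) is roughly N H2(r/N).
print(log2(comb(N, r)), N * H2(r / N))                # about 84.6 versus 88.1 bits

# The next-order correction (1.17) brings the estimate much closer.
correction = -0.5 * log2(2 * pi * N * ((N - r) / N) * (r / N))
print(N * H2(r / N) + correction)                     # about 84.6 bits
```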


1  Introduction to Information Theory

    The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point.
    (Claude Shannon, 1948)

In the first half of this book we study how to measure information content; we learn how to compress data; and we learn how to communicate perfectly over imperfect communication channels. We start by getting a feeling for this last problem.

1.1 How can we achieve perfect communication over an imperfect, noisy communication channel?

Some examples of noisy communication channels are:

• an analogue telephone line, over which two modems communicate digital information;
• the radio communication link from Galileo, the Jupiter-orbiting spacecraft, to earth;
• reproducing cells, in which the daughter cells’ DNA contains information from the parent cells;
• a disk drive.

The last example shows that communication doesn’t have to involve information going from one place to another. When we write a file on a disk drive, we’ll read it off in the same location – but at a later time.

These channels are noisy. A telephone line suffers from cross-talk with other lines; the hardware in the line distorts and adds noise to the transmitted signal. The deep space network that listens to Galileo’s puny transmitter receives background radiation from terrestrial and cosmic sources. DNA is subject to mutations and damage. A disk drive, which writes a binary digit (a one or zero, also known as a bit) by aligning a patch of magnetic material in one of two orientations, may later fail to read out the stored binary digit: the patch of material might spontaneously flip magnetization, or a glitch of background noise might cause the reading circuit to report the wrong value for the binary digit, or the writing head might not induce the magnetization in the first place because of interference from neighbouring bits.

In all these cases, if we transmit data, e.g., a string of bits, over the channel, there is some probability that the received message will not be identical to the


transmitted message. We would prefer to have a communication channel for which this probability was zero – or so close to zero that for practical purposes it is indistinguishable from zero. Let’s consider a noisy disk drive that transmits each bit correctly with probability (1−f ) and incorrectly with probability f . This model communication channel is known as the binary symmetric channel (figure 1.4).

Its transition probabilities are

    P(y = 0 | x = 0) = 1 − f;     P(y = 0 | x = 1) = f;
    P(y = 1 | x = 0) = f;         P(y = 1 | x = 1) = 1 − f.
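A binary symmetric channel is straightforward to simulate. The sketch below is my own illustration (not the book's software); it flips each transmitted bit independently with probability f:

```python
import random

def binary_symmetric_channel(bits, f, rng=random):
    """Flip each bit independently with probability f (figure 1.4)."""
    return [b ^ (rng.random() < f) for b in bits]

# With f = 0.1, roughly ten per cent of the bits are flipped, as in figure 1.5.
random.seed(0)
x = [random.randint(0, 1) for _ in range(10_000)]
y = binary_symmetric_channel(x, f=0.1)
print(sum(xi != yi for xi, yi in zip(x, y)) / len(x))   # close to 0.1
```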

As an example, let’s imagine that f = 0.1, that is, ten per cent of the bits are flipped (figure 1.5). A useful disk drive would flip no bits at all in its entire lifetime. If we expect to read and write a gigabyte per day for ten years, we require a bit error probability of the order of 10⁻¹⁵, or smaller. There are two approaches to this goal.

The physical solution

The physical solution is to improve the physical characteristics of the communication channel to reduce its error probability. We could improve our disk drive by

1. using more reliable components in its circuitry;
2. evacuating the air from the disk enclosure so as to eliminate the turbulence that perturbs the reading head from the track;
3. using a larger magnetic patch to represent each bit; or
4. using higher-power signals or cooling the circuitry in order to reduce thermal noise.

These physical modifications typically increase the cost of the communication channel.

The ‘system’ solution

Information theory and coding theory offer an alternative (and much more exciting) approach: we accept the given noisy channel as it is and add communication systems to it so that we can detect and correct the errors introduced by the channel. As shown in figure 1.6, we add an encoder before the channel and a decoder after it. The encoder encodes the source message s into a transmitted message t, adding redundancy to the original message in some way. The channel adds noise to the transmitted message, yielding a received message r. The decoder uses the known redundancy introduced by the encoding system to infer both the original signal s and the added noise.

Figure 1.4. The binary symmetric channel. The transmitted symbol is x and the received symbol y. The noise level, the probability that a bit is flipped, is f.

Figure 1.5. A binary data sequence of length 10 000 transmitted over a binary symmetric channel with noise level f = 0.1. [Dilbert image © 1997 United Feature Syndicate, Inc., used with permission.]


Figure 1.6. The ‘system’ solution for achieving reliable communication over a noisy channel. The encoding system introduces systematic redundancy into the transmitted vector t. The decoding system uses this known redundancy to deduce from the received vector r both the original source vector and the noise introduced by the channel.


Whereas physical solutions give incremental channel improvements only at an ever-increasing cost, system solutions can turn noisy channels into reliable communication channels with the only cost being a computational requirement at the encoder and decoder. Information theory is concerned with the theoretical limitations and potentials of such systems. ‘What is the best error-correcting performance we could achieve?’ Coding theory is concerned with the creation of practical encoding and decoding systems.

1.2 Error-correcting codes for the binary symmetric channel

We now consider examples of encoding and decoding systems. What is the simplest way to add useful redundancy to a transmission? [To make the rules of the game clear: we want to be able to detect and correct errors; and retransmission is not an option. We get only one chance to encode, transmit, and decode.]

Repetition codes

A straightforward idea is to repeat every bit of the message a prearranged number of times – for example, three times, as shown in table 1.7. We call this repetition code ‘R3’.

    Source sequence s    Transmitted sequence t
           0                      000
           1                      111

    Table 1.7. The repetition code R3.

Imagine that we transmit the source message

    s = 0 0 1 0 1 1 0

over a binary symmetric channel with noise level f = 0.1 using this repetition code. We can describe the channel as ‘adding’ a sparse noise vector n to the transmitted vector – adding in modulo 2 arithmetic, i.e., the binary algebra in which 1+1=0. A possible noise vector n and received vector r = t + n are shown in figure 1.8.

    s     0     0     1     0     1     1     0
    t    000   000   111   000   111   111   000
    n    000   001   000   000   101   000   000
    r    000   001   111   000   010   111   000

    Figure 1.8. An example transmission using R3.

How should we decode this received vector? The optimal algorithm looks at the received bits three at a time and takes a majority vote (algorithm 1.9).


    Received sequence r    Likelihood ratio P(r | s=1)/P(r | s=0)    Decoded sequence ŝ
           000                          γ^−3                                 0
           001                          γ^−1                                 0
           010                          γ^−1                                 0
           100                          γ^−1                                 0
           101                          γ^1                                  1
           110                          γ^1                                  1
           011                          γ^1                                  1
           111                          γ^3                                  1

At the risk of explaining the obvious, let’s prove this result. The optimal decoding decision (optimal in the sense of having the smallest probability of being wrong) is to find which value of s is most probable, given r. Consider the decoding of a single bit s, which was encoded as t(s) and gave rise to three received bits r = r1 r2 r3. By Bayes’ theorem, the posterior probability of s is

\[ P(s \mid r_1 r_2 r_3) = \frac{P(r_1 r_2 r_3 \mid s)\, P(s)}{P(r_1 r_2 r_3)}. \]   (1.18)

We can spell out the posterior probability of the two alternatives thus:

\[ P(s=1 \mid r_1 r_2 r_3) = \frac{P(r_1 r_2 r_3 \mid s=1)\, P(s=1)}{P(r_1 r_2 r_3)}; \]   (1.19)

\[ P(s=0 \mid r_1 r_2 r_3) = \frac{P(r_1 r_2 r_3 \mid s=0)\, P(s=0)}{P(r_1 r_2 r_3)}. \]   (1.20)

This posterior probability is determined by two factors: the prior probability P(s), and the data-dependent term P(r1 r2 r3 | s), which is called the likelihood of s. The normalizing constant P(r1 r2 r3) needn’t be computed when finding the optimal decoding decision, which is to guess ŝ = 0 if P(s = 0 | r) > P(s = 1 | r), and ŝ = 1 otherwise.

To find P(s = 0 | r) and P(s = 1 | r), we must make an assumption about the prior probabilities of the two hypotheses s = 0 and s = 1, and we must make an assumption about the probability of r given s. We assume that the prior probabilities are equal: P(s = 0) = P(s = 1) = 0.5; then maximizing the posterior probability P(s | r) is equivalent to maximizing the likelihood P(r | s). And we assume that the channel is a binary symmetric channel with noise level f < 0.5, so that the likelihood is

\[ P(r \mid s) = P(r \mid t(s)) = \prod_{n=1}^{N} P(r_n \mid t_n(s)), \]   (1.21)

where N = 3 is the number of transmitted bits in the block we are considering, and

\[ P(r_n \mid t_n) = \begin{cases} (1-f) & \text{if } r_n = t_n \\ f & \text{if } r_n \neq t_n. \end{cases} \]   (1.22)

Thus the likelihood ratio for the two hypotheses is

\[ \frac{P(r \mid s=1)}{P(r \mid s=0)} = \prod_{n=1}^{N} \frac{P(r_n \mid t_n(1))}{P(r_n \mid t_n(0))}; \]   (1.23)

each factor P(r_n | t_n(1))/P(r_n | t_n(0)) equals (1−f)/f if r_n = 1 and f/(1−f) if r_n = 0. The ratio γ ≡ (1−f)/f is greater than 1, since f < 0.5, so the winning hypothesis is the one with the most ‘votes’, each vote counting for a factor of γ in the likelihood ratio.

Algorithm 1.9. Majority-vote decoding algorithm for R3 . Also shown are the likelihood ratios (1.23), assuming the channel is a binary symmetric channel; γ ≡ (1 − f )/f .


Thus the majority-vote decoder shown in algorithm 1.9 is the optimal decoder if we assume that the channel is a binary symmetric channel and that the two possible source messages 0 and 1 have equal prior probability.
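The encoder and the majority-vote decoder for R3 fit in a few lines. This is a sketch of my own (the function names are not from the book); it estimates the decoded bit-error rate by simulation:

```python
import random

def encode_R3(source_bits):
    """Repeat each source bit three times (table 1.7)."""
    return [b for b in source_bits for _ in range(3)]

def decode_R3(received_bits):
    """Majority vote within each received triplet (algorithm 1.9)."""
    triplets = zip(received_bits[0::3], received_bits[1::3], received_bits[2::3])
    return [1 if sum(t) >= 2 else 0 for t in triplets]

def bsc(bits, f):
    """Binary symmetric channel with noise level f."""
    return [b ^ (random.random() < f) for b in bits]

random.seed(1)
s = [random.randint(0, 1) for _ in range(100_000)]
s_hat = decode_R3(bsc(encode_R3(s), f=0.1))
bit_errors = sum(a != b for a, b in zip(s, s_hat))
print(bit_errors / len(s))    # about 0.03, i.e. roughly 3 f^2 (1 - f) + f^3
```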

We now apply the majority vote decoder to the received vector of figure 1.8. The first three received bits are all 0, so we decode this triplet as a 0. In the second triplet of figure 1.8, there are two 0s and one 1, so we decode this triplet as a 0 – which in this case corrects the error. Not all errors are corrected, however. If we are unlucky and two errors fall in a single block, as in the fifth triplet of figure 1.8, then the decoding rule gets the wrong answer, as shown in figure 1.10.

    s     0     0     1     0     1     1     0
    t    000   000   111   000   111   111   000
    n    000   001   000   000   101   000   000
    r    000   001   111   000   010   111   000
    ŝ     0     0     1     0     0     1     0
              (corrected      (undetected
               error)          error)

    Figure 1.10. Decoding the received vector from figure 1.8.

Exercise 1.2. [2, p.16] Show that the error probability is reduced by the use of R3 by computing the error probability of this code for a binary symmetric channel with noise level f.

The exercise’s rating, e.g. ‘[2]’, indicates its difficulty: ‘1’ exercises are the easiest. Exercises that are accompanied by a marginal rat are especially recommended. If a solution or partial solution is provided, the page is indicated after the difficulty rating; for example, this exercise’s solution is on page 16.

The error probability is dominated by the probability that two bits in a block of three are flipped, which scales as f². In the case of the binary symmetric channel with f = 0.1, the R3 code has a probability of error, after decoding, of pb ≃ 0.03 per bit. Figure 1.11 shows the result of transmitting a binary image over a binary symmetric channel using the repetition code.

    s → encoder → t → channel (f = 10%) → r → decoder → ŝ

    Figure 1.11. Transmitting 10 000 source bits over a binary symmetric channel with f = 10% using a repetition code and the majority vote decoding algorithm. The probability of decoded bit error has fallen to about 3%; the rate has fallen to 1/3.


Figure 1.12. Error probability pb versus rate for repetition codes (R1, R3, R5, ..., R61) over a binary symmetric channel with f = 0.1. The right-hand figure shows pb on a logarithmic scale. We would like the rate to be large and pb to be small.

The repetition code R3 has therefore reduced the probability of error, as desired. Yet we have lost something: our rate of information transfer has fallen by a factor of three. So if we use a repetition code to communicate data over a telephone line, it will reduce the error frequency, but it will also reduce our communication rate. We will have to pay three times as much for each phone call. Similarly, we would need three of the original noisy gigabyte disk drives in order to create a one-gigabyte disk drive with pb = 0.03.

Can we push the error probability lower, to the values required for a sellable disk drive – 10⁻¹⁵? We could achieve lower error probabilities by using repetition codes with more repetitions.

Exercise 1.3. [3, p.16]
(a) Show that the probability of error of RN, the repetition code with N repetitions, is

\[ p_b = \sum_{n=(N+1)/2}^{N} \binom{N}{n} f^n (1-f)^{N-n}, \]   (1.24)

for odd N.
(b) Assuming f = 0.1, which of the terms in this sum is the biggest? How much bigger is it than the second-biggest term?
(c) Use Stirling’s approximation (p.2) to approximate the \( \binom{N}{n} \) in the largest term, and find, approximately, the probability of error of the repetition code with N repetitions.
(d) Assuming f = 0.1, find how many repetitions are required to get the probability of error down to 10⁻¹⁵. [Answer: about 60.]

So to build a single gigabyte disk drive with the required reliability from noisy gigabyte drives with f = 0.1, we would need sixty of the noisy disk drives. The tradeoff between error probability and rate for repetition codes is shown in figure 1.12.
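Equation (1.24) is easy to evaluate numerically; the following sketch (my own, not a worked solution from the book) tabulates pb for a few odd N at f = 0.1, which is one way to get a feel for part (d):

```python
from math import comb

def p_error_repetition(N, f):
    """Decoded bit-error probability of R_N, equation (1.24), for odd N."""
    return sum(comb(N, n) * f**n * (1 - f)**(N - n)
               for n in range((N + 1) // 2, N + 1))

f = 0.1
for N in (1, 3, 5, 9, 15, 31, 61):
    print(N, p_error_repetition(N, f))
# p_b shrinks by a roughly constant factor for each extra pair of repetitions,
# reaching the 1e-15 ballpark around N of order sixty, consistent with figure 1.12.
```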

Block codes – the (7, 4) Hamming code

We would like to communicate with tiny probability of error and at a substantial rate. Can we improve on repetition codes? What if we add redundancy to blocks of data instead of encoding one bit at a time? We now study a simple block code.



A block code is a rule for converting a sequence of source bits s, of length K, say, into a transmitted sequence t of length N bits. To add redundancy, we make N greater than K. In a linear block code, the extra N − K bits are linear functions of the original K bits; these extra bits are called parity-check bits. An example of a linear block code is the (7, 4) Hamming code, which transmits N = 7 bits for every K = 4 source bits.

Figure 1.13. Pictorial representation of encoding for the (7, 4) Hamming code.

The encoding operation for the code is shown pictorially in figure 1.13. We arrange the seven transmitted bits in three intersecting circles. The first four transmitted bits, t1 t2 t3 t4, are set equal to the four source bits, s1 s2 s3 s4. The parity-check bits t5 t6 t7 are set so that the parity within each circle is even: the first parity-check bit is the parity of the first three source bits (that is, it is 0 if the sum of those bits is even, and 1 if the sum is odd); the second is the parity of the last three; and the third parity bit is the parity of source bits one, three and four.

As an example, figure 1.13b shows the transmitted codeword for the case s = 1000. Table 1.14 shows the codewords generated by each of the 2⁴ = sixteen settings of the four source bits. These codewords have the special property that any pair differ from each other in at least three bits.

      s       t           s       t           s       t           s       t
    0000   0000000      0100   0100110      1000   1000101      1100   1100011
    0001   0001011      0101   0101101      1001   1001110      1101   1101000
    0010   0010111      0110   0110001      1010   1010010      1110   1110100
    0011   0011100      0111   0111010      1011   1011001      1111   1111111

    Table 1.14. The sixteen codewords {t} of the (7, 4) Hamming code. Any pair of codewords differ from each other in at least three bits.

Because the Hamming code is a linear code, it can be written compactly in terms of matrices as follows. The transmitted codeword t is obtained from the source sequence s by a linear operation,

\[ t = G^{\mathsf{T}} s, \]   (1.25)

where G is the generator matrix of the code,

\[ G^{\mathsf{T}} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 1 \end{pmatrix}, \]   (1.26)

and the encoding operation (1.25) uses modulo-2 arithmetic (1 + 1 = 0, 0 + 1 = 1, etc.).

In the encoding operation (1.25) I have assumed that s and t are column vectors. If instead they are row vectors, then this equation is replaced by

\[ t = sG, \]   (1.27)

where

\[ G = \begin{pmatrix} 1 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 \end{pmatrix}. \]   (1.28)

I find it easier to relate to the right-multiplication (1.25) than the left-multiplication (1.27). Many coding theory texts use the left-multiplying conventions (1.27–1.28), however. The rows of the generator matrix (1.28) can be viewed as defining four basis vectors lying in a seven-dimensional binary space. The sixteen codewords are obtained by making all possible linear combinations of these vectors.
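The encoding operation (1.25)–(1.28) is just a modulo-2 matrix multiplication. Here is a minimal sketch of my own (not the book's software) that regenerates table 1.14 from the generator matrix:

```python
# Rows of the generator matrix G, equation (1.28); all arithmetic is modulo 2.
G = [
    [1, 0, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 0],
    [0, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1, 1],
]

def encode_hamming74(s):
    """t = sG (row-vector convention, equation (1.27)), modulo 2."""
    return [sum(s[k] * G[k][n] for k in range(4)) % 2 for n in range(7)]

# Reproduce table 1.14: sixteen codewords, any pair differing in at least three bits.
for i in range(16):
    s = [(i >> (3 - k)) & 1 for k in range(4)]
    t = encode_hamming74(s)
    print("".join(map(str, s)), "".join(map(str, t)))
```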

Decoding the (7, 4) Hamming code

When we invent a more complex encoder s → t, the task of decoding the received vector r becomes less straightforward. Remember that any of the bits may have been flipped, including the parity bits.

If we assume that the channel is a binary symmetric channel and that all source vectors are equiprobable, then the optimal decoder identifies the source vector s whose encoding t(s) differs from the received vector r in the fewest bits. [Refer to the likelihood function (1.23) to see why this is so.] We could solve the decoding problem by measuring how far r is from each of the sixteen codewords in table 1.14, then picking the closest. Is there a more efficient way of finding the most probable source vector?

Syndrome decoding for the Hamming code

For the (7, 4) Hamming code there is a pictorial solution to the decoding problem, based on the encoding picture, figure 1.13.

As a first example, let’s assume the transmission was t = 1000101 and the noise flips the second bit, so the received vector is r = 1000101 ⊕ 0100000 = 1100101. We write the received vector into the three circles as shown in figure 1.15a, and look at each of the three circles to see whether its parity is even. The circles whose parity is not even are shown by dashed lines in figure 1.15b. The decoding task is to find the smallest set of flipped bits that can account for these violations of the parity rules. [The pattern of violations of the parity checks is called the syndrome, and can be written as a binary vector – for example, in figure 1.15b, the syndrome is z = (1, 1, 0), because the first two circles are ‘unhappy’ (parity 1) and the third circle is ‘happy’ (parity 0).]

To solve the decoding task, we ask the question: can we find a unique bit that lies inside all the ‘unhappy’ circles and outside all the ‘happy’ circles? If so, the flipping of that bit would account for the observed syndrome. In the case shown in figure 1.15b, the bit r2 lies inside the two unhappy circles and outside the happy circle; no other single bit has this property, so r2 is the only single bit capable of explaining the syndrome.

Let’s work through a couple more examples. Figure 1.15c shows what happens if one of the parity bits, t5, is flipped by the noise. Just one of the checks is violated. Only r5 lies inside this unhappy circle and outside the other two happy circles, so r5 is identified as the only single bit capable of explaining the syndrome. If the central bit r3 is received flipped, figure 1.15d shows that all three checks are violated; only r3 lies inside all three circles, so r3 is identified as the suspect bit.


Figure 1.15. Pictorial representation of decoding of the Hamming (7, 4) code. The received vector is written into the diagram as shown in (a). In (b,c,d,e), the received vector is shown, assuming that the transmitted vector was as in figure 1.13b and the bits labelled by ⋆ were flipped. The violated parity checks are highlighted by dashed circles. One of the seven bits is the most probable suspect to account for each ‘syndrome’, i.e., each pattern of violated and satisfied parity checks. In examples (b), (c), and (d), the most probable suspect is the one bit that was flipped. In example (e), two bits have been flipped, s3 and t7. The most probable suspect is r2, marked by a circle in (e′), which shows the output of the decoding algorithm.

    Syndrome z        000    001    010    011    100    101    110    111
    Unflip this bit   none    r7     r6     r4     r5     r1     r2     r3

    Algorithm 1.16. Actions taken by the optimal decoder for the (7, 4) Hamming code, assuming a binary symmetric channel with small noise level f. The syndrome vector z lists whether each parity check is violated (1) or satisfied (0), going through the checks in the order of the bits r5, r6, and r7.

If you try flipping any one of the seven bits, you’ll find that a different syndrome is obtained in each case – seven non-zero syndromes, one for each bit. There is only one other syndrome, the all-zero syndrome. So if the channel is a binary symmetric channel with a small noise level f, the optimal decoder unflips at most one bit, depending on the syndrome, as shown in algorithm 1.16. Each syndrome could have been caused by other noise patterns too, but any other noise pattern that has the same syndrome must be less probable because it involves a larger number of noise events.

What happens if the noise actually flips more than one bit? Figure 1.15e shows the situation when two bits, r3 and r7, are received flipped. The syndrome, 110, makes us suspect the single bit r2; so our optimal decoding algorithm flips this bit, giving a decoded pattern with three errors as shown in figure 1.15e′. If we use the optimal decoding algorithm, any two-bit error pattern will lead to a decoded seven-bit vector that contains three errors.

General view of decoding for linear codes: syndrome decoding

We can also describe the decoding problem for a linear code in terms of matrices. The first four received bits, r1 r2 r3 r4, purport to be the four source bits; and the received bits r5 r6 r7 purport to be the parities of the source bits, as defined by the generator matrix G. We evaluate the three parity-check bits for the received bits, r1 r2 r3 r4, and see whether they match the three received bits, r5 r6 r7. The differences (modulo 2) between these two triplets are called the syndrome of the received vector. If the syndrome is zero – if all three parity checks are happy – then the received vector is a codeword, and the most probable decoding is given by reading out its first four bits. If the syndrome is non-zero, then the noise sequence for this block was non-zero, and the syndrome is our pointer to the most probable error pattern.

The computation of the syndrome vector is a linear operation. If we define the 3 × 4 matrix P such that the matrix of equation (1.26) is

\[ G^{\mathsf{T}} = \begin{pmatrix} I_4 \\ P \end{pmatrix}, \]   (1.29)

where I_4 is the 4 × 4 identity matrix, then the syndrome vector is z = Hr, where the parity-check matrix H is given by H = [ −P  I_3 ]; in modulo 2 arithmetic, −1 ≡ 1, so

\[ H = \begin{pmatrix} P & I_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{pmatrix}. \]   (1.30)

All the codewords t = G^T s of the code satisfy

\[ H t = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}. \]   (1.31)

Exercise 1.4. [1] Prove that this is so by evaluating the 3 × 4 matrix H G^T.

Since the received vector r is given by r = G^T s + n, the syndrome-decoding problem is to find the most probable noise vector n satisfying the equation

\[ H n = z. \]   (1.32)

A decoding algorithm that solves this problem is called a maximum-likelihood decoder. We will discuss decoding problems like this in later chapters.
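The syndrome decoder of algorithm 1.16 is equally short in code. The sketch below (my own illustration, with hypothetical function names) computes z = Hr modulo 2 and unflips the single bit that a non-zero syndrome points to:

```python
# Parity-check matrix H, equation (1.30); bits r1..r7 are stored at indices 0..6.
H = [
    [1, 1, 1, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 0],
    [1, 0, 1, 1, 0, 0, 1],
]

def syndrome(r):
    """z = Hr modulo 2: which parity checks are violated (1) or satisfied (0)."""
    return tuple(sum(h * x for h, x in zip(row, r)) % 2 for row in H)

# Each single-bit error pattern has a distinct non-zero syndrome, namely the
# corresponding column of H (algorithm 1.16).
UNFLIP = {tuple(row[n] for row in H): n for n in range(7)}

def decode_hamming74(r):
    """Unflip the suspect bit, if any, then read out the first four bits."""
    z = syndrome(r)
    r = list(r)
    if z in UNFLIP:
        r[UNFLIP[z]] ^= 1
    return r[:4]

# The worked example from the text: t = 1000101 with its second bit flipped.
print(decode_hamming74([1, 1, 0, 0, 1, 0, 1]))   # [1, 0, 0, 0]: r2 is corrected
```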

Summary of the (7, 4) Hamming code’s properties

Every possible received vector of length 7 bits is either a codeword, or it’s one flip away from a codeword.

Since there are three parity constraints, each of which might or might not be violated, there are 2 × 2 × 2 = 8 distinct syndromes. They can be divided into seven non-zero syndromes – one for each of the one-bit error patterns – and the all-zero syndrome, corresponding to the zero-noise case.

The optimal decoder takes no action if the syndrome is zero, otherwise it uses this mapping of non-zero syndromes onto one-bit error patterns to unflip the suspect bit.

Figure 1.17. Transmitting 10 000 source bits over a binary symmetric channel with f = 10% using a (7, 4) Hamming code. The probability of decoded bit error is about 7%.


There is a decoding error if the four decoded bits ŝ1, ŝ2, ŝ3, ŝ4 do not all match the source bits s1, s2, s3, s4. The probability of block error pB is the probability that one or more of the decoded bits in one block fail to match the corresponding source bits,

\[ p_B = P(\hat{s} \neq s). \]   (1.33)

The probability of bit error pb is the average probability that a decoded bit fails to match the corresponding source bit,

\[ p_b = \frac{1}{K} \sum_{k=1}^{K} P(\hat{s}_k \neq s_k). \]   (1.34)

In the case of the Hamming code, a decoding error will occur whenever the noise has flipped more than one bit in a block of seven. The probability of block error is thus the probability that two or more bits are flipped in a block. This probability scales as O(f²), as did the probability of error for the repetition code R3. But notice that the Hamming code communicates at a greater rate, R = 4/7.

Figure 1.17 shows a binary image transmitted over a binary symmetric channel using the (7, 4) Hamming code. About 7% of the decoded bits are in error. Notice that the errors are correlated: often two or three successive decoded bits are flipped.

Exercise 1.5. [1] This exercise and the next three refer to the (7, 4) Hamming code. Decode the received strings:
(a) r = 1101011
(b) r = 0110110
(c) r = 0100111
(d) r = 1111111.

Exercise 1.6. [2, p.17] (a) Calculate the probability of block error pB of the (7, 4) Hamming code as a function of the noise level f and show that to leading order it goes as 21f².
(b) [3] Show that to leading order the probability of bit error pb goes as 9f².

Exercise 1.7. [2, p.19] Find some noise vectors that give the all-zero syndrome (that is, noise vectors that leave all the parity checks unviolated). How many such noise vectors are there?

Exercise 1.8. [2] I asserted above that a block decoding error will result whenever two or more bits are flipped in a single block. Show that this is indeed so. [In principle, there might be error patterns that, after decoding, led only to the corruption of the parity bits, with no source bits incorrectly decoded.]
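Because the code is linear and the channel symmetric, the error probability does not depend on which codeword was transmitted, so pB and pb can be computed exactly by summing over the 2⁷ noise patterns applied to the all-zero codeword. This sketch is my own numerical cross-check of the figures quoted above (it is not the analytic derivation that exercise 1.6 asks for); it reuses the parity-check matrix H from the earlier sketch:

```python
from itertools import product

H = [[1, 1, 1, 0, 1, 0, 0],
     [0, 1, 1, 1, 0, 1, 0],
     [1, 0, 1, 1, 0, 0, 1]]
UNFLIP = {tuple(row[n] for row in H): n for n in range(7)}

def decode(r):
    """Syndrome decoding as in algorithm 1.16."""
    z = tuple(sum(h * x for h, x in zip(row, r)) % 2 for row in H)
    r = list(r)
    if z in UNFLIP:
        r[UNFLIP[z]] ^= 1
    return r[:4]

def error_probabilities(f):
    """Exact p_B (1.33) and p_b (1.34), averaging over all 2^7 noise patterns."""
    pB = pb = 0.0
    for n in product([0, 1], repeat=7):
        prob = 1.0
        for bit in n:
            prob *= f if bit else 1 - f
        # All-zero codeword sent, so the received vector equals the noise vector.
        wrong = sum(decode(list(n)))   # number of decoded source bits in error
        pB += prob * (wrong > 0)
        pb += prob * wrong / 4
    return pB, pb

print(error_probabilities(0.1))   # roughly (0.15, 0.07): compare '7% of the decoded bits'
```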

Summary of codes' performances

Figure 1.18 shows the performance of repetition codes and the Hamming code. It also shows the performance of a family of linear block codes that are generalizations of Hamming codes, called BCH codes. This figure shows that we can, using linear block codes, achieve better performance than repetition codes; but the asymptotic situation still looks grim.


Figure 1.18. Error probability pb versus rate R for repetition codes, the (7, 4) Hamming code and BCH codes with blocklengths up to 1023 over a binary symmetric channel with f = 0.1. The righthand figure shows pb on a logarithmic scale.


Exercise 1.9.[4, p.19] Design an error-correcting code and a decoding algorithm for it, estimate its probability of error, and add it to figure 1.18. [Don't worry if you find it difficult to make a code better than the Hamming code, or if you find it difficult to find a good decoder for your code; that's the point of this exercise.]

Exercise 1.10.[3, p.20] A (7, 4) Hamming code can correct any one error; might there be a (14, 8) code that can correct any two errors? Optional extra: Does the answer to this question depend on whether the code is linear or nonlinear?

Exercise 1.11.[4, p.21] Design an error-correcting code, other than a repetition code, that can correct any two errors in a block of size N.

1.3 What performance can the best codes achieve?

There seems to be a trade-off between the decoded bit-error probability p_b (which we would like to reduce) and the rate R (which we would like to keep large). How can this trade-off be characterized? What points in the (R, p_b) plane are achievable? This question was addressed by Claude Shannon in his pioneering paper of 1948, in which he both created the field of information theory and solved most of its fundamental problems.

At that time there was a widespread belief that the boundary between achievable and nonachievable points in the (R, p_b) plane was a curve passing through the origin (R, p_b) = (0, 0); if this were so, then, in order to achieve a vanishingly small error probability p_b, one would have to reduce the rate correspondingly close to zero. 'No pain, no gain.'

However, Shannon proved the remarkable result that the boundary between achievable and nonachievable points meets the R axis at a non-zero value R = C, as shown in figure 1.19. For any channel, there exist codes that make it possible to communicate with arbitrarily small probability of error p_b at non-zero rates. The first half of this book (Parts I–III) will be devoted to understanding this remarkable result, which is called the noisy-channel coding theorem.

Example: f = 0.1

The maximum rate at which communication is possible with arbitrarily small p_b is called the capacity of the channel. The formula for the capacity of a




binary symmetric channel with noise level f is

    C(f) = 1 - H_2(f) = 1 - \left[ f \log_2 \frac{1}{f} + (1 - f) \log_2 \frac{1}{1 - f} \right];    (1.35)

the channel we were discussing earlier with noise level f = 0.1 has capacity C ≈ 0.53. Let us consider what this means in terms of noisy disk drives. The repetition code R_3 could communicate over this channel with p_b = 0.03 at a rate R = 1/3. Thus we know how to build a single gigabyte disk drive with p_b = 0.03 from three noisy gigabyte disk drives. We also know how to make a single gigabyte disk drive with p_b ≈ 10^{-15} from sixty noisy one-gigabyte drives (exercise 1.3, p.8). And now Shannon passes by, notices us juggling with disk drives and codes and says:

'What performance are you trying to achieve? 10^{-15}? You don't need sixty disk drives – you can get that performance with just two disk drives (since 1/2 is less than 0.53). And if you want p_b = 10^{-18} or 10^{-24} or anything, you can get there with two disk drives too!'

[Strictly, the above statements might not be quite right, since, as we shall see, Shannon proved his noisy-channel coding theorem by studying sequences of block codes with ever-increasing blocklengths, and the required blocklength might be bigger than a gigabyte (the size of our disk drive), in which case, Shannon might say 'well, you can't do it with those tiny disk drives, but if you had two noisy terabyte drives, you could make a single high-quality terabyte drive from them'.]
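The numbers quoted here follow directly from (1.35); a few lines of Python (a sketch, not part of the book) confirm them:

from math import log2

def H2(p):
    """Binary entropy function H_2(p) in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def capacity_bsc(f):
    """Capacity C(f) = 1 - H_2(f) of a binary symmetric channel, equation (1.35)."""
    return 1 - H2(f)

print(capacity_bsc(0.1))   # approximately 0.53, so rate 1/2 (two disk drives) is below capacity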

1.4 Summary

The (7, 4) Hamming Code

By including three parity-check bits in a block of 7 bits it is possible to detect and correct any single bit error in each block.

Shannon's noisy-channel coding theorem

Information can be communicated over a noisy channel at a non-zero rate with arbitrarily small error probability.

Figure 1.19. Shannon’s noisy-channel coding theorem. The solid curve shows the Shannon limit on achievable values of (R, pb ) for the binary symmetric channel with f = 0.1. Rates up to R = C are achievable with arbitrarily small pb . The points show the performance of some textbook codes, as in figure 1.18. The equation defining the Shannon limit (the solid curve) is R = C/(1 − H2 (pb )), where C and H2 are defined in equation (1.35).


Information theory addresses both the limitations and the possibilities of communication. The noisy-channel coding theorem, which we will prove in Chapter 10, asserts both that reliable communication at any rate beyond the capacity is impossible, and that reliable communication at all rates up to capacity is possible. The next few chapters lay the foundations for this result by discussing how to measure information content and the intimately related topic of data compression.

1.5 Further exercises

Exercise 1.12.[2, p.21] Consider the repetition code R_9. One way of viewing this code is as a concatenation of R_3 with R_3. We first encode the source stream with R_3, then encode the resulting output with R_3. We could call this code 'R_3^2'. This idea motivates an alternative decoding algorithm, in which we decode the bits three at a time using the decoder for R_3; then decode the decoded bits from that first decoder using the decoder for R_3. Evaluate the probability of error for this decoder and compare it with the probability of error for the optimal decoder for R_9. Do the concatenated encoder and decoder for R_3^2 have advantages over those for R_9?

1.6 Solutions

Solution to exercise 1.2 (p.7). An error is made by R_3 if two or more bits are flipped in a block of three. So the error probability of R_3 is a sum of two terms: the probability that all three bits are flipped, f^3; and the probability that exactly two bits are flipped, 3f^2(1 - f). [If these expressions are not obvious, see example 1.1 (p.1): the expressions are P(r = 3 | f, N = 3) and P(r = 2 | f, N = 3).]

    p_b = p_B = 3f^2(1 - f) + f^3 = 3f^2 - 2f^3.    (1.36)

This probability is dominated for small f by the term 3f^2. See exercise 2.38 (p.39) for further discussion of this problem.

Solution to exercise 1.3 (p.8). The probability of error for the repetition code R_N is dominated by the probability that ⌈N/2⌉ bits are flipped, which goes (for odd N) as

    \binom{N}{\lceil N/2 \rceil} f^{(N+1)/2} (1 - f)^{(N-1)/2}.    (1.37)

The term \binom{N}{K} can be approximated using the binary entropy function:

    \frac{1}{N+1} 2^{N H_2(K/N)} ≤ \binom{N}{K} ≤ 2^{N H_2(K/N)}  ⇒  \binom{N}{K} ≃ 2^{N H_2(K/N)},    (1.38)

where this approximation introduces an error of order √N – as shown in equation (1.17). So

    p_b = p_B ≃ 2^N (f(1 - f))^{N/2} = (4f(1 - f))^{N/2}.    (1.39)

Setting this equal to the required value of 10^{-15} we find N ≃ 2 log 10^{-15} / log 4f(1 - f) = 68. This answer is a little out because the approximation we used overestimated \binom{N}{K} and we did not distinguish between ⌈N/2⌉ and N/2.

[Notation: ⌈N/2⌉ denotes the smallest integer greater than or equal to N/2.]


A slightly more careful answer (short of explicit computation) goes as follows. Taking the approximation for \binom{N}{K} to the next order, we find:

    \binom{N}{N/2} ≃ 2^N \frac{1}{\sqrt{2\pi N/4}}.    (1.40)

This approximation can be proved from an accurate version of Stirling's approximation (1.12), or by considering the binomial distribution with p = 1/2 and noting

    1 = \sum_K \binom{N}{K} 2^{-N} ≃ 2^{-N} \binom{N}{N/2} \sum_{r=-N/2}^{N/2} e^{-r^2/2\sigma^2} ≃ 2^{-N} \binom{N}{N/2} \sqrt{2\pi}\,\sigma,    (1.41)

where σ = \sqrt{N/4}, from which equation (1.40) follows. The distinction between ⌈N/2⌉ and N/2 is not important in this term since \binom{N}{K} has a maximum at K = N/2.

Then the probability of error (for odd N) is to leading order

    p_b ≃ \binom{N}{(N+1)/2} f^{(N+1)/2} (1 - f)^{(N-1)/2}    (1.42)
        ≃ 2^N \frac{1}{\sqrt{\pi N/2}}\, f\, [f(1 - f)]^{(N-1)/2} ≃ \frac{1}{\sqrt{\pi N/8}}\, f\, [4f(1 - f)]^{(N-1)/2}.    (1.43)

The equation p_b = 10^{-15} can be written

    (N - 1)/2 ≃ \frac{\log 10^{-15} + \log \frac{\sqrt{\pi N/8}}{f}}{\log 4f(1 - f)},    (1.44)

which may be solved for N iteratively, the first iteration starting from \hat{N}_1 = 68:

    (\hat{N}_2 - 1)/2 ≃ \frac{-15 + 1.7}{-0.44} = 29.9  ⇒  \hat{N}_2 ≃ 60.9.    (1.45)

[In equation (1.44), the logarithms can be taken to any base, as long as it's the same base throughout. In equation (1.45), I use base 10.]

This answer is found to be stable, so N ≃ 61 is the blocklength at which p_b ≃ 10^{-15}.

Solution to exercise 1.6 (p.13). (a) The probability of block error of the Hamming code is a sum of six terms – the probabilities that 2, 3, 4, 5, 6, or 7 errors occur in one block:

    p_B = \sum_{r=2}^{7} \binom{7}{r} f^r (1 - f)^{7-r}.    (1.46)

To leading order, this goes as

    p_B ≃ \binom{7}{2} f^2 = 21 f^2.    (1.47)

(b) The probability of bit error of the Hamming code is smaller than the probability of block error because a block error rarely corrupts all bits in the decoded block. The leading-order behaviour is found by considering the outcome in the most probable case where the noise vector has weight two. The decoder will erroneously flip a third bit, so that the modified received vector (of length 7) differs in three bits from the transmitted vector. That means, if we average over all seven bits, the probability that a randomly chosen bit is flipped is 3/7 times the block error probability, to leading order. Now, what we really care about is the probability that a source bit is flipped. Are parity bits or source bits more likely to be among these three flipped bits, or are all seven bits equally likely to be corrupted when the noise vector has weight two?

The Hamming code is in fact completely symmetric in the protection it affords to the seven bits (assuming a binary symmetric channel). [This symmetry can be proved by showing that the role of a parity bit can be exchanged with a source bit and the resulting code is still a (7, 4) Hamming code; see below.] The probability that any one bit ends up corrupted is the same for all seven bits. So the probability of bit error (for the source bits) is simply three sevenths of the probability of block error:

    p_b ≃ \frac{3}{7} p_B ≃ 9 f^2.    (1.48)
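Both leading-order results can be checked numerically. The sketch below (Python, mine rather than the book's) enumerates all 2^7 noise vectors, decodes each with the syndrome decoder, and accumulates the exact block and bit error probabilities for a small noise level; the totals land close to 21f^2 and 9f^2.

import itertools
import numpy as np

H = np.array([[1, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0, 1, 0],
              [1, 0, 1, 1, 0, 0, 1]])            # parity-check matrix (equation 1.49)

def decode(r):
    """Syndrome decoding: unflip the single bit whose column of H matches the syndrome."""
    r = r.copy()
    z = H @ r % 2
    if z.any():
        for i in range(7):
            if np.array_equal(H[:, i], z):
                r[i] ^= 1
                break
    return r

f = 0.01
pB = pb = 0.0
# By linearity and channel symmetry we may assume the all-zero codeword was sent,
# and sum over all 2^7 noise vectors weighted by their probabilities.
for bits in itertools.product([0, 1], repeat=7):
    n = np.array(bits)
    w = int(n.sum())
    prob = f**w * (1 - f)**(7 - w)
    s_hat = decode(n)[:4]                        # decoded source bits (true value: 0000)
    if s_hat.any():
        pB += prob
        pb += prob * s_hat.sum() / 4
print(pB, 21 * f**2)    # block error: close to 21 f^2 for small f
print(pb, 9 * f**2)     # bit error:   close to 9 f^2  for small f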

Symmetry of the Hamming (7, 4) code

To prove that the (7, 4) code protects all bits equally, we start from the parity-check matrix

    H = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 1 & 0 & 1 & 1 & 0 & 0 & 1 \end{pmatrix}.    (1.49)

The symmetry among the seven transmitted bits will be easiest to see if we reorder the seven bits using the permutation (t_1 t_2 t_3 t_4 t_5 t_6 t_7) → (t_5 t_2 t_3 t_4 t_1 t_6 t_7). Then we can rewrite H thus:

    H = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 \end{pmatrix}.    (1.50)

Now, if we take any two parity constraints that t satisfies and add them together, we get another parity constraint. For example, row 1 asserts t_5 + t_2 + t_3 + t_1 = even, and row 2 asserts t_2 + t_3 + t_4 + t_6 = even, and the sum of these two constraints is

    t_5 + 2t_2 + 2t_3 + t_1 + t_4 + t_6 = even;    (1.51)

we can drop the terms 2t_2 and 2t_3, since they are even whatever t_2 and t_3 are; thus we have derived the parity constraint t_5 + t_1 + t_4 + t_6 = even, which we can if we wish add into the parity-check matrix as a fourth row. [The set of vectors satisfying Ht = 0 will not be changed.] We thus define

    H' = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 1 & 0 \end{pmatrix}.    (1.52)

The fourth row is the sum (modulo two) of the top two rows. Notice that the second, third, and fourth rows are all cyclic shifts of the top row. If, having added the fourth redundant constraint, we drop the first constraint, we obtain a new parity-check matrix H'',

    H'' = \begin{pmatrix} 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 1 & 0 \end{pmatrix},    (1.53)

which still satisfies H''t = 0 for all codewords, and which looks just like the starting H in (1.50), except that all the columns have shifted along one


to the right, and the rightmost column has reappeared at the left (a cyclic permutation of the columns). This establishes the symmetry among the seven bits. Iterating the above procedure five more times, we can make a total of seven different H matrices for the same original code, each of which assigns each bit to a different role.

We may also construct the super-redundant seven-row parity-check matrix for the code,

    H''' = \begin{pmatrix} 1 & 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 1 & 0 \\ 0 & 1 & 0 & 0 & 1 & 1 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 \end{pmatrix}.    (1.54)

This matrix is 'redundant' in the sense that the space spanned by its rows is only three-dimensional, not seven. This matrix is also a cyclic matrix. Every row is a cyclic permutation of the top row.

Cyclic codes: if there is an ordering of the bits t_1 . . . t_N such that a linear code has a cyclic parity-check matrix, then the code is called a cyclic code. The codewords of such a code also have cyclic properties: any cyclic permutation of a codeword is a codeword. For example, the Hamming (7, 4) code, with its bits ordered as above, consists of all seven cyclic shifts of the codewords 1110100 and 1011000, and the codewords 0000000 and 1111111. Cyclic codes are a cornerstone of the algebraic approach to error-correcting codes. We won't use them again in this book, however, as they have been superseded by sparse-graph codes (Part VI).

Solution to exercise 1.7 (p.13). There are fifteen non-zero noise vectors which give the all-zero syndrome; these are precisely the fifteen non-zero codewords of the Hamming code. Notice that because the Hamming code is linear, the sum of any two codewords is a codeword.
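Both claims are quick to verify by exhaustive search. A small Python sketch (not from the book), using the cyclically ordered parity-check matrix of equation (1.50):

import itertools
import numpy as np

# Parity-check matrix with the cyclic bit ordering of equation (1.50).
H = np.array([[1, 1, 1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0, 1, 0],
              [0, 0, 1, 1, 1, 0, 1]])

# Noise vectors giving the all-zero syndrome are exactly the codewords.
codewords = [n for n in itertools.product([0, 1], repeat=7)
             if not (H @ np.array(n) % 2).any()]
print(len(codewords) - 1)   # 15 non-zero vectors with zero syndrome, as claimed

# Cyclic property: any rotation of a codeword is again a codeword.
cw = set(codewords)
print(all(tuple(np.roll(np.array(c), 1)) in cw for c in codewords))   # True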

Graphs corresponding to codes

Solution to exercise 1.9 (p.14). When answering this question, you will probably find that it is easier to invent new codes than to find optimal decoders for them. There are many ways to design codes, and what follows is just one possible train of thought. We make a linear block code that is similar to the (7, 4) Hamming code, but bigger.

Many codes can be conveniently expressed in terms of graphs. In figure 1.13, we introduced a pictorial representation of the (7, 4) Hamming code. If we replace that figure's big circles, each of which shows that the parity of four particular bits is even, by a 'parity-check node' that is connected to the four bits, then we obtain the representation of the (7, 4) Hamming code by a bipartite graph as shown in figure 1.20. The 7 circles are the 7 transmitted bits. The 3 squares are the parity-check nodes (not to be confused with the 3 parity-check bits, which are the three most peripheral circles). The graph is a 'bipartite' graph because its nodes fall into two classes – bits and checks

Figure 1.20. The graph of the (7, 4) Hamming code. The 7 circles are the bit nodes and the 3 squares are the parity-check nodes.


– and there are edges only between nodes in different classes. The graph and the code's parity-check matrix (1.30) are simply related to each other: each parity-check node corresponds to a row of H and each bit node corresponds to a column of H; for every 1 in H, there is an edge between the corresponding pair of nodes.

Having noticed this connection between linear codes and graphs, one way to invent linear codes is simply to think of a bipartite graph. For example, a pretty bipartite graph can be obtained from a dodecahedron by calling the vertices of the dodecahedron the parity-check nodes, and putting a transmitted bit on each edge in the dodecahedron. This construction defines a parity-check matrix in which every column has weight 2 and every row has weight 3. [The weight of a binary vector is the number of 1s it contains.]

This code has N = 30 bits, and it appears to have M_apparent = 20 parity-check constraints. Actually, there are only M = 19 independent constraints; the 20th constraint is redundant (that is, if 19 constraints are satisfied, then the 20th is automatically satisfied); so the number of source bits is K = N - M = 11. The code is a (30, 11) code.

It is hard to find a decoding algorithm for this code, but we can estimate its probability of error by finding its lowest-weight codewords. If we flip all the bits surrounding one face of the original dodecahedron, then all the parity checks will be satisfied; so the code has 12 codewords of weight 5, one for each face. Since the lowest-weight codewords have weight 5, we say that the code has distance d = 5; the (7, 4) Hamming code had distance 3 and could correct all single bit-flip errors. A code with distance 5 can correct all double bit-flip errors, but there are some triple bit-flip errors that it cannot correct. So the error probability of this code, assuming a binary symmetric channel, will be dominated, at least for low noise levels f, by a term of order f^3, perhaps something like

    12 \binom{5}{3} f^3 (1 - f)^{27}.    (1.55)

Of course, there is no obligation to make codes whose graphs can be represented on a plane, as this one can; the best linear codes, which have simple graphical descriptions, have graphs that are more tangled, as illustrated by the tiny (16, 4) code of figure 1.22. Furthermore, there is no reason for sticking to linear codes; indeed some nonlinear codes – codes whose codewords cannot be defined by a linear equation like Ht = 0 – have very good properties. But the encoding and decoding of a nonlinear code are even trickier tasks.

Solution to exercise 1.10 (p.14). First let's assume we are making a linear code and decoding it with syndrome decoding. If there are N transmitted bits, then the number of possible error patterns of weight up to two is

    \binom{N}{0} + \binom{N}{1} + \binom{N}{2}.    (1.56)

For N = 14, that's 91 + 14 + 1 = 106 patterns. Now, every distinguishable error pattern must give rise to a distinct syndrome; and the syndrome is a list of M bits, so the maximum possible number of syndromes is 2^M. For a (14, 8) code, M = 6, so there are at most 2^6 = 64 syndromes. The number of possible error patterns of weight up to two, 106, is bigger than the number of syndromes, 64, so we can immediately rule out the possibility that there is a (14, 8) code that is 2-error-correcting.
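The arithmetic of this counting argument is trivial to check (a Python sketch, not from the book):

from math import comb

# Exercise 1.10: a 2-error-correcting length-14 linear code with 8 source bits would
# need a distinct syndrome for every error pattern of weight up to two.
N, K = 14, 8
M = N - K
patterns = comb(N, 0) + comb(N, 1) + comb(N, 2)
print(patterns, 2**M)   # 106 error patterns but only 64 syndromes, so no such code exists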

Figure 1.21. The graph defining the (30, 11) dodecahedron code. The circles are the 30 transmitted bits and the triangles are the 20 parity checks. One parity check is redundant.

Figure 1.22. Graph of a rate-1/4 low-density parity-check code (Gallager code) with blocklength N = 16, and M = 12 parity-check constraints. Each white circle represents a transmitted bit. Each bit participates in j = 3 constraints, represented by squares. The edges between nodes were placed at random. (See Chapter 47 for more.)


The same counting argument works fine for nonlinear codes too. When the decoder receives r = t + n, his aim is to deduce both t and n from r. If it is the case that the sender can select any transmission t from a code of size S_t, and the channel can select any noise vector from a set of size S_n, and those two selections can be recovered from the received bit string r, which is one of at most 2^N possible strings, then it must be the case that

    S_t S_n ≤ 2^N.    (1.57)

So, for a (N, K) two-error-correcting code, whether linear or nonlinear,

    2^K \left[ \binom{N}{0} + \binom{N}{1} + \binom{N}{2} \right] ≤ 2^N.    (1.58)

Solution to exercise 1.11 (p.14). There are various strategies for making codes that can correct multiple errors, and I strongly recommend you think out one or two of them for yourself. If your approach uses a linear code, e.g., one with a collection of M parity checks, it is helpful to bear in mind the counting argument given in the previous exercise, in order to anticipate how many parity checks, M, you might need. Examples of codes that can correct any two errors are the (30, 11) dodecahedron code on page 20, and the (15, 6) pentagonful code to be introduced on p.221. Further simple ideas for making codes that can correct multiple errors from codes that can correct only one error are discussed in section 13.7.

Solution to exercise 1.12 (p.16). The probability of error of R_3^2 is, to leading order,

    p_b(R_3^2) ≃ 3 [p_b(R_3)]^2 = 3(3f^2)^2 + · · · = 27f^4 + · · ·,    (1.59)

whereas the probability of error of R_9 is dominated by the probability of five flips,

    p_b(R_9) ≃ \binom{9}{5} f^5 (1 - f)^4 ≃ 126 f^5 + · · ·.    (1.60)

The R_3^2 decoding procedure is therefore suboptimal, since there are noise vectors of weight four that cause it to make a decoding error. It has the advantage, however, of requiring smaller computational resources: only memorization of three bits, and counting up to three, rather than counting up to nine.

This simple code illustrates an important concept. Concatenated codes are widely used in practice because concatenation allows large codes to be implemented using simple encoding and decoding hardware. Some of the best known practical codes are concatenated codes.
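The two decoders of exercise 1.12 can also be compared exactly, since the concatenated decoder's error probability is just the R_3 error function composed with itself. A short Python sketch (mine, not the book's):

from math import comb

def p_R3(f):
    """Majority-vote error probability of R3 (equation 1.36)."""
    return 3 * f**2 * (1 - f) + f**3

def p_R9_optimal(f):
    """Optimal decoding of R9: error iff five or more of the nine bits are flipped."""
    return sum(comb(9, r) * f**r * (1 - f)**(9 - r) for r in range(5, 10))

def p_R9_concatenated(f):
    """Decode three bits at a time with R3, then apply the R3 decoder again."""
    return p_R3(p_R3(f))

for f in (0.1, 0.01):
    print(f, p_R9_optimal(f), p_R9_concatenated(f))   # compare with 126 f^5 and 27 f^4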


2 Probability, Entropy, and Inference

This chapter, and its sibling, Chapter 8, devote some time to notation. Just as the White Knight distinguished between the song, the name of the song, and what the name of the song was called (Carroll, 1998), we will sometimes need to be careful to distinguish between a random variable, the value of the random variable, and the proposition that asserts that the random variable has a particular value. In any particular chapter, however, I will use the most simple and friendly notation possible, at the risk of upsetting pure-minded readers. For example, if something is 'true with probability 1', I will usually simply say that it is 'true'.

2.1 Probabilities and ensembles

An ensemble X is a triple (x, A_X, P_X), where the outcome x is the value of a random variable, which takes on one of a set of possible values, A_X = {a_1, a_2, . . . , a_i, . . . , a_I}, having probabilities P_X = {p_1, p_2, . . . , p_I}, with P(x = a_i) = p_i, p_i ≥ 0 and \sum_{a_i ∈ A_X} P(x = a_i) = 1.

The name A is mnemonic for 'alphabet'. One example of an ensemble is a letter that is randomly selected from an English document. This ensemble is shown in figure 2.1. There are twenty-seven possible letters: a–z, and a space character '-'.

Abbreviations. Briefer notation will sometimes be used. For example, P(x = a_i) may be written as P(a_i) or P(x).

Probability of a subset. If T is a subset of A_X then

    P(T) = P(x ∈ T) = \sum_{a_i ∈ T} P(x = a_i).    (2.1)

For example, if we define V to be the vowels from figure 2.1, V = {a, e, i, o, u}, then

    P(V) = 0.06 + 0.09 + 0.06 + 0.07 + 0.03 = 0.31.    (2.2)

A joint ensemble XY is an ensemble in which each outcome is an ordered pair x, y with x ∈ A_X = {a_1, . . . , a_I} and y ∈ A_Y = {b_1, . . . , b_J}. Commas are optional when writing ordered pairs, so xy ⇔ x, y.

We call P(x, y) the joint probability of x and y.

N.B. In a joint ensemble XY the two variables are not necessarily independent.

Figure 2.1. Probability distribution over the 27 outcomes for a randomly selected letter in an English language document (estimated from The Frequently Asked Questions Manual for Linux ). The picture shows the probabilities by the areas of white squares.



Figure 2.2. The probability distribution over the 27×27 possible bigrams xy in an English language document, The Frequently Asked Questions Manual for Linux.


Marginal probability. We can obtain the marginal probability P(x) from the joint probability P(x, y) by summation:

    P(x = a_i) ≡ \sum_{y ∈ A_Y} P(x = a_i, y).    (2.3)

Similarly, using briefer notation, the marginal probability of y is:

    P(y) ≡ \sum_{x ∈ A_X} P(x, y).    (2.4)

Conditional probability

    P(x = a_i | y = b_j) ≡ \frac{P(x = a_i, y = b_j)}{P(y = b_j)}  if P(y = b_j) ≠ 0.    (2.5)

[If P (y = bj ) = 0 then P (x = ai | y = bj ) is undefined.]

We pronounce P (x = ai | y = bj ) ‘the probability that x equals ai , given y equals bj ’.

Example 2.1. An example of a joint ensemble is the ordered pair XY consisting of two successive letters in an English document. The possible outcomes are ordered pairs such as aa, ab, ac, and zz; of these, we might expect ab and ac to be more probable than aa and zz. An estimate of the joint probability distribution for two neighbouring characters is shown graphically in figure 2.2. This joint ensemble has the special property that its two marginal distributions, P (x) and P (y), are identical. They are both equal to the monogram distribution shown in figure 2.1. From this joint ensemble P (x, y) we can obtain conditional distributions, P (y | x) and P (x | y), by normalizing the rows and columns, respectively (figure 2.3). The probability P (y | x = q) is the probability distribution of the second letter given that the first letter is a q. As you can see in figure 2.3a, the two most probable values for the second letter y given


Figure 2.3. Conditional probability distributions. (a) P (y | x): Each row shows the conditional distribution of the second letter, y, given the first letter, x, in a bigram xy. (b) P (x | y): Each column shows the conditional distribution of the first letter, x, given the second letter, y.


that the first letter x is q are u and -. (The space is common after q because the source document makes heavy use of the word FAQ.) The probability P(x | y = u) is the probability distribution of the first letter x given that the second letter y is a u. As you can see in figure 2.3b the two most probable values for x given y = u are n and o.

Rather than writing down the joint probability directly, we often define an ensemble in terms of a collection of conditional probabilities. The following rules of probability theory will be useful. (H denotes assumptions on which the probabilities are based.)

Product rule – obtained from the definition of conditional probability:

    P(x, y | H) = P(x | y, H) P(y | H) = P(y | x, H) P(x | H).    (2.6)

This rule is also known as the chain rule.

Sum rule – a rewriting of the marginal probability definition:

    P(x | H) = \sum_y P(x, y | H)    (2.7)
             = \sum_y P(x | y, H) P(y | H).    (2.8)

Bayes' theorem – obtained from the product rule:

    P(y | x, H) = \frac{P(x | y, H) P(y | H)}{P(x | H)}    (2.9)
                = \frac{P(x | y, H) P(y | H)}{\sum_{y'} P(x | y', H) P(y' | H)}.    (2.10)

Independence. Two random variables X and Y are independent (sometimes written X ⊥ Y) if and only if

    P(x, y) = P(x) P(y).    (2.11)

Exercise 2.2.[1, p.40] Are the random variables X and Y in the joint ensemble of figure 2.2 independent?
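Exercise 2.2 can be settled numerically once the bigram counts are available: compute both marginals and compare their product with the joint distribution. A minimal Python sketch, using a made-up 2 × 2 joint distribution in place of the 27 × 27 bigram table of figure 2.2 (which is not reproduced here):

import numpy as np

# A made-up joint distribution P(x, y), standing in for the bigram table of figure 2.2.
P = np.array([[0.30, 0.10],
              [0.15, 0.45]])

Px = P.sum(axis=1)               # marginal P(x): sum over y  (equation 2.3)
Py = P.sum(axis=0)               # marginal P(y): sum over x  (equation 2.4)
P_y_given_x = P / Px[:, None]    # conditional P(y | x): normalize each row (cf. figure 2.3a)

independent = np.allclose(P, np.outer(Px, Py))   # independence test (equation 2.11)
print(Px, Py, independent)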


I said that we often define an ensemble in terms of a collection of conditional probabilities. The following example illustrates this idea.

Example 2.3. Jo has a test for a nasty disease. We denote Jo's state of health by the variable a and the test result by b.

    a = 1    Jo has the disease
    a = 0    Jo does not have the disease.    (2.12)

The result of the test is either 'positive' (b = 1) or 'negative' (b = 0); the test is 95% reliable: in 95% of cases of people who really have the disease, a positive result is returned, and in 95% of cases of people who do not have the disease, a negative result is obtained. The final piece of background information is that 1% of people of Jo's age and background have the disease.

OK – Jo has the test, and the result is positive. What is the probability that Jo has the disease?

Solution. We write down all the provided probabilities. The test reliability specifies the conditional probability of b given a:

    P(b = 1 | a = 1) = 0.95    P(b = 1 | a = 0) = 0.05
    P(b = 0 | a = 1) = 0.05    P(b = 0 | a = 0) = 0.95;    (2.13)

and the disease prevalence tells us about the marginal probability of a:

    P(a = 1) = 0.01    P(a = 0) = 0.99.    (2.14)

From the marginal P(a) and the conditional probability P(b | a) we can deduce the joint probability P(a, b) = P(a) P(b | a) and any other probabilities we are interested in. For example, by the sum rule, the marginal probability of b = 1 – the probability of getting a positive result – is

    P(b = 1) = P(b = 1 | a = 1) P(a = 1) + P(b = 1 | a = 0) P(a = 0).    (2.15)

Jo has received a positive result b = 1 and is interested in how plausible it is that she has the disease (i.e., that a = 1). The man in the street might be duped by the statement 'the test is 95% reliable, so Jo's positive result implies that there is a 95% chance that Jo has the disease', but this is incorrect. The correct solution to an inference problem is found using Bayes' theorem.

    P(a = 1 | b = 1) = \frac{P(b = 1 | a = 1) P(a = 1)}{P(b = 1 | a = 1) P(a = 1) + P(b = 1 | a = 0) P(a = 0)}    (2.16)
                     = \frac{0.95 × 0.01}{0.95 × 0.01 + 0.05 × 0.99}    (2.17)
                     = 0.16.    (2.18)

So in spite of the positive result, the probability that Jo has the disease is only 16%. □
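A couple of lines of Python (a sketch, not from the book; the parameter names are mine) reproduce the arithmetic of (2.16)–(2.18):

def posterior_disease(prior=0.01, sensitivity=0.95, specificity=0.95):
    """P(a=1 | b=1) for Jo's test, by Bayes' theorem (equation 2.16)."""
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1 - specificity
    evidence = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / evidence

print(posterior_disease())   # approximately 0.16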

2.2 The meaning of probability

Probabilities can be used in two ways.

Probabilities can describe frequencies of outcomes in random experiments, but giving noncircular definitions of the terms 'frequency' and 'random' is a challenge – what does it mean to say that the frequency of a tossed coin's


Notation. Let 'the degree of belief in proposition x' be denoted by B(x). The negation of x (not-x) is written x̄. The degree of belief in a conditional proposition, 'x, assuming proposition y to be true', is represented by B(x | y).

Axiom 1. Degrees of belief can be ordered; if B(x) is 'greater' than B(y), and B(y) is 'greater' than B(z), then B(x) is 'greater' than B(z). [Consequence: beliefs can be mapped onto real numbers.]

Axiom 2. The degree of belief in a proposition x and its negation x̄ are related. There is a function f such that B(x̄) = f[B(x)].

Axiom 3. The degree of belief in a conjunction of propositions x, y (x and y) is related to the degree of belief in the conditional proposition x | y and the degree of belief in the proposition y. There is a function g such that B(x, y) = g[B(x | y), B(y)].

coming up heads is 1/2? If we say that this frequency is the average fraction of heads in long sequences, we have to define ‘average’; and it is hard to define ‘average’ without using a word synonymous to probability! I will not attempt to cut this philosophical knot. Probabilities can also be used, more generally, to describe degrees of belief in propositions that do not involve random variables – for example ‘the probability that Mr. S. was the murderer of Mrs. S., given the evidence’ (he either was or wasn’t, and it’s the jury’s job to assess how probable it is that he was); ‘the probability that Thomas Jefferson had a child by one of his slaves’; ‘the probability that Shakespeare’s plays were written by Francis Bacon’; or, to pick a modern-day example, ‘the probability that a particular signature on a particular cheque is genuine’. The man in the street is happy to use probabilities in both these ways, but some books on probability restrict probabilities to refer only to frequencies of outcomes in repeatable random experiments. Nevertheless, degrees of belief can be mapped onto probabilities if they satisfy simple consistency rules known as the Cox axioms (Cox, 1946) (figure 2.4). Thus probabilities can be used to describe assumptions, and to describe inferences given those assumptions. The rules of probability ensure that if two people make the same assumptions and receive the same data then they will draw identical conclusions. This more general use of probability to quantify beliefs is known as the Bayesian viewpoint. It is also known as the subjective interpretation of probability, since the probabilities depend on assumptions. Advocates of a Bayesian approach to data modelling and pattern recognition do not view this subjectivity as a defect, since in their view, you cannot do inference without making assumptions. In this book it will from time to time be taken for granted that a Bayesian approach makes sense, but the reader is warned that this is not yet a globally held view – the field of statistics was dominated for most of the 20th century by non-Bayesian methods in which probabilities are allowed to describe only random variables. The big difference between the two approaches is that

Box 2.4. The Cox axioms. If a set of beliefs satisfy these axioms then they can be mapped onto probabilities satisfying P(false) = 0, P(true) = 1, 0 ≤ P(x) ≤ 1, and the rules of probability: P(x̄) = 1 − P(x), and P(x, y) = P(x | y) P(y).


Bayesians also use probabilities to describe inferences.

2.3 Forward probabilities and inverse probabilities

Probability calculations often fall into one of two categories: forward probability and inverse probability. Here is an example of a forward probability problem:

Exercise 2.4.[2, p.40] An urn contains K balls, of which B are black and W = K − B are white. Fred draws a ball at random from the urn and replaces it, N times.
(a) What is the probability distribution of the number of times a black ball is drawn, n_B?
(b) What is the expectation of n_B? What is the variance of n_B? What is the standard deviation of n_B? Give numerical answers for the cases N = 5 and N = 400, when B = 2 and K = 10.

Forward probability problems involve a generative model that describes a process that is assumed to give rise to some data; the task is to compute the probability distribution or expectation of some quantity that depends on the data. Here is another example of a forward probability problem:

Exercise 2.5.[2, p.40] An urn contains K balls, of which B are black and W = K − B are white. We define the fraction f_B ≡ B/K. Fred draws N times from the urn, exactly as in exercise 2.4, obtaining n_B blacks, and computes the quantity

    z = \frac{(n_B − f_B N)^2}{N f_B (1 − f_B)}.    (2.19)

What is the expectation of z? In the case N = 5 and f_B = 1/5, what is the probability distribution of z? What is the probability that z < 1? [Hint: compare z with the quantities computed in the previous exercise.]

Like forward probability problems, inverse probability problems involve a generative model of a process, but instead of computing the probability distribution of some quantity produced by the process, we compute the conditional probability of one or more of the unobserved variables in the process, given the observed variables. This invariably requires the use of Bayes' theorem.

Example 2.6. There are eleven urns labelled by u ∈ {0, 1, 2, . . . , 10}, each containing ten balls. Urn u contains u black balls and 10 − u white balls. Fred selects an urn u at random and draws N times with replacement from that urn, obtaining n_B blacks and N − n_B whites. Fred's friend, Bill, looks on. If after N = 10 draws n_B = 3 blacks have been drawn, what is the probability that the urn Fred is using is urn u, from Bill's point of view? (Bill doesn't know the value of u.)

Solution. The joint probability distribution of the random variables u and n_B can be written

    P(u, n_B | N) = P(n_B | u, N) P(u).    (2.20)

From the joint probability of u and n_B, we can obtain the conditional distribution of u given n_B:

    P(u | n_B, N) = \frac{P(u, n_B | N)}{P(n_B | N)}    (2.21)
                  = \frac{P(n_B | u, N) P(u)}{P(n_B | N)}.    (2.22)


Figure 2.5. Joint probability of u and nB for Bill and Fred’s urn problem, after N = 10 draws.

The marginal probability of u is P(u) = 1/11 for all u. You wrote down the probability of n_B given u and N, P(n_B | u, N), when you solved exercise 2.4 (p.27). [You are doing the highly recommended exercises, aren't you?] If we define f_u ≡ u/10 then

    P(n_B | u, N) = \binom{N}{n_B} f_u^{n_B} (1 − f_u)^{N − n_B}.    (2.23)

What about the denominator, P(n_B | N)? This is the marginal probability of n_B, which we can obtain using the sum rule:

    P(n_B | N) = \sum_u P(u, n_B | N) = \sum_u P(u) P(n_B | u, N).    (2.24)

So the conditional probability of u given n_B is

    P(u | n_B, N) = \frac{P(u) P(n_B | u, N)}{P(n_B | N)}    (2.25)
                  = \frac{1}{P(n_B | N)} \frac{1}{11} \binom{N}{n_B} f_u^{n_B} (1 − f_u)^{N − n_B}.    (2.26)

This conditional distribution can be found by normalizing column 3 of figure 2.5 and is shown in figure 2.6. The normalizing constant, the marginal probability of nB , is P (nB = 3 | N = 10) = 0.083. The posterior probability (2.26) is correct for all u, including the end-points u = 0 and u = 10, where fu = 0 and fu = 1 respectively. The posterior probability that u = 0 given nB = 3 is equal to zero, because if Fred were drawing from urn 0 it would be impossible for any black balls to be drawn. The posterior probability that u = 10 is also zero, because there are no white balls in that urn. The other hypotheses u = 1, u = 2, . . . u = 9 all have non-zero posterior probability. 2
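The posterior of figure 2.6 and the normalizing constant 0.083 are easily reproduced; here is a short Python sketch (not from the book):

from math import comb

N, nB = 10, 3
prior = [1 / 11] * 11                                    # P(u) = 1/11 for u = 0..10
likelihood = [comb(N, nB) * (u / 10) ** nB * (1 - u / 10) ** (N - nB)
              for u in range(11)]                        # P(nB | u, N), equation (2.23)
evidence = sum(p * l for p, l in zip(prior, likelihood)) # P(nB | N), equation (2.24)
posterior = [p * l / evidence for p, l in zip(prior, likelihood)]

print(round(evidence, 3))                 # 0.083
print([round(p, 3) for p in posterior])   # matches figure 2.6, e.g. P(u=3 | nB=3) ≈ 0.29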

Terminology of inverse probability

In inverse probability problems it is convenient to give names to the probabilities appearing in Bayes' theorem. In equation (2.25), we call the marginal probability P(u) the prior probability of u, and P(n_B | u, N) is called the likelihood of u. It is important to note that the terms likelihood and probability are not synonyms. The quantity P(n_B | u, N) is a function of both n_B and u. For fixed u, P(n_B | u, N) defines a probability over n_B. For fixed n_B, P(n_B | u, N) defines the likelihood of u.


Figure 2.6. Conditional probability of u given n_B = 3 and N = 10:

    u:                  0   1      2     3     4     5     6      7       8        9          10
    P(u | n_B = 3, N):  0   0.063  0.22  0.29  0.24  0.13  0.047  0.0099  0.00086  0.0000096  0


Never say 'the likelihood of the data'. Always say 'the likelihood of the parameters'. The likelihood function is not a probability distribution. (If you want to mention the data that a likelihood function is associated with, you may say 'the likelihood of the parameters given the data'.)

The conditional probability P(u | n_B, N) is called the posterior probability of u given n_B. The normalizing constant P(n_B | N) has no u-dependence so its value is not important if we simply wish to evaluate the relative probabilities of the alternative hypotheses u. However, in most data-modelling problems of any complexity, this quantity becomes important, and it is given various names: P(n_B | N) is known as the evidence or the marginal likelihood.

If θ denotes the unknown parameters, D denotes the data, and H denotes the overall hypothesis space, the general equation

    P(θ | D, H) = \frac{P(D | θ, H) P(θ | H)}{P(D | H)}    (2.27)

is written

    posterior = \frac{likelihood × prior}{evidence}.    (2.28)

Inverse probability and prediction

Example 2.6 (continued). Assuming again that Bill has observed n_B = 3 blacks in N = 10 draws, let Fred draw another ball from the same urn. What is the probability that the next drawn ball is a black? [You should make use of the posterior probabilities in figure 2.6.]

Solution. By the sum rule,

    P(ball_{N+1} is black | n_B, N) = \sum_u P(ball_{N+1} is black | u, n_B, N) P(u | n_B, N).    (2.29)

Since the balls are drawn with replacement from the chosen urn, the probability P(ball_{N+1} is black | u, n_B, N) is just f_u = u/10, whatever n_B and N are. So

    P(ball_{N+1} is black | n_B, N) = \sum_u f_u P(u | n_B, N).    (2.30)

Using the values of P(u | n_B, N) given in figure 2.6 we obtain

    P(ball_{N+1} is black | n_B = 3, N = 10) = 0.333. □    (2.31)

Comment. Notice the difference between this prediction obtained using probability theory, and the widespread practice in statistics of making predictions by first selecting the most plausible hypothesis (which here would be that the urn is urn u = 3) and then making the predictions assuming that hypothesis to be true (which would give a probability of 0.3 that the next ball is black). The correct prediction is the one that takes into account the uncertainty by marginalizing over the possible values of the hypothesis u. Marginalization here leads to slightly more moderate, less extreme predictions.
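The contrast drawn in this comment can be seen directly in code. The sketch below (Python, mine) recomputes the posterior of figure 2.6 and compares the marginalized prediction (2.30) with the plug-in prediction based on the single most probable urn:

from math import comb

N, nB = 10, 3
lik = [comb(N, nB) * (u / 10) ** nB * (1 - u / 10) ** (N - nB) for u in range(11)]
Z = sum(lik)
post = [l / Z for l in lik]                                # uniform prior cancels

marginalized = sum((u / 10) * post[u] for u in range(11))  # equation (2.30)
plug_in = max(range(11), key=lambda u: post[u]) / 10       # predict using the single best u
print(round(marginalized, 3), plug_in)                     # 0.333 versus 0.3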


Inference as inverse probability

Now consider the following exercise, which has the character of a simple scientific investigation.

Example 2.7. Bill tosses a bent coin N times, obtaining a sequence of heads and tails. We assume that the coin has a probability f_H of coming up heads; we do not know f_H. If n_H heads have occurred in N tosses, what is the probability distribution of f_H? (For example, N might be 10, and n_H might be 3; or, after a lot more tossing, we might have N = 300 and n_H = 29.) What is the probability that the N+1th outcome will be a head, given n_H heads in N tosses?

Unlike example 2.6 (p.27), this problem has a subjective element. Given a restricted definition of probability that says 'probabilities are the frequencies of random variables', this example is different from the eleven-urns example. Whereas the urn u was a random variable, the bias f_H of the coin would not normally be called a random variable. It is just a fixed but unknown parameter that we are interested in. Yet don't the two examples 2.6 and 2.7 seem to have an essential similarity? [Especially when N = 10 and n_H = 3!]

To solve example 2.7, we have to make an assumption about what the bias of the coin f_H might be. This prior probability distribution over f_H, P(f_H), corresponds to the prior over u in the eleven-urns problem. In that example, the helpful problem definition specified P(u). In real life, we have to make assumptions in order to assign priors; these assumptions will be subjective, and our answers will depend on them. Exactly the same can be said for the other probabilities in our generative model too. We are assuming, for example, that the balls are drawn from an urn independently; but could there not be correlations in the sequence because Fred's ball-drawing action is not perfectly random? Indeed there could be, so the likelihood function that we use depends on assumptions too. In real data modelling problems, priors are subjective and so are likelihoods.

We are now using P() to denote probability densities over continuous variables as well as probabilities over discrete variables and probabilities of logical propositions. The probability that a continuous variable v lies between values a and b (where b > a) is defined to be \int_a^b dv\, P(v). P(v) dv is dimensionless. The density P(v) is a dimensional quantity, having dimensions inverse to the dimensions of v – in contrast to discrete probabilities, which are dimensionless. Don't be surprised to see probability densities greater than 1. This is normal, and nothing is wrong, as long as \int_a^b dv\, P(v) ≤ 1 for any interval (a, b). Conditional and joint probability densities are defined in just the same way as conditional and joint probabilities.

Exercise 2.8.[2] Assuming a uniform prior on f_H, P(f_H) = 1, solve the problem posed in example 2.7 (p.30). Sketch the posterior distribution of f_H and compute the probability that the N+1th outcome will be a head, for (a) N = 3 and n_H = 0; (b) N = 3 and n_H = 2; (c) N = 10 and n_H = 3; (d) N = 300 and n_H = 29. You will find the beta integral useful:

    \int_0^1 dp_a\, p_a^{F_a} (1 − p_a)^{F_b} = \frac{\Gamma(F_a + 1)\,\Gamma(F_b + 1)}{\Gamma(F_a + F_b + 2)} = \frac{F_a!\, F_b!}{(F_a + F_b + 1)!}.    (2.32)

Here P (f ) denotes a probability density, rather than a probability distribution.


You may also find it instructive to look back at example 2.6 (p.27) and equation (2.31).

People sometimes confuse assigning a prior distribution to an unknown parameter such as f_H with making an initial guess of the value of the parameter. But the prior over f_H, P(f_H), is not a simple statement like 'initially, I would guess f_H = 1/2'. The prior is a probability density over f_H which specifies the prior degree of belief that f_H lies in any interval (f, f + δf). It may well be the case that our prior for f_H is symmetric about 1/2, so that the mean of f_H under the prior is 1/2. In this case, the predictive distribution for the first toss x_1 would indeed be

    P(x_1 = head) = \int df_H\, P(f_H) P(x_1 = head | f_H) = \int df_H\, P(f_H) f_H = 1/2.    (2.33)

But the prediction for subsequent tosses will depend on the whole prior distribution, not just its mean.

Data compression and inverse probability

Consider the following task.

Example 2.9. Write a computer program capable of compressing binary files like this one:

0000000000000000000010010001000000100000010000000000000000000000000000000000001010000000000000110000 1000000000010000100000000010000000000000000000000100000000000000000100000000011000001000000011000100 0000000001001000000000010001000000000000000011000000000000000000000000000010000000000000000100000000

The string shown contains n1 = 29 1s and n0 = 271 0s. Intuitively, compression works by taking advantage of the predictability of a file. In this case, the source of the file appears more likely to emit 0s than 1s. A data compression program that compresses this file must, implicitly or explicitly, be addressing the question ‘What is the probability that the next character in this file is a 1?’ Do you think this problem is similar in character to example 2.7 (p.30)? I do. One of the themes of this book is that data compression and data modelling are one and the same, and that they should both be addressed, like the urn of example 2.6, using inverse probability. Example 2.9 is solved in Chapter 6.

The likelihood principle

Please solve the following two exercises.

Example 2.10. Urn A contains three balls: one black, and two white; urn B contains three balls: two black, and one white. One of the urns is selected at random and one ball is drawn. The ball is black. What is the probability that the selected urn is urn A?

Example 2.11. Urn A contains five balls: one black, two white, one green and one pink; urn B contains five hundred balls: two hundred black, one hundred white, 50 yellow, 40 cyan, 30 sienna, 25 green, 25 silver, 20 gold, and 10 purple. [One fifth of A's balls are black; two-fifths of B's are black.] One of the urns is selected at random and one ball is drawn. The ball is black. What is the probability that the urn is urn A?


Figure 2.7. Urns for example 2.10.


Figure 2.8. Urns for example 2.11.


What do you notice about your solutions? Does each answer depend on the detailed contents of each urn? The details of the other possible outcomes and their probabilities are irrelevant. All that matters is the probability of the outcome that actually happened (here, that the ball drawn was black) given the different hypotheses. We need only to know the likelihood, i.e., how the probability of the data that happened varies with the hypothesis. This simple rule about inference is known as the likelihood principle.

The likelihood principle: given a generative model for data d given parameters θ, P(d | θ), and having observed a particular outcome d_1, all inferences and predictions should depend only on the function P(d_1 | θ).

In spite of the simplicity of this principle, many classical statistical methods violate it.

2.4 Definition of entropy and related functions

The Shannon information content of an outcome x is defined to be

    h(x) = \log_2 \frac{1}{P(x)}.    (2.34)

It is measured in bits. [The word 'bit' is also used to denote a variable whose value is 0 or 1; I hope context will always make clear which of the two meanings is intended.]

In the next few chapters, we will establish that the Shannon information content h(a_i) is indeed a natural measure of the information content of the event x = a_i. At that point, we will shorten the name of this quantity to 'the information content'.

The fourth column in table 2.9 shows the Shannon information content of the 27 possible outcomes when a random character is picked from an English document. The outcome x = z has a Shannon information content of 10.4 bits, and x = e has an information content of 3.5 bits.

The entropy of an ensemble X is defined to be the average Shannon information content of an outcome:

    H(X) ≡ \sum_{x ∈ A_X} P(x) \log \frac{1}{P(x)},    (2.35)

with the convention for P(x) = 0 that 0 × log 1/0 ≡ 0, since \lim_{θ→0^+} θ \log 1/θ = 0.

Like the information content, entropy is measured in bits. When it is convenient, we may also write H(X) as H(p), where p is the vector (p1 , p2 , . . . , pI ). Another name for the entropy of X is the uncertainty of X. Example 2.12. The entropy of a randomly selected letter in an English document is about 4.11 bits, assuming its probability is as given in table 2.9. We obtain this number by averaging log 1/p i (shown in the fourth column) under the probability distribution p i (shown in the third column).

Table 2.9. Shannon information contents of the outcomes a–z.

    i    a_i    p_i      h(p_i) = log_2(1/p_i)
    1    a     .0575     4.1
    2    b     .0128     6.3
    3    c     .0263     5.2
    4    d     .0285     5.1
    5    e     .0913     3.5
    6    f     .0173     5.9
    7    g     .0133     6.2
    8    h     .0313     5.0
    9    i     .0599     4.1
    10   j     .0006    10.7
    11   k     .0084     6.9
    12   l     .0335     4.9
    13   m     .0235     5.4
    14   n     .0596     4.1
    15   o     .0689     3.9
    16   p     .0192     5.7
    17   q     .0008    10.3
    18   r     .0508     4.3
    19   s     .0567     4.1
    20   t     .0706     3.8
    21   u     .0334     4.9
    22   v     .0069     7.2
    23   w     .0119     6.4
    24   x     .0073     7.1
    25   y     .0164     5.9
    26   z     .0007    10.4
    27   -     .1928     2.4

    \sum_i p_i \log_2 \frac{1}{p_i} ≈ 4.1
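Example 2.12's figure of 4.11 bits is easy to reproduce from the probabilities in table 2.9 (a Python sketch, not from the book):

from math import log2

# Monogram probabilities of a..z and space, as in table 2.9 / figure 2.1.
p = [0.0575, 0.0128, 0.0263, 0.0285, 0.0913, 0.0173, 0.0133, 0.0313, 0.0599,
     0.0006, 0.0084, 0.0335, 0.0235, 0.0596, 0.0689, 0.0192, 0.0008, 0.0508,
     0.0567, 0.0706, 0.0334, 0.0069, 0.0119, 0.0073, 0.0164, 0.0007, 0.1928]

H = sum(pi * log2(1 / pi) for pi in p)   # entropy, equation (2.35)
print(round(H, 2))                       # about 4.11 bits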


We now note some properties of the entropy function.

• H(X) ≥ 0 with equality iff p_i = 1 for one i. ['iff' means 'if and only if'.]

• Entropy is maximized if p is uniform:

    H(X) ≤ \log(|A_X|) with equality iff p_i = 1/|A_X| for all i.    (2.36)

Notation: the vertical bars '| · |' have two meanings. If A_X is a set, |A_X| denotes the number of elements in A_X; if x is a number, then |x| is the absolute value of x.

The redundancy measures the fractional difference between H(X) and its maximum possible value, log(|A_X|). The redundancy of X is:

    1 − \frac{H(X)}{\log |A_X|}.    (2.37)

We won't make use of 'redundancy' in this book, so I have not assigned a symbol to it.

The joint entropy of X, Y is:

    H(X, Y) = \sum_{xy ∈ A_X A_Y} P(x, y) \log \frac{1}{P(x, y)}.    (2.38)

Entropy is additive for independent random variables:

    H(X, Y) = H(X) + H(Y) iff P(x, y) = P(x) P(y).    (2.39)

Our definitions for information content so far apply only to discrete probability distributions over finite sets AX . The definitions can be extended to infinite sets, though the entropy may then be infinite. The case of a probability density over a continuous set is addressed in section 11.3. Further important definitions and exercises to do with entropy will come along in section 8.1.

2.5 Decomposability of the entropy

The entropy function satisfies a recursive property that can be very useful when computing entropies. For convenience, we'll stretch our notation so that we can write H(X) as H(p), where p is the probability vector associated with the ensemble X.

Let's illustrate the property by an example first. Imagine that a random variable x ∈ {0, 1, 2} is created by first flipping a fair coin to determine whether x = 0; then, if x is not 0, flipping a fair coin a second time to determine whether x is 1 or 2. The probability distribution of x is

    P(x = 0) = 1/2;   P(x = 1) = 1/4;   P(x = 2) = 1/4.    (2.40)

What is the entropy of X? We can either compute it by brute force:

    H(X) = 1/2 \log 2 + 1/4 \log 4 + 1/4 \log 4 = 1.5;    (2.41)

or we can use the following decomposition, in which the value of x is revealed gradually. Imagine first learning whether x = 0, and then, if x is not 0, learning which non-zero value is the case. The revelation of whether x = 0 or not entails


revealing a binary variable whose probability distribution is {1/2, 1/2}. This revelation has an entropy H(1/2, 1/2) = 1/2 log 2 + 1/2 log 2 = 1 bit. If x is not 0, we learn the value of the second coin flip. This too is a binary variable whose probability distribution is {1/2, 1/2}, and whose entropy is 1 bit. We only get to experience the second revelation half the time, however, so the entropy can be written:

   H(X) = H(1/2, 1/2) + 1/2 H(1/2, 1/2).   (2.42)

Generalizing, the observation we are making about the entropy of any probability distribution p = {p1, p2, ..., pI} is that

   H(p) = H(p1, 1−p1) + (1−p1) H(p2/(1−p1), p3/(1−p1), ..., pI/(1−p1)).   (2.43)

When it's written as a formula, this property looks regrettably ugly; nevertheless it is a simple property and one that you should make use of. Generalizing further, the entropy has the property for any m that

   H(p) = H[(p1 + p2 + ··· + pm), (pm+1 + pm+2 + ··· + pI)]
          + (p1 + ··· + pm) H(p1/(p1 + ··· + pm), ..., pm/(p1 + ··· + pm))
          + (pm+1 + ··· + pI) H(pm+1/(pm+1 + ··· + pI), ..., pI/(pm+1 + ··· + pI)).   (2.44)

Example 2.13. A source produces a character x from the alphabet A = {0, 1, ..., 9, a, b, ..., z}; with probability 1/3, x is a numeral (0, ..., 9); with probability 1/3, x is a vowel (a, e, i, o, u); and with probability 1/3 it's one of the 21 consonants. All numerals are equiprobable, and the same goes for vowels and consonants. Estimate the entropy of X.

Solution. log 3 + 1/3 (log 10 + log 5 + log 21) = log 3 + 1/3 log 1050 ≃ log 30 bits. □
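The decomposition property is easy to check numerically. Here is a minimal Python sketch (not part of the book) that compares the brute-force entropy of the distribution in equation (2.40) with the decomposition of equation (2.43).

from math import log2

def H(p):
    # Entropy in bits of a probability vector, using the convention 0 log 1/0 = 0.
    return sum(pi * log2(1.0 / pi) for pi in p if pi > 0)

p = [0.5, 0.25, 0.25]                 # the distribution of equation (2.40)
brute_force = H(p)                    # equation (2.41): 1.5 bits

p1 = p[0]
rest = [pi / (1 - p1) for pi in p[1:]]
decomposed = H([p1, 1 - p1]) + (1 - p1) * H(rest)   # equation (2.43)
print(brute_force, decomposed)        # both 1.5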

2.6 Gibbs' inequality

The relative entropy or Kullback–Leibler divergence between two probability distributions P(x) and Q(x) that are defined over the same alphabet AX is

   DKL(P||Q) = Σ_x P(x) log [P(x)/Q(x)].   (2.45)

The relative entropy satisfies Gibbs' inequality

   DKL(P||Q) ≥ 0   (2.46)

with equality only if P = Q. Note that in general the relative entropy is not symmetric under interchange of the distributions P and Q: in general DKL(P||Q) ≠ DKL(Q||P), so DKL, although it is sometimes called the 'KL distance', is not strictly a distance. The relative entropy is important in pattern recognition and neural networks, as well as in information theory.

Gibbs' inequality is probably the most important inequality in this book. It, and many other inequalities, can be proved using the concept of convexity.

The ‘ei’ in Leibler is pronounced the same as in heist.
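A small numerical sketch (not part of the book) may help fix the two facts just stated: the relative entropy of equation (2.45) is never negative, and it is not symmetric in its arguments.

from math import log2

def D_KL(P, Q):
    # Relative entropy sum_x P(x) log2 [P(x)/Q(x)]; assumes Q(x) > 0 wherever P(x) > 0.
    return sum(p * log2(p / q) for p, q in zip(P, Q) if p > 0)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.2, 0.6]
print(D_KL(P, Q), D_KL(Q, P))   # both non-negative, and generally unequal
print(D_KL(P, P))               # zero, since the distributions are identical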


2.7 Jensen's inequality for convex functions

The words 'convex ^' and 'concave _' may be pronounced 'convex-smile' and 'concave-frown'. This terminology has useful redundancy: while one may forget which way up 'convex' and 'concave' are, it is harder to confuse a smile with a frown.

Convex ^ functions. A function f(x) is convex ^ over (a, b) if every chord of the function lies above the function, as shown in figure 2.10; that is, for all x1, x2 ∈ (a, b) and 0 ≤ λ ≤ 1,

   f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2).   (2.47)

A function f is strictly convex ^ if, for all x1, x2 ∈ (a, b), the equality holds only for λ = 0 and λ = 1.

[Figure 2.10. Definition of convexity: the chord λf(x1) + (1 − λ)f(x2) lies above the function value f(x*) at x* = λx1 + (1 − λ)x2.]

Similar definitions apply to concave _ and strictly concave _ functions.

Some strictly convex ^ functions are

• x², e^x and e^{−x} for all x;

• log(1/x) and x log x for x > 0.

[Figure 2.11. Convex ^ functions: x², e^{−x}, log(1/x), and x log x.]

Jensen's inequality. If f is a convex ^ function and x is a random variable then:

   E[f(x)] ≥ f(E[x]),   (2.48)

where E denotes expectation. If f is strictly convex ^ and E[f(x)] = f(E[x]), then the random variable x is a constant.

Jensen's inequality can also be rewritten for a concave _ function, with the direction of the inequality reversed.

A physical version of Jensen's inequality runs as follows. If a collection of masses pi are placed on a convex ^ curve f(x) at locations (xi, f(xi)), then the centre of gravity of those masses, which is at (E[x], E[f(x)]), lies above the curve. [Margin figure: centre of gravity.]

If this fails to convince you, then feel free to do the following exercise.

Exercise 2.14.[2, p.41] Prove Jensen's inequality.

Example 2.15. Three squares have average area Ā = 100 m². The average of the lengths of their sides is l̄ = 10 m. What can be said about the size of the largest of the three squares? [Use Jensen's inequality.]

Solution. Let x be the length of the side of a square, and let the probability of x be 1/3, 1/3, 1/3 over the three lengths l1, l2, l3. Then the information that we have is that E[x] = 10 and E[f(x)] = 100, where f(x) = x² is the function mapping lengths to areas. This is a strictly convex ^ function. We notice that the equality E[f(x)] = f(E[x]) holds, therefore x is a constant, and the three lengths must all be equal. The area of the largest square is 100 m². □
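A quick numerical check (not part of the book; the masses and locations are arbitrary) of Jensen's inequality for the strictly convex function f(x) = x²: the centre of gravity of a few masses placed on the curve lies above the curve.

ps = [0.2, 0.5, 0.3]           # masses (a probability distribution)
xs = [1.0, 4.0, 10.0]          # their locations

E_x = sum(p * x for p, x in zip(ps, xs))
E_fx = sum(p * x**2 for p, x in zip(ps, xs))
print(E_fx, ">=", E_x**2)      # E[f(x)] >= f(E[x]), equation (2.48)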


Convexity and concavity also relate to maximization

If f(x) is concave _ and there exists a point at which

   ∂f/∂xk = 0 for all k,   (2.49)

then f (x) has its maximum value at that point. The converse does not hold: if a concave _ f (x) is maximized at some x it is not necessarily true that the gradient ∇f (x) is equal to zero there. For example, f (x) = −|x| is maximized at x = 0 where its derivative is undefined; and f (p) = log(p), for a probability p ∈ (0, 1), is maximized on the boundary of the range, at p = 1, where the gradient df (p)/dp = 1.

2.8 Exercises

Sums of random variables

Exercise 2.16.[3, p.41]
(a) Two ordinary dice with faces labelled 1, ..., 6 are thrown. What is the probability distribution of the sum of the values? What is the probability distribution of the absolute difference between the values?
(b) One hundred ordinary dice are thrown. What, roughly, is the probability distribution of the sum of the values? Sketch the probability distribution and estimate its mean and standard deviation.
(c) How can two cubical dice be labelled using the numbers {0, 1, 2, 3, 4, 5, 6} so that when the two dice are thrown the sum has a uniform probability distribution over the integers 1–12?
(d) Is there any way that one hundred dice could be labelled with integers such that the probability distribution of the sum is uniform?

Inference problems

Exercise 2.17.[2, p.41] If q = 1 − p and a = ln(p/q), show that

   p = 1/(1 + exp(−a)).   (2.50)

Sketch this function and find its relationship to the hyperbolic tangent function tanh(u) = (e^u − e^{−u})/(e^u + e^{−u}).

It will be useful to be fluent in base-2 logarithms also. If b = log2(p/q), what is p as a function of b?

. Exercise 2.18.[2, p.42] Let x and y be dependent random variables with x a binary variable taking values in AX = {0, 1}. Use Bayes' theorem to show that the log posterior probability ratio for x given y is

   log [P(x = 1 | y)/P(x = 0 | y)] = log [P(y | x = 1)/P(y | x = 0)] + log [P(x = 1)/P(x = 0)].   (2.51)

. Exercise 2.19.[2, p.42] Let x, d1 and d2 be random variables such that d1 and d2 are conditionally independent given a binary variable x. Use Bayes' theorem to show that the posterior probability ratio for x given {di} is

   P(x = 1 | {di})/P(x = 0 | {di}) = [P(d1 | x = 1)/P(d1 | x = 0)] [P(d2 | x = 1)/P(d2 | x = 0)] [P(x = 1)/P(x = 0)].   (2.52)

This exercise is intended to help you think about the central-limit theorem, which says that if independent random variables x1, x2, ..., xN have means µn and finite variances σn², then, in the limit of large N, the sum Σ_n xn has a distribution that tends to a normal (Gaussian) distribution with mean Σ_n µn and variance Σ_n σn².


Life in high-dimensional spaces

Probability distributions and volumes have some unexpected properties in high-dimensional spaces.

Exercise 2.20.[2, p.42] Consider a sphere of radius r in an N-dimensional real space. Show that the fraction of the volume of the sphere that is in the surface shell lying at values of the radius between r − ε and r, where 0 < ε < r, is:

   f = 1 − (1 − ε/r)^N.   (2.53)

Evaluate f for the cases N = 2, N = 10 and N = 1000, with (a) ε/r = 0.01; (b) ε/r = 0.5.

Implication: points that are uniformly distributed in a sphere in N dimensions, where N is large, are very likely to be in a thin shell near the surface.

Expectations and entropies

You are probably familiar with the idea of computing the expectation of a function of x,

   E[f(x)] = ⟨f(x)⟩ = Σ_x P(x) f(x).   (2.54)

Maybe you are not so comfortable with computing this expectation in cases where the function f (x) depends on the probability P (x). The next few examples address this concern.

Exercise 2.21.[1, p.43] Let pa = 0.1, pb = 0.2, and pc = 0.7. Let f(a) = 10, f(b) = 5, and f(c) = 10/7. What is E[f(x)]? What is E[1/P(x)]?

Exercise 2.22.[2, p.43] For an arbitrary ensemble, what is E[1/P(x)]?

. Exercise 2.23.[1, p.43] Let pa = 0.1, pb = 0.2, and pc = 0.7. Let g(a) = 0, g(b) = 1, and g(c) = 0. What is E[g(x)]?

. Exercise 2.24.[1, p.43] Let pa = 0.1, pb = 0.2, and pc = 0.7. What is the probability that P(x) ∈ [0.15, 0.5]? What is P(|log [P(x)/0.2]| > 0.05)?

Exercise 2.25.[3, p.43] Prove the assertion that H(X) ≤ log(|AX|) with equality iff pi = 1/|AX| for all i. (|AX| denotes the number of elements in the set AX.) [Hint: use Jensen's inequality (2.48); if your first attempt to use Jensen does not succeed, remember that Jensen involves both a random variable and a function, and you have quite a lot of freedom in choosing these; think about whether your chosen function f should be convex or concave.]

. Exercise 2.26.[3, p.44] Prove that the relative entropy (equation (2.45)) satisfies DKL(P||Q) ≥ 0 (Gibbs' inequality) with equality only if P = Q.

. Exercise 2.27.[2] Prove that the entropy is indeed decomposable as described in equations (2.43–2.44).


. Exercise 2.28.[2, p.45] A random variable x ∈ {0, 1, 2, 3} is selected by flipping a bent coin with bias f to determine whether the outcome is in {0, 1} or {2, 3}; then either flipping a second bent coin with bias g or a third bent coin with bias h respectively. Write down the probability distribution of x. Use the decomposability of the entropy (2.44) to find the entropy of X. [Notice how compact an expression is obtained if you make use of the binary entropy function H2(x), compared with writing out the four-term entropy explicitly.] Find the derivative of H(X) with respect to f. [Hint: dH2(x)/dx = log((1 − x)/x).]

. Exercise 2.29.[2, p.45] An unbiased coin is flipped until one head is thrown. What is the entropy of the random variable x ∈ {1, 2, 3, ...}, the number of flips? Repeat the calculation for the case of a biased coin with probability f of coming up heads. [Hint: solve the problem both directly and by using the decomposability of the entropy (2.43).]

2.9 Further exercises

Forward probability

. Exercise 2.30.[1] An urn contains w white balls and b black balls. Two balls are drawn, one after the other, without replacement. Prove that the probability that the first ball is white is equal to the probability that the second is white.

. Exercise 2.31.[2] A circular coin of diameter a is thrown onto a square grid whose squares are b × b. (a < b) What is the probability that the coin will lie entirely within one square? [Ans: (1 − a/b)²]

. Exercise 2.32.[3] Buffon's needle. A needle of length a is thrown onto a plane covered with equally spaced parallel lines with separation b. What is the probability that the needle will cross a line? [Ans, if a < b: 2a/πb] [Generalization – Buffon's noodle: on average, a random curve of length A is expected to intersect the lines 2A/πb times.]

Exercise 2.33.[2] Two points are selected at random on a straight line segment of length 1. What is the probability that a triangle can be constructed out of the three resulting segments?

Exercise 2.34.[2, p.45] An unbiased coin is flipped until one head is thrown. What is the expected number of tails and the expected number of heads? Fred, who doesn't know that the coin is unbiased, estimates the bias using f̂ ≡ h/(h + t), where h and t are the numbers of heads and tails tossed. Compute and sketch the probability distribution of f̂. N.B., this is a forward probability problem, a sampling theory problem, not an inference problem. Don't use Bayes' theorem.

Exercise 2.35.[2, p.45] Fred rolls an unbiased six-sided die once per second, noting the occasions when the outcome is a six.
(a) What is the mean number of rolls from one six to the next six?
(b) Between two rolls, the clock strikes one. What is the mean number of rolls until the next six?

[Margin figure for exercise 2.28: a tree in which the first coin sends x to {0, 1} with probability f and to {2, 3} with probability 1 − f; the second coin then picks 0 (probability g) or 1 (probability 1 − g), or the third coin picks 2 (probability h) or 3 (probability 1 − h).]


(c) Now think back before the clock struck. What is the mean number of rolls, going back in time, until the most recent six?
(d) What is the mean number of rolls from the six before the clock struck to the next six?
(e) Is your answer to (d) different from your answer to (a)? Explain.

Another version of this exercise refers to Fred waiting for a bus at a bus-stop in Poissonville where buses arrive independently at random (a Poisson process), with, on average, one bus every six minutes. What is the average wait for a bus, after Fred arrives at the stop? [6 minutes.] So what is the time between the two buses, the one that Fred just missed, and the one that he catches? [12 minutes.] Explain the apparent paradox. Note the contrast with the situation in Clockville, where the buses are spaced exactly 6 minutes apart. There, as you can confirm, the mean wait at a bus-stop is 3 minutes, and the time between the missed bus and the next one is 6 minutes.
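If the paradox seems suspicious, a simulation may help. The sketch below (not part of the book; the roll count and seed are arbitrary choices) simulates the die-rolling version: the mean gap between successive sixes comes out near 6, while the mean length of the gap containing a fixed moment comes out near 11, because long gaps are more likely to contain any given moment.

import bisect
import random

random.seed(0)
N = 600_000                                            # number of simulated rolls
sixes = [t for t in range(1, N + 1) if random.randint(1, 6) == 6]

gaps = [b - a for a, b in zip(sixes, sixes[1:])]
print("mean gap between sixes:", sum(gaps) / len(gaps))            # about 6

containing = []
for clock in range(1000, N - 1000, 37):                # many fixed 'clock strikes'
    i = bisect.bisect_right(sixes, clock)              # first six after the clock
    if 0 < i < len(sixes):
        containing.append(sixes[i] - sixes[i - 1])
print("mean gap containing the clock:", sum(containing) / len(containing))  # about 11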

Conditional probability

. Exercise 2.36.[2] You meet Fred. Fred tells you he has two brothers, Alf and Bob. What is the probability that Fred is older than Bob? Fred tells you that he is older than Alf. Now, what is the probability that Fred is older than Bob? (That is, what is the conditional probability that F > B given that F > A?)

. Exercise 2.37.[2] The inhabitants of an island tell the truth one third of the time. They lie with probability 2/3. On an occasion, after one of them made a statement, you ask another 'was that statement true?' and he says 'yes'. What is the probability that the statement was indeed true?

. Exercise 2.38.[2, p.46] Compare two ways of computing the probability of error of the repetition code R3, assuming a binary symmetric channel (you did this once for exercise 1.2 (p.7)) and confirm that they give the same answer.

Binomial distribution method. Add the probability that all three bits are flipped to the probability that exactly two bits are flipped.

Sum rule method. Using the sum rule, compute the marginal probability that r takes on each of the eight possible values, P(r). [P(r) = Σ_s P(s)P(r | s).] Then compute the posterior probability of s for each of the eight values of r. [In fact, by symmetry, only two example cases r = (000) and r = (001) need be considered.] Notice that some of the inferred bits are better determined than others. From the posterior probability P(s | r) you can read out the case-by-case error probability, the probability that the more probable hypothesis is not correct, P(error | r). Find the average error probability using the sum rule,

   P(error) = Σ_r P(r) P(error | r).   (2.55)

Equation (1.18) gives the posterior probability of the input s, given the received vector r.


. Exercise 2.39.[3C, p.46] The frequency pn of the nth most frequent word in English is roughly approximated by

   pn ≃ 0.1/n   for n ∈ 1, ..., 12 367
   pn = 0       for n > 12 367.   (2.56)

[This remarkable 1/n law is known as Zipf’s law, and applies to the word frequencies of many languages (Zipf, 1949).] If we assume that English is generated by picking words at random according to this distribution, what is the entropy of English (per word)? [This calculation can be found in ‘Prediction and entropy of printed English’, C.E. Shannon, Bell Syst. Tech. J. 30, pp.50–64 (1950), but, inexplicably, the great man made numerical errors in it.]
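The entropy of this distribution can be evaluated directly; a short Python sketch (not part of the book) is below, and its answer can be compared with the solution given in section 2.10.

from math import log2

# Zipf's law (2.56): p_n = 0.1/n for n = 1..12367, zero beyond.
p = [0.1 / n for n in range(1, 12_368)]
print("total probability:", sum(p))                 # close to 1
H = sum(pn * log2(1.0 / pn) for pn in p)
print("entropy:", round(H, 2), "bits per word")     # about 9.7 bits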

2.10 Solutions

Solution to exercise 2.2 (p.24). No, they are not independent. If they were then all the conditional distributions P(y | x) would be identical functions of y, regardless of x (cf. figure 2.3).

Solution to exercise 2.4 (p.27). We define the fraction fB ≡ B/K.

(a) The number of black balls has a binomial distribution.

   P(nB | fB, N) = (N choose nB) fB^{nB} (1 − fB)^{N−nB}.   (2.57)

(b) The mean and variance of this distribution are:

   E[nB] = N fB   (2.58)

   var[nB] = N fB (1 − fB).   (2.59)

These results were derived in example 1.1 (p.1). The standard deviation of nB is √var[nB] = √(N fB(1 − fB)).

When B/K = 1/5 and N = 5, the expectation and variance of nB are 1 and 4/5. The standard deviation is 0.89.

When B/K = 1/5 and N = 400, the expectation and variance of nB are 80 and 64. The standard deviation is 8.

Solution to exercise 2.5 (p.27). The numerator of the quantity

   z = (nB − fB N)² / (N fB (1 − fB))

can be recognized as (nB − E[nB])²; the denominator is equal to the variance of nB (2.59), which is by definition the expectation of the numerator. So the expectation of z is 1. [A random variable like z, which measures the deviation of data from the expected value, is sometimes called χ² (chi-squared).]

In the case N = 5 and fB = 1/5, N fB is 1, and var[nB] is 4/5. The numerator has five possible values, only one of which is smaller than 1: (nB − fB N)² = 0 has probability P(nB = 1) = 0.4096; so the probability that z < 1 is 0.4096.


Solution to exercise 2.14 (p.35). We wish to prove, given the property

   f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2),   (2.60)

that, if Σ_i pi = 1 and pi ≥ 0,

   Σ_{i=1}^I pi f(xi) ≥ f( Σ_{i=1}^I pi xi ).   (2.61)

We proceed by recursion, working from the right-hand side. (This proof does not handle cases where some pi = 0; such details are left to the pedantic reader.) At the first line we use the definition of convexity (2.60) with λ = p1 / Σ_{i=1}^I pi = p1; at the second line, λ = p2 / Σ_{i=2}^I pi.

   f( Σ_{i=1}^I pi xi ) = f( p1 x1 + Σ_{i=2}^I pi xi )
     ≤ p1 f(x1) + [Σ_{i=2}^I pi] f( Σ_{i=2}^I pi xi / Σ_{i=2}^I pi )
     ≤ p1 f(x1) + [Σ_{i=2}^I pi] [ (p2 / Σ_{i=2}^I pi) f(x2) + (Σ_{i=3}^I pi / Σ_{i=2}^I pi) f( Σ_{i=3}^I pi xi / Σ_{i=3}^I pi ) ],   (2.62)

and so forth. □

Solution to exercise 2.16 (p.36).

(a) For the outcomes {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}, the probabilities are P = {1/36, 2/36, 3/36, 4/36, 5/36, 6/36, 5/36, 4/36, 3/36, 2/36, 1/36}.

(b) The value of one die has mean 3.5 and variance 35/12. So the sum of one hundred has mean 350 and variance 3500/12 ≃ 292, and by the central-limit theorem the probability distribution is roughly Gaussian (but confined to the integers), with this mean and variance.

(c) In order to obtain a sum that has a uniform distribution we have to start from random variables some of which have a spiky distribution with the probability mass concentrated at the extremes. The unique solution is to have one ordinary die and one with faces 6, 6, 6, 0, 0, 0.

(d) Yes, a uniform distribution can be created in several ways, for example by labelling the rth die with the numbers {0, 1, 2, 3, 4, 5} × 6^r.

Solution to exercise 2.17 (p.36).

   a = ln(p/q)  ⇒  p/q = e^a   (2.63)

and q = 1 − p gives

   p/(1 − p) = e^a   (2.64)

   ⇒  p = e^a/(e^a + 1) = 1/(1 + exp(−a)).   (2.65)

The hyperbolic tangent is

   tanh(a) = (e^a − e^{−a})/(e^a + e^{−a})   (2.66)

To think about: does this uniform distribution contradict the central-limit theorem?


so

   f(a) ≡ 1/(1 + exp(−a)) = (1/2) [ (1 − e^{−a})/(1 + e^{−a}) + 1 ]
        = (1/2) [ (e^{a/2} − e^{−a/2})/(e^{a/2} + e^{−a/2}) + 1 ] = (1/2)(tanh(a/2) + 1).   (2.67)

In the case b = log2(p/q), we can repeat steps (2.63–2.65), replacing e by 2, to obtain

   p = 1/(1 + 2^{−b}).   (2.68)

Solution to exercise 2.18 (p.36).

   P(x | y) = P(y | x)P(x)/P(y)   (2.69)

   ⇒  P(x = 1 | y)/P(x = 0 | y) = [P(y | x = 1)/P(y | x = 0)] [P(x = 1)/P(x = 0)]   (2.70)

   ⇒  log [P(x = 1 | y)/P(x = 0 | y)] = log [P(y | x = 1)/P(y | x = 0)] + log [P(x = 1)/P(x = 0)].   (2.71)

Solution to exercise 2.19 (p.36). The conditional independence of d1 and d2 given x means

   P(x, d1, d2) = P(x)P(d1 | x)P(d2 | x).   (2.72)

This gives a separation of the posterior probability ratio into a series of factors, one for each data point, times the prior probability ratio.

   P(x = 1 | {di})/P(x = 0 | {di}) = [P({di} | x = 1)/P({di} | x = 0)] [P(x = 1)/P(x = 0)]   (2.73)
     = [P(d1 | x = 1)/P(d1 | x = 0)] [P(d2 | x = 1)/P(d2 | x = 0)] [P(x = 1)/P(x = 0)].   (2.74)

Life in high-dimensional spaces

Solution to exercise 2.20 (p.37). The volume of a hypersphere of radius r in N dimensions is in fact

   V(r, N) = (π^{N/2}/(N/2)!) r^N,   (2.75)

but you don't need to know this. For this question all that we need is the r-dependence, V(r, N) ∝ r^N. So the fractional volume in (r − ε, r) is

   (r^N − (r − ε)^N)/r^N = 1 − (1 − ε/r)^N.   (2.76)

The fractional volumes in the shells for the required cases are:

                N = 2    N = 10   N = 1000
   ε/r = 0.01   0.02     0.096    0.99996
   ε/r = 0.5    0.75     0.999    1 − 2^{−1000}

Notice that no matter how small ε is, for large enough N essentially all the probability mass is in the surface shell of thickness ε.
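The entries in the table above follow directly from equation (2.53); a one-loop Python check (not part of the book):

# f = 1 - (1 - eps/r)**N, equation (2.53)
for ratio in (0.01, 0.5):
    print(f"eps/r = {ratio}:", [1 - (1 - ratio) ** N for N in (2, 10, 1000)])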


Solution to exercise 2.21 (p.37). pa = 0.1, pb = 0.2, pc = 0.7. f(a) = 10, f(b) = 5, and f(c) = 10/7.

   E[f(x)] = 0.1 × 10 + 0.2 × 5 + 0.7 × 10/7 = 3.   (2.77)

For each x, f(x) = 1/P(x), so

   E[1/P(x)] = E[f(x)] = 3.   (2.78)

Solution to exercise 2.22 (p.37). For general X,

   E[1/P(x)] = Σ_{x∈AX} P(x) (1/P(x)) = Σ_{x∈AX} 1 = |AX|.   (2.79)

Solution to exercise 2.23 (p.37). pa = 0.1, pb = 0.2, pc = 0.7. g(a) = 0, g(b) = 1, and g(c) = 0.

   E[g(x)] = pb = 0.2.   (2.80)

Solution to exercise 2.24 (p.37).

   P(P(x) ∈ [0.15, 0.5]) = pb = 0.2.   (2.81)

   P(|log [P(x)/0.2]| > 0.05) = pa + pc = 0.8.   (2.82)

Solution to exercise 2.25 (p.37). This type of question can be approached in two ways: either by differentiating the function to be maximized, finding the maximum, and proving it is a global maximum; this strategy is somewhat risky since it is possible for the maximum of a function to be at the boundary of the space, at a place where the derivative is not zero. Alternatively, a carefully chosen inequality can establish the answer. The second method is much neater.

Proof by differentiation (not the recommended method). Since it is slightly easier to differentiate ln 1/p than log2 1/p, we temporarily define H(X) to be measured using natural logarithms, thus scaling it down by a factor of log2 e.

   H(X) = Σ_i pi ln 1/pi   (2.83)

   ∂H(X)/∂pi = ln 1/pi − 1;   (2.84)

we maximize subject to the constraint Σ_i pi = 1, which can be enforced with a Lagrange multiplier:

   G(p) ≡ H(X) + λ (Σ_i pi − 1)   (2.85)

   ∂G(p)/∂pi = ln 1/pi − 1 + λ.   (2.86)

At a maximum,

   ln 1/pi − 1 + λ = 0   (2.87)

   ⇒  ln 1/pi = 1 − λ,   (2.88)

so all the pi are equal. That this extremum is indeed a maximum is established by finding the curvature:

   ∂²G(p)/∂pi∂pj = −(1/pi) δij,   (2.89)

which is negative definite. □


Proof using Jensen's inequality (recommended method). First a reminder of the inequality.

   If f is a convex ^ function and x is a random variable then: E[f(x)] ≥ f(E[x]). If f is strictly convex ^ and E[f(x)] = f(E[x]), then the random variable x is a constant (with probability 1).

The secret of a proof using Jensen's inequality is to choose the right function and the right random variable. We could define

   f(u) = log 1/u = −log u   (2.90)

(which is a convex function) and think of H(X) = Σ pi log 1/pi as the mean of f(u) where u = P(x), but this would not get us there – it would give us an inequality in the wrong direction. If instead we define

   u = 1/P(x)   (2.91)

then we find:

   H(X) = −E[f(1/P(x))] ≤ −f(E[1/P(x)]);   (2.92)

now we know from exercise 2.22 (p.37) that E[1/P(x)] = |AX|, so

   H(X) ≤ −f(|AX|) = log |AX|.   (2.93)

Equality holds only if the random variable u = 1/P(x) is a constant, which means P(x) is a constant for all x. □

Solution to exercise 2.26 (p.37).

   DKL(P||Q) = Σ_x P(x) log [P(x)/Q(x)].   (2.94)

We prove Gibbs' inequality using Jensen's inequality. Let f(u) = log 1/u and u = Q(x)/P(x). Then

   DKL(P||Q) = E[f(Q(x)/P(x))]   (2.95)
     ≥ f( Σ_x P(x) Q(x)/P(x) ) = log [ 1 / Σ_x Q(x) ] = 0,   (2.96)

with equality only if u = Q(x)/P(x) is a constant, that is, if Q(x) = P(x). □

Second solution. In the above proof the expectations were with respect to the probability distribution P(x). A second solution method uses Jensen's inequality with Q(x) instead. We define f(u) = u log u and let u = P(x)/Q(x). Then

   DKL(P||Q) = Σ_x Q(x) f(P(x)/Q(x)) = Σ_x Q(x) [P(x)/Q(x)] log [P(x)/Q(x)]   (2.97)
     ≥ f( Σ_x Q(x) P(x)/Q(x) ) = f(1) = 0,   (2.98)

with equality only if u = P(x)/Q(x) is a constant, that is, if Q(x) = P(x). □


Solution to exercise 2.28 (p.38).

   H(X) = H2(f) + f H2(g) + (1 − f)H2(h).   (2.99)
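Here is a small sketch (not part of the book; the biases are arbitrary) checking equation (2.99) against the four-term entropy computed directly from the probability distribution of x in exercise 2.28.

from math import log2

def H2(x):
    # Binary entropy function in bits.
    return 0.0 if x in (0.0, 1.0) else x * log2(1 / x) + (1 - x) * log2(1 / (1 - x))

f, g, h = 0.3, 0.6, 0.25                                   # arbitrary biases
p = [f * g, f * (1 - g), (1 - f) * h, (1 - f) * (1 - h)]   # P(x) for x = 0, 1, 2, 3
four_term = sum(pi * log2(1 / pi) for pi in p)
print(four_term, H2(f) + f * H2(g) + (1 - f) * H2(h))      # the two agree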

Solution to exercise 2.29 (p.38). The probability that there are x − 1 tails and then one head (so we get the first head on the xth toss) is

   P(x) = (1 − f)^{x−1} f.   (2.100)

If the first toss is a tail, the probability distribution for the future looks just like it did before we made the first toss. Thus we have a recursive expression for the entropy:

   H(X) = H2(f) + (1 − f)H(X).   (2.101)

Rearranging,

   H(X) = H2(f)/f.   (2.102)

Solution to exercise 2.34 (p.38). The probability of the number of tails t is

   P(t) = (1/2)^t (1/2)   for t ≥ 0.   (2.103)

The expected number of heads is 1, by definition of the problem. The expected number of tails is

   E[t] = Σ_{t=0}^∞ (1/2)(1/2)^t t,   (2.104)

which may be shown to be 1 in a variety of ways. For example, since the situation after one tail is thrown is equivalent to the opening situation, we can write down the recurrence relation

   E[t] = (1/2)(1 + E[t]) + (1/2)·0  ⇒  E[t] = 1.   (2.105)

The probability distribution of the 'estimator' f̂ = 1/(1 + t), given that f = 1/2, is plotted in figure 2.12. The probability of f̂ is simply the probability of the corresponding value of t.

[Figure 2.12. The probability distribution of the estimator f̂ = 1/(1 + t), given that f = 1/2.]

Solution to exercise 2.35 (p.38).

(a) The mean number of rolls from one six to the next six is six (assuming we start counting rolls after the first of the two sixes). The probability that the next six occurs on the rth roll is the probability of not getting a six for r − 1 rolls multiplied by the probability of then getting a six:

   P(r1 = r) = (5/6)^{r−1} (1/6),   for r ∈ {1, 2, 3, ...}.   (2.106)

This probability distribution of the number of rolls, r, may be called an exponential distribution, since

   P(r1 = r) = e^{−αr}/Z,   (2.107)

where α = ln(6/5), and Z is a normalizing constant.

(b) The mean number of rolls from the clock until the next six is six.

(c) The mean number of rolls, going back in time, until the most recent six is six.


(d) The mean number of rolls from the six before the clock struck to the six after the clock struck is the sum of the answers to (b) and (c), less one, that is, eleven.

(e) Rather than explaining the difference between (a) and (d), let me give another hint. Imagine that the buses in Poissonville arrive independently at random (a Poisson process), with, on average, one bus every six minutes. Imagine that passengers turn up at bus-stops at a uniform rate, and are scooped up by the bus without delay, so the interval between two buses remains constant. Buses that follow gaps bigger than six minutes become overcrowded. The passengers' representative complains that two-thirds of all passengers found themselves on overcrowded buses. The bus operator claims, 'no, no – only one third of our buses are overcrowded'. Can both these claims be true?

Solution to exercise 2.38 (p.39).

Binomial distribution method. From the solution to exercise 1.2, pB = 3f²(1 − f) + f³.

Sum rule method. The marginal probabilities of the eight values of r are illustrated by:

   P(r = 000) = 1/2 (1 − f)³ + 1/2 f³,   (2.108)

   P(r = 001) = 1/2 f(1 − f)² + 1/2 f²(1 − f) = 1/2 f(1 − f).   (2.109)

The posterior probabilities are represented by

   P(s = 1 | r = 000) = f³ / ((1 − f)³ + f³)   (2.110)

and

   P(s = 1 | r = 001) = (1 − f)f² / (f(1 − f)² + f²(1 − f)) = f.   (2.111)

[Figure 2.13. The probability distribution of the number of rolls r1 from one 6 to the next (falling solid line), P(r1 = r) = (5/6)^{r−1}(1/6), and the probability distribution (dashed line) of the number of rolls from the 6 before 1pm to the next 6, rtot, P(rtot = r) = r (5/6)^{r−1}(1/6)². The probability P(r1 > 6) is about 1/3; the probability P(rtot > 6) is about 2/3. The mean of r1 is 6, and the mean of rtot is 11.]

The probabilities of error in these representative cases are thus

   P(error | r = 000) = f³ / ((1 − f)³ + f³)   (2.112)

and

   P(error | r = 001) = f.   (2.113)

Notice that while the average probability of error of R3 is about 3f², the probability (given r) that any particular bit is wrong is either about f³ or f.

The average error probability, using the sum rule, is

   P(error) = Σ_r P(r) P(error | r)
            = 2 [1/2 (1 − f)³ + 1/2 f³] · f³/((1 − f)³ + f³) + 6 [1/2 f(1 − f)] · f.

So P(error) = f³ + 3f²(1 − f).

Solution to exercise 2.39 (p.40).

The entropy is 9.7 bits per word.

The first two terms are for the cases r = 000 and 111; the remaining 6 are for the other outcomes, which share the same probability of occurring and identical error probability, f .
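The agreement of the two methods is easy to confirm numerically; a tiny sketch (not part of the book; f = 0.1 is an arbitrary choice):

f = 0.1

binomial_method = f**3 + 3 * f**2 * (1 - f)

sum_rule_method = (2 * (0.5 * (1 - f)**3 + 0.5 * f**3) * f**3 / ((1 - f)**3 + f**3)
                   + 6 * (0.5 * f * (1 - f)) * f)

print(binomial_method, sum_rule_method)   # both 0.028 for f = 0.1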


About Chapter 3

If you are eager to get on to information theory, data compression, and noisy channels, you can skip to Chapter 4. Data compression and data modelling are intimately connected, however, so you'll probably want to come back to this chapter by the time you get to Chapter 6. Before reading Chapter 3, it might be good to look at the following exercises.

. Exercise 3.1.[2, p.59] A die is selected at random from two twenty-faced dice on which the symbols 1–10 are written with nonuniform frequency as follows.

   Symbol                     1  2  3  4  5  6  7  8  9  10
   Number of faces of die A   6  4  3  2  1  1  1  1  1  0
   Number of faces of die B   3  3  2  2  2  2  2  2  1  1

The randomly chosen die is rolled 7 times, with the following outcomes: 5, 3, 9, 3, 8, 4, 7. What is the probability that the die is die A?

. Exercise 3.2.[2, p.59] Assume that there is a third twenty-faced die, die C, on which the symbols 1–20 are written once each. As above, one of the three dice is selected at random and rolled 7 times, giving the outcomes: 3, 5, 4, 8, 3, 9, 7. What is the probability that the die is (a) die A, (b) die B, (c) die C?

Exercise 3.3.[3, p.48] Inferring a decay constant
Unstable particles are emitted from a source and decay at a distance x, a real number that has an exponential probability distribution with characteristic length λ. Decay events can be observed only if they occur in a window extending from x = 1 cm to x = 20 cm. N decays are observed at locations {x1, ..., xN}. What is λ?

[Figure: example decay events marked at their observed positions x along the axis, within the 1 cm – 20 cm window.]

. Exercise 3.4.[3, p.55] Forensic evidence
Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type 'O' blood. The blood groups of the two traces are found to be of type 'O' (a common type in the local population, having frequency 60%) and of type 'AB' (a rare type, with frequency 1%). Do these data (type 'O' and 'AB' blood were found at scene) give evidence in favour of the proposition that Oliver was one of the two people present at the crime?


3 More about Inference

It is not a controversial statement that Bayes’ theorem provides the correct language for describing the inference of a message communicated over a noisy channel, as we used it in Chapter 1 (p.6). But strangely, when it comes to other inference problems, the use of Bayes’ theorem is not so widespread.

3.1 A first inference problem

When I was an undergraduate in Cambridge, I was privileged to receive supervisions from Steve Gull. Sitting at his desk in a dishevelled office in St. John's College, I asked him how one ought to answer an old Tripos question (exercise 3.3):

   Unstable particles are emitted from a source and decay at a distance x, a real number that has an exponential probability distribution with characteristic length λ. Decay events can be observed only if they occur in a window extending from x = 1 cm to x = 20 cm. N decays are observed at locations {x1, ..., xN}. What is λ?

[Figure: example decay events marked at their observed positions x along the axis, within the 1 cm – 20 cm window.]

I had scratched my head over this for some time. My education had provided me with a couple of approaches to solving such inference problems: constructing 'estimators' of the unknown parameters; or 'fitting' the model to the data, or to a processed version of the data.

Since the mean of an unconstrained exponential distribution is λ, it seemed reasonable to examine the sample mean x̄ = Σ_n xn/N and see if an estimator λ̂ could be obtained from it. It was evident that the estimator λ̂ = x̄ − 1 would be appropriate for λ ≪ 20 cm, but not for cases where the truncation of the distribution at the right-hand side is significant; with a little ingenuity and the introduction of ad hoc bins, promising estimators for λ ≫ 20 cm could be constructed. But there was no obvious estimator that would work under all conditions.

Nor could I find a satisfactory approach based on fitting the density P(x | λ) to a histogram derived from the data. I was stuck.

What is the general solution to this problem and others like it? Is it always necessary, when confronted by a new inference problem, to grope in the dark for appropriate 'estimators' and worry about finding the 'best' estimator (whatever that means)?


[Figure 3.1. The probability density P(x | λ) as a function of x, for λ = 2, 5, 10.]

[Figure 3.2. The probability density P(x | λ) as a function of λ, for three different values of x (x = 3, 5, 12). When plotted this way round, the function is known as the likelihood of λ. The marks indicate the three values of λ, λ = 2, 5, 10, that were used in the preceding figure.]

Steve wrote down the probability of one data point, given λ:

   P(x | λ) = (1/λ) e^{−x/λ} / Z(λ)   for 1 < x < 20,
   P(x | λ) = 0                        otherwise,   (3.1)

where

   Z(λ) = ∫_1^20 dx (1/λ) e^{−x/λ} = e^{−1/λ} − e^{−20/λ}.   (3.2)

This seemed obvious enough. Then he wrote Bayes' theorem:

   P(λ | {x1, ..., xN}) = P({x} | λ) P(λ) / P({x})   (3.3)
                        ∝ (1/(λZ(λ))^N) exp( −Σ_1^N xn/λ ) P(λ).   (3.4)

Suddenly, the straightforward distribution P({x1, ..., xN} | λ), defining the probability of the data given the hypothesis λ, was being turned on its head so as to define the probability of a hypothesis given the data. A simple figure showed the probability of a single data point P(x | λ) as a familiar function of x, for different values of λ (figure 3.1). Each curve was an innocent exponential, normalized to have area 1. Plotting the same function as a function of λ for a fixed value of x, something remarkable happens: a peak emerges (figure 3.2). To help understand these two points of view of the one function, figure 3.3 shows a surface plot of P(x | λ) as a function of x and λ.

For a dataset consisting of several points, e.g., the six points {xn}_{n=1}^N = {1.5, 2, 3, 4, 5, 12}, the likelihood function P({x} | λ) is the product of the N functions of λ, P(xn | λ) (figure 3.4).

[Figure 3.3. The probability density P(x | λ) as a function of x and λ. Figures 3.1 and 3.2 are vertical sections through this surface.]

[Figure 3.4. The likelihood function in the case of a six-point dataset, P({x} = {1.5, 2, 3, 4, 5, 12} | λ), as a function of λ.]
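The likelihood is straightforward to evaluate on a grid of λ; the following sketch (not part of the book; the grid spacing and the flat prior are assumptions made for illustration) reproduces the shape of figure 3.4 numerically and reports where the six-point likelihood peaks.

from math import exp

data = [1.5, 2, 3, 4, 5, 12]

def Z(lam):
    # Normalizing constant of the truncated exponential, equation (3.2).
    return exp(-1.0 / lam) - exp(-20.0 / lam)

def likelihood(lam):
    # P({x} | lambda) for the 1 cm < x < 20 cm window, equations (3.1) and (3.4).
    return exp(-sum(data) / lam) / (lam * Z(lam)) ** len(data)

lams = [0.1 * k for k in range(10, 1000)]        # grid of lambda from 1 to 100
post = [likelihood(l) for l in lams]             # with a flat prior, posterior ∝ likelihood
print("likelihood peaks near lambda =", lams[post.index(max(post))])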


Steve summarized Bayes' theorem as embodying the fact that

   what you know about λ after the data arrive is what you knew before [P(λ)], and what the data told you [P({x} | λ)].

Probabilities are used here to quantify degrees of belief. To nip possible confusion in the bud, it must be emphasized that the hypothesis λ that correctly describes the situation is not a stochastic variable, and the fact that the Bayesian uses a probability distribution P does not mean that he thinks of the world as stochastically changing its nature between the states described by the different hypotheses. He uses the notation of probabilities to represent his beliefs about the mutually exclusive micro-hypotheses (here, values of λ), of which only one is actually true. That probabilities can denote degrees of belief, given assumptions, seemed reasonable to me. The posterior probability distribution (3.4) represents the unique and complete solution to the problem. There is no need to invent ‘estimators’; nor do we need to invent criteria for comparing alternative estimators with each other. Whereas orthodox statisticians offer twenty ways of solving a problem, and another twenty different criteria for deciding which of these solutions is the best, Bayesian statistics only offers one answer to a well-posed problem.

Assumptions in inference

Our inference is conditional on our assumptions [for example, the prior P(λ)]. Critics view such priors as a difficulty because they are 'subjective', but I don't see how it could be otherwise. How can one perform inference without making assumptions? I believe that it is of great value that Bayesian methods force one to make these tacit assumptions explicit.

First, once assumptions are made, the inferences are objective and unique, reproducible with complete agreement by anyone who has the same information and makes the same assumptions. For example, given the assumptions listed above, H, and the data D, everyone will agree about the posterior probability of the decay length λ:

   P(λ | D, H) = P(D | λ, H) P(λ | H) / P(D | H).   (3.5)

Second, when the assumptions are explicit, they are easier to criticize, and easier to modify – indeed, we can quantify the sensitivity of our inferences to the details of the assumptions. For example, we can note from the likelihood curves in figure 3.2 that in the case of a single data point at x = 5, the likelihood function is less strongly peaked than in the case x = 3; the details of the prior P(λ) become increasingly important as the sample mean x̄ gets closer to the middle of the window, 10.5. In the case x = 12, the likelihood function doesn't have a peak at all – such data merely rule out small values of λ, and don't give any information about the relative probabilities of large values of λ. So in this case, the details of the prior at the small-λ end of things are not important, but at the large-λ end, the prior is important.

Third, when we are not sure which of various alternative assumptions is the most appropriate for a problem, we can treat this question as another inference task. Thus, given data D, we can compare alternative assumptions H using Bayes' theorem:

   P(H | D, I) = P(D | H, I) P(H | I) / P(D | I),   (3.6)

If you have any difficulty understanding this chapter I recommend ensuring you are happy with exercises 3.1 and 3.2 (p.47) then noting their similarity to exercise 3.3.


where I denotes the highest assumptions, which we are not questioning.

Fourth, we can take into account our uncertainty regarding such assumptions when we make subsequent predictions. Rather than choosing one particular assumption H*, and working out our predictions about some quantity t, P(t | D, H*, I), we obtain predictions that take into account our uncertainty about H by using the sum rule:

   P(t | D, I) = Σ_H P(t | D, H, I) P(H | D, I).   (3.7)

This is another contrast with orthodox statistics, in which it is conventional to ‘test’ a default model, and then, if the test ‘accepts the model’ at some ‘significance level’, to use exclusively that model to make predictions. Steve thus persuaded me that probability theory reaches parts that ad hoc methods cannot reach. Let’s look at a few more examples of simple inference problems.

3.2 The bent coin

A bent coin is tossed F times; we observe a sequence s of heads and tails (which we'll denote by the symbols a and b). We wish to know the bias of the coin, and predict the probability that the next toss will result in a head. We first encountered this task in example 2.7 (p.30), and we will encounter it again in Chapter 6, when we discuss adaptive data compression. It is also the original inference problem studied by Thomas Bayes in his essay published in 1763.

As in exercise 2.8 (p.30), we will assume a uniform prior distribution and obtain a posterior distribution by multiplying by the likelihood. A critic might object, 'where did this prior come from?' I will not claim that the uniform prior is in any way fundamental; indeed we'll give examples of nonuniform priors later. The prior is a subjective assumption. One of the themes of this book is:

   you can't do inference – or data compression – without making assumptions.

We give the name H1 to our assumptions. [We'll be introducing an alternative set of assumptions in a moment.] The probability, given pa, that F tosses result in a sequence s that contains {Fa, Fb} counts of the two outcomes is

   P(s | pa, F, H1) = pa^{Fa} (1 − pa)^{Fb}.   (3.8)

[For example, P(s = aaba | pa, F = 4, H1) = pa pa (1 − pa) pa.] Our first model assumes a uniform prior distribution for pa,

   P(pa | H1) = 1,   pa ∈ [0, 1]   (3.9)

and pb ≡ 1 − pa.

Inferring unknown parameters

Given a string of length F of which Fa are as and Fb are bs, we are interested in (a) inferring what pa might be; (b) predicting whether the next character is


an a or a b. [Predictions are always expressed as probabilities. So 'predicting whether the next character is an a' is the same as computing the probability that the next character is an a.]

Assuming H1 to be true, the posterior probability of pa, given a string s of length F that has counts {Fa, Fb}, is, by Bayes' theorem,

   P(pa | s, F, H1) = P(s | pa, F, H1) P(pa | H1) / P(s | F, H1).   (3.10)

The factor P(s | pa, F, H1), which, as a function of pa, is known as the likelihood function, was given in equation (3.8); the prior P(pa | H1) was given in equation (3.9). Our inference of pa is thus:

   P(pa | s, F, H1) = pa^{Fa} (1 − pa)^{Fb} / P(s | F, H1).   (3.11)

The normalizing constant is given by the beta integral

   P(s | F, H1) = ∫_0^1 dpa pa^{Fa} (1 − pa)^{Fb} = Γ(Fa + 1)Γ(Fb + 1)/Γ(Fa + Fb + 2) = Fa! Fb!/(Fa + Fb + 1)!.   (3.12)

Exercise 3.5.[2, p.59] Sketch the posterior probability P(pa | s = aba, F = 3). What is the most probable value of pa (i.e., the value that maximizes the posterior probability density)? What is the mean value of pa under this distribution?

Answer the same questions for the posterior probability P(pa | s = bbb, F = 3).

From inferences to predictions

Our prediction about the next toss, the probability that the next toss is an a, is obtained by integrating over pa. This has the effect of taking into account our uncertainty about pa when making predictions. By the sum rule,

   P(a | s, F) = ∫ dpa P(a | pa) P(pa | s, F).   (3.13)

The probability of an a given pa is simply pa, so

   P(a | s, F) = ∫ dpa pa [pa^{Fa} (1 − pa)^{Fb} / P(s | F)]   (3.14)
               = ∫ dpa [pa^{Fa+1} (1 − pa)^{Fb} / P(s | F)]   (3.15)
               = [(Fa + 1)! Fb!/(Fa + Fb + 2)!] / [Fa! Fb!/(Fa + Fb + 1)!] = (Fa + 1)/(Fa + Fb + 2),   (3.16)

which is known as Laplace's rule.
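Laplace's rule can be checked against a direct numerical integration of the posterior (3.11); a small sketch (not part of the book; the counts Fa = 3, Fb = 1 are an arbitrary example):

Fa, Fb = 3, 1                                              # arbitrary example counts

grid = [i / 10000 for i in range(1, 10000)]
weight = [p**Fa * (1 - p)**Fb for p in grid]               # unnormalized posterior (3.11)
numeric = sum(p * w for p, w in zip(grid, weight)) / sum(weight)

laplace = (Fa + 1) / (Fa + Fb + 2)                         # equation (3.16)
print(numeric, laplace)                                    # both close to 4/6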

3.3 The bent coin and model comparison

Imagine that a scientist introduces another theory for our data. He asserts that the source is not really a bent coin but is really a perfectly formed die with one face painted heads ('a') and the other five painted tails ('b'). Thus the parameter pa, which in the original model, H1, could take any value between 0 and 1, is according to the new hypothesis, H0, not a free parameter at all; rather, it is equal to 1/6. [This hypothesis is termed H0 so that the suffix of each model indicates its number of free parameters.]

How can we compare these two models in the light of data? We wish to infer how probable H1 is relative to H0.


Model comparison as inference

In order to perform model comparison, we write down Bayes' theorem again, but this time with a different argument on the left-hand side. We wish to know how probable H1 is given the data. By Bayes' theorem,

   P(H1 | s, F) = P(s | F, H1) P(H1) / P(s | F).   (3.17)

Similarly, the posterior probability of H0 is

   P(H0 | s, F) = P(s | F, H0) P(H0) / P(s | F).   (3.18)

The normalizing constant in both cases is P(s | F), which is the total probability of getting the observed data. If H1 and H0 are the only models under consideration, this probability is given by the sum rule:

   P(s | F) = P(s | F, H1)P(H1) + P(s | F, H0)P(H0).   (3.19)

To evaluate the posterior probabilities of the hypotheses we need to assign values to the prior probabilities P(H1) and P(H0); in this case, we might set these to 1/2 each. And we need to evaluate the data-dependent terms P(s | F, H1) and P(s | F, H0). We can give names to these quantities. The quantity P(s | F, H1) is a measure of how much the data favour H1, and we call it the evidence for model H1. We already encountered this quantity in equation (3.10) where it appeared as the normalizing constant of the first inference we made – the inference of pa given the data.

   How model comparison works: The evidence for a model is usually the normalizing constant of an earlier Bayesian inference.

We evaluated the normalizing constant for model H1 in (3.12). The evidence for model H0 is very simple because this model has no parameters to infer. Defining p0 to be 1/6, we have

   P(s | F, H0) = p0^{Fa} (1 − p0)^{Fb}.   (3.20)

Thus the posterior probability ratio of model H1 to model H0 is

   P(H1 | s, F)/P(H0 | s, F) = P(s | F, H1)P(H1) / [P(s | F, H0)P(H0)]   (3.21)
                             = [Fa! Fb!/(Fa + Fb + 1)!] / [p0^{Fa} (1 − p0)^{Fb}].   (3.22)

Some values of this posterior probability ratio are illustrated in table 3.5. The first five lines illustrate that some outcomes favour one model, and some favour the other. No outcome is completely incompatible with either model. With small amounts of data (six tosses, say) it is typically not the case that one of the two models is overwhelmingly more probable than the other. But with more data, the evidence against H0 given by any data set with the ratio Fa : Fb differing from 1 : 5 mounts up. You can't predict in advance how much data are needed to be pretty sure which theory is true. It depends what pa is.

The simpler model, H0, since it has no adjustable parameters, is able to lose out by the biggest margin. The odds may be hundreds to one against it. The more complex model can never lose out by a large margin; there's no data set that is actually unlikely given model H1.


   F    Data (Fa, Fb)    P(H1 | s, F)/P(H0 | s, F)

   6    (5, 1)           222.2
   6    (3, 3)           2.67
   6    (2, 4)           0.71  = 1/1.4
   6    (1, 5)           0.356 = 1/2.8
   6    (0, 6)           0.427 = 1/2.3

   20   (10, 10)         96.5
   20   (3, 17)          0.2   = 1/5
   20   (0, 20)          1.83

Table 3.5. Outcome of model comparison between models H1 and H0 for the 'bent coin'. Model H0 states that pa = 1/6, pb = 5/6.

[Figure 3.6 (nine panels, discussed below): the log evidence ratio as a function of the number of tosses F, for three independent simulated experiments (rows) under three conditions (columns): pa = 1/6 (H0 is true), pa = 0.25, and pa = 0.5.]
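The entries of table 3.5 can be reproduced from equation (3.22); a short sketch (not part of the book):

from math import factorial

def posterior_ratio(Fa, Fb, p0=1/6):
    # Equation (3.22), with equal priors P(H1) = P(H0) = 1/2.
    evidence_H1 = factorial(Fa) * factorial(Fb) / factorial(Fa + Fb + 1)
    evidence_H0 = p0**Fa * (1 - p0)**Fb
    return evidence_H1 / evidence_H0

for Fa, Fb in [(5, 1), (3, 3), (2, 4), (1, 5), (0, 6), (10, 10), (3, 17), (0, 20)]:
    print((Fa, Fb), round(posterior_ratio(Fa, Fb), 3))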

. Exercise 3.6.[2] Show that after F tosses have taken place, the biggest value that the log evidence ratio

   log [P(s | F, H1)/P(s | F, H0)]   (3.23)

can have scales linearly with F if H1 is more probable, but the log evidence in favour of H0 can grow at most as log F.

. Exercise 3.7.[3, p.60] Putting your sampling theory hat on, assuming Fa has not yet been measured, compute a plausible range that the log evidence ratio might lie in, as a function of F and the true value of pa, and sketch it as a function of F for pa = p0 = 1/6, pa = 0.25, and pa = 1/2. [Hint: sketch the log evidence as a function of the random variable Fa and work out the mean and standard deviation of Fa.]

Typical behaviour of the evidence

Figure 3.6 shows the log evidence ratio as a function of the number of tosses, F, in a number of simulated experiments. In the left-hand experiments, H0 was true. In the right-hand ones, H1 was true, and the value of pa was either 0.25 or 0.5. We will discuss model comparison more in a later chapter.

[Figure 3.6 caption: Typical behaviour of the evidence in favour of H1 as bent coin tosses accumulate under three different conditions (columns 1, 2, 3). Horizontal axis is the number of tosses, F. The vertical axis on the left is ln [P(s | F, H1)/P(s | F, H0)]; the right-hand vertical axis shows the values of P(s | F, H1)/P(s | F, H0). The three rows show independent simulated experiments. (See also figure 3.8, p.60.)]


3.4 An example of legal evidence

The following example illustrates that there is more to Bayesian inference than the priors.

Two people have left traces of their own blood at the scene of a crime. A suspect, Oliver, is tested and found to have type 'O' blood. The blood groups of the two traces are found to be of type 'O' (a common type in the local population, having frequency 60%) and of type 'AB' (a rare type, with frequency 1%). Do these data (type 'O' and 'AB' blood were found at scene) give evidence in favour of the proposition that Oliver was one of the two people present at the crime?

A careless lawyer might claim that the fact that the suspect's blood type was found at the scene is positive evidence for the theory that he was present. But this is not so.

Denote the proposition 'the suspect and one unknown person were present' by S. The alternative, S̄, states 'two unknown people from the population were present'. The prior in this problem is the prior probability ratio between the propositions S and S̄. This quantity is important to the final verdict and would be based on all other available information in the case. Our task here is just to evaluate the contribution made by the data D, that is, the likelihood ratio, P(D | S, H)/P(D | S̄, H). In my view, a jury's task should generally be to multiply together carefully evaluated likelihood ratios from each independent piece of admissible evidence with an equally carefully reasoned prior probability. [This view is shared by many statisticians but learned British appeal judges recently disagreed and actually overturned the verdict of a trial because the jurors had been taught to use Bayes' theorem to handle complicated DNA evidence.]

The probability of the data given S is the probability that one unknown person drawn from the population has blood type AB:

   P(D | S, H) = pAB                                                     (3.24)

(since given S, we already know that one trace will be of type O). The probability of the data given S̄ is the probability that two unknown people drawn from the population have types O and AB:

   P(D | S̄, H) = 2 pO pAB.                                               (3.25)

In these equations H denotes the assumptions that two people were present and left blood there, and that the probability distribution of the blood groups of unknown people in an explanation is the same as the population frequencies.

Dividing, we obtain the likelihood ratio:

   P(D | S, H) / P(D | S̄, H) = 1/(2 pO) = 1/(2 × 0.6) = 0.83.            (3.26)

Thus the data in fact provide weak evidence against the supposition that Oliver was present. This result may be found surprising, so let us examine it from various points of view. First consider the case of another suspect, Alberto, who has type AB. Intuitively, the data do provide evidence in favour of the theory S′


that this suspect was present, relative to the null hypothesis S̄. And indeed the likelihood ratio in this case is:

   P(D | S′, H) / P(D | S̄, H) = 1/(2 pAB) = 50.                          (3.27)

Now let us change the situation slightly; imagine that 99% of people are of blood type O, and the rest are of type AB. Only these two blood types exist in the population. The data at the scene are the same as before. Consider again how these data influence our beliefs about Oliver, a suspect of type O, and Alberto, a suspect of type AB. Intuitively, we still believe that the presence of the rare AB blood provides positive evidence that Alberto was there. But does the fact that type O blood was detected at the scene favour the hypothesis that Oliver was present? If this were the case, that would mean that regardless of who the suspect is, the data make it more probable they were present; everyone in the population would be under greater suspicion, which would be absurd. The data may be compatible with any suspect of either blood type being present, but if they provide evidence for some theories, they must also provide evidence against other theories.

Here is another way of thinking about this: imagine that instead of two people's blood stains there are ten, and that in the entire local population of one hundred, there are ninety type O suspects and ten type AB suspects. Consider a particular type O suspect, Oliver: without any other information, and before the blood test results come in, there is a one in 10 chance that he was at the scene, since we know that 10 out of the 100 suspects were present. We now get the results of blood tests, and find that nine of the ten stains are of type AB, and one of the stains is of type O. Does this make it more likely that Oliver was there? No, there is now only a one in ninety chance that he was there, since we know that only one person present was of type O.

Maybe the intuition is aided finally by writing down the formulae for the general case where nO blood stains of individuals of type O are found, and nAB of type AB, a total of N individuals in all, and unknown people come from a large population with fractions pO, pAB. (There may be other blood types too.) The task is to evaluate the likelihood ratio for the two hypotheses: S, 'the type O suspect (Oliver) and N−1 unknown others left N stains'; and S̄, 'N unknowns left N stains'. The probability of the data under hypothesis S̄ is just the probability of getting nO, nAB individuals of the two types when N individuals are drawn at random from the population:

   P(nO, nAB | S̄) = [N! / (nO! nAB!)] pO^nO pAB^nAB.                     (3.28)

In the case of hypothesis S, we need the distribution of the N−1 other individuals:

   P(nO, nAB | S) = [(N−1)! / ((nO−1)! nAB!)] pO^(nO−1) pAB^nAB.          (3.29)

The likelihood ratio is:

   P(nO, nAB | S) / P(nO, nAB | S̄) = (nO/N) / pO.                        (3.30)

This is an instructive result. The likelihood ratio, i.e. the contribution of these data to the question of whether Oliver was present, depends simply on a comparison of the frequency of his blood type in the observed data with the background frequency in the population. There is no dependence on the counts of the other types found at the scene, or their frequencies in the population.


If there are more type O stains than the average number expected under hypothesis S̄, then the data give evidence in favour of the presence of Oliver. Conversely, if there are fewer type O stains than the expected number under S̄, then the data reduce the probability of the hypothesis that he was there. In the special case nO/N = pO, the data contribute no evidence either way, regardless of the fact that the data are compatible with the hypothesis S.
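For readers who like to check such numbers, here is a minimal Python sketch (mine, not part of the book) that evaluates the likelihood ratio of equation (3.30); the function name is an invention for illustration. It reproduces the values 0.83 and 50 found above, and the no-evidence special case.

    def stain_likelihood_ratio(n_type, N, p_type):
        """Equation (3.30): evidence that a suspect of the given blood type was
        one of the N people who left stains, versus N unknowns."""
        return (n_type / N) / p_type

    # Two stains, one type O and one type AB; population frequencies 60% and 1%.
    print(stain_likelihood_ratio(1, 2, 0.60))   # Oliver (type O): 0.83..., weak evidence against
    print(stain_likelihood_ratio(1, 2, 0.01))   # Alberto (type AB): 50, evidence in favour
    print(stain_likelihood_ratio(6, 10, 0.60))  # n_O/N = p_O: ratio 1, no evidence either way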

3.5 Exercises

Exercise 3.8.[2, p.60] The three doors, normal rules.

On a game show, a contestant is told the rules as follows: There are three doors, labelled 1, 2, 3. A single prize has been hidden behind one of them. You get to select one door. Initially your chosen door will not be opened. Instead, the gameshow host will open one of the other two doors, and he will do so in such a way as not to reveal the prize. For example, if you first choose door 1, he will then open one of doors 2 and 3, and it is guaranteed that he will choose which one to open so that the prize will not be revealed. At this point, you will be given a fresh choice of door: you can either stick with your first choice, or you can switch to the other closed door. All the doors will then be opened and you will receive whatever is behind your final choice of door.

Imagine that the contestant chooses door 1 first; then the gameshow host opens door 3, revealing nothing behind the door, as promised. Should the contestant (a) stick with door 1, or (b) switch to door 2, or (c) does it make no difference?

Exercise 3.9.[2, p.61] The three doors, earthquake scenario.

Imagine that the game happens again and just as the gameshow host is about to open one of the doors a violent earthquake rattles the building and one of the three doors flies open. It happens to be door 3, and it happens not to have the prize behind it. The contestant had initially chosen door 1. Repositioning his toupée, the host suggests, 'OK, since you chose door 1 initially, door 3 is a valid door for me to open, according to the rules of the game; I'll let door 3 stay open. Let's carry on as if nothing happened.'

Should the contestant stick with door 1, or switch to door 2, or does it make no difference? Assume that the prize was placed randomly, that the gameshow host does not know where it is, and that the door flew open because its latch was broken by the earthquake.

[A similar alternative scenario is a gameshow whose confused host forgets the rules, and where the prize is, and opens one of the unchosen doors at random. He opens door 3, and the prize is not revealed. Should the contestant choose what's behind door 1 or door 2? Does the optimal decision for the contestant depend on the contestant's beliefs about whether the gameshow host is confused or not?]

. Exercise 3.10.[2] Another example in which the emphasis is not on priors.

You visit a family whose three children are all at the local school. You don't know anything about the sexes of the children. While walking clumsily round the home, you stumble through one of the three unlabelled bedroom doors that you know belong, one each, to the three children, and find that the bedroom contains girlie stuff in sufficient quantities to convince you that the child who lives in that bedroom is a girl. Later, you sneak a look at a letter addressed to the parents, which reads 'From the Headmaster: we are sending this letter to all parents who have male children at the school to inform them about the following boyish matters. . . '.

These two sources of evidence establish that at least one of the three children is a girl, and that at least one of the children is a boy. What are the probabilities that there are (a) two girls and one boy; (b) two boys and one girl?

. Exercise 3.11.[2, p.61] Mrs S is found stabbed in her family garden. Mr S behaves strangely after her death and is considered as a suspect. On investigation of police and social records it is found that Mr S had beaten up his wife on at least nine previous occasions. The prosecution advances this data as evidence in favour of the hypothesis that Mr S is guilty of the murder. 'Ah no,' says Mr S's highly paid lawyer, 'statistically, only one in a thousand wife-beaters actually goes on to murder his wife.[1] So the wife-beating is not strong evidence at all. In fact, given the wife-beating evidence alone, it's extremely unlikely that he would be the murderer of his wife – only a 1/1000 chance. You should therefore find him innocent.'

Is the lawyer right to imply that the history of wife-beating does not point to Mr S's being the murderer? Or is the lawyer a slimy trickster? If the latter, what is wrong with his argument?

[Having received an indignant letter from a lawyer about the preceding paragraph, I'd like to add an extra inference exercise at this point: Does my suggestion that Mr. S.'s lawyer may have been a slimy trickster imply that I believe all lawyers are slimy tricksters? (Answer: No.)]

[1] In the U.S.A., it is estimated that 2 million women are abused each year by their partners. In 1994, 4739 women were victims of homicide; of those, 1326 women (28%) were slain by husbands and boyfriends. (Sources: http://www.umn.edu/mincava/papers/factoid.htm, http://www.gunfree.inter.net/vpc/womenfs.htm)

. Exercise 3.12.[2] A bag contains one counter, known to be either white or black. A white counter is put in, the bag is shaken, and a counter is drawn out, which proves to be white. What is now the chance of drawing a white counter? [Notice that the state of the bag, after the operations, is exactly identical to its state before.]

. Exercise 3.13.[2, p.62] You move into a new house; the phone is connected, and you're pretty sure that the phone number is 740511, but not as sure as you would like to be. As an experiment, you pick up the phone and dial 740511; you obtain a 'busy' signal. Are you now more sure of your phone number? If so, how much?

. Exercise 3.14.[1] In a game, two coins are tossed. If either of the coins comes up heads, you have won a prize. To claim the prize, you must point to one of your coins that is a head and say 'look, that coin's a head, I've won'. You watch Fred play the game. He tosses the two coins, and he points to a coin and says 'look, that coin's a head, I've won'. What is the probability that the other coin is a head?

. Exercise 3.15.[2, p.63] A statistical statement appeared in The Guardian on Friday January 4, 2002:

   When spun on edge 250 times, a Belgian one-euro coin came up heads 140 times and tails 110. 'It looks very suspicious to me', said Barry Blight, a statistics lecturer at the London School of Economics. 'If the coin were unbiased the chance of getting a result as extreme as that would be less than 7%'.

But do these data give evidence that the coin is biased rather than fair? [Hint: see equation (3.22).]

3.6 Solutions

Solution to exercise 3.1 (p.47). Let the data be D. Assuming equal prior probabilities,

   P(A | D)/P(B | D) = (1·3·1·3·1·2·1)/(2·2·1·2·2·2·2) = 9/32            (3.31)

and P(A | D) = 9/41.

Solution to exercise 3.2 (p.47). The probability of the data given each hypothesis is:

   P(D | A) = (3/20)(1/20)(2/20)(1/20)(3/20)(1/20)(1/20) = 18/20^7;      (3.32)
   P(D | B) = (2/20)(2/20)(2/20)(2/20)(2/20)(1/20)(2/20) = 64/20^7;      (3.33)
   P(D | C) = (1/20)(1/20)(1/20)(1/20)(1/20)(1/20)(1/20) = 1/20^7.       (3.34)

So

   P(A | D) = 18/(18 + 64 + 1) = 18/83;   P(B | D) = 64/83;   P(C | D) = 1/83.   (3.35)

Figure 3.7. Posterior probability for the bias pa of a bent coin given two different data sets:

(a) P(pa | s = aba, F = 3) ∝ pa^2 (1 − pa); (b) P(pa | s = bbb, F = 3) ∝ (1 − pa)^3. [Both plotted over pa from 0 to 1.]

Solution to exercise 3.5 (p.52). (a) P(pa | s = aba, F = 3) ∝ pa^2 (1 − pa). The most probable value of pa (i.e., the value that maximizes the posterior probability density) is 2/3. The mean value of pa is 3/5. See figure 3.7a.


(b) P(pa | s = bbb, F = 3) ∝ (1 − pa)^3. The most probable value of pa (i.e., the value that maximizes the posterior probability density) is 0. The mean value of pa is 1/5. See figure 3.7b.

[Figure 3.8 appears here: three panels plotting the log evidence ratio against the number of tosses F (0 to 200), for pa = 1/6 ('H0 is true') and for pa = 0.25 and pa = 0.5 ('H1 is true'); the right-hand scale runs from 1/100 to 1000/1. See the caption below.]

Solution to exercise 3.7 (p.54). The curves in figure 3.8 were found by finding the mean and standard deviation of Fa, then setting Fa to the mean ± two standard deviations to get a 95% plausible range for Fa, and computing the three corresponding values of the log evidence ratio.

Solution to exercise 3.8 (p.57). Let Hi denote the hypothesis that the prize is behind door i. We make the following assumptions: the three hypotheses H1, H2 and H3 are equiprobable a priori, i.e.,

   P(H1) = P(H2) = P(H3) = 1/3.                                          (3.36)

The datum we receive, after choosing door 1, is one of D = 3 and D = 2 (meaning door 3 or 2 is opened, respectively). We assume that these two possible outcomes have the following probabilities. If the prize is behind door 1 then the host has a free choice; in this case we assume that the host selects at random between D = 2 and D = 3. Otherwise the choice of the host is forced and the probabilities are 0 and 1.

   P(D = 2 | H1) = 1/2    P(D = 2 | H2) = 0    P(D = 2 | H3) = 1
   P(D = 3 | H1) = 1/2    P(D = 3 | H2) = 1    P(D = 3 | H3) = 0         (3.37)

Now, using Bayes' theorem, we evaluate the posterior probabilities of the hypotheses:

   P(Hi | D = 3) = P(D = 3 | Hi) P(Hi) / P(D = 3)                        (3.38)

   P(H1 | D = 3) = (1/2)(1/3)/P(D = 3)
   P(H2 | D = 3) = (1)(1/3)/P(D = 3)
   P(H3 | D = 3) = (0)(1/3)/P(D = 3)                                     (3.39)

The denominator P(D = 3) is (1/2) because it is the normalizing constant for this posterior distribution. So

   P(H1 | D = 3) = 1/3    P(H2 | D = 3) = 2/3    P(H3 | D = 3) = 0.      (3.40)

So the contestant should switch to door 2 in order to have the biggest chance of getting the prize.

Many people find this outcome surprising. There are two ways to make it more intuitive. One is to play the game thirty times with a friend and keep track of the frequency with which switching gets the prize. Alternatively, you can perform a thought experiment in which the game is played with a million doors.

Figure 3.8. Range of plausible values of the log evidence in favour of H1 as a function of F. The vertical axis on the left is log [P(s | F, H1)/P(s | F, H0)]; the right-hand vertical axis shows the values of P(s | F, H1)/P(s | F, H0). The solid line shows the log evidence if the random variable Fa takes on its mean value, Fa = pa F. The dotted lines show (approximately) the log evidence if Fa is at its 2.5th or 97.5th percentile. (See also figure 3.6, p.54.)


The rules are now that the contestant chooses one door, then the game show host opens 999,998 doors in such a way as not to reveal the prize, leaving the contestant's selected door and one other door closed. The contestant may now stick or switch. Imagine the contestant confronted by a million doors, of which doors 1 and 234,598 have not been opened, door 1 having been the contestant's initial guess. Where do you think the prize is?
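The thirty-games-with-a-friend experiment is easy to automate. The following Python sketch (mine, not the book's) simulates the normal-rules game of exercise 3.8 and estimates the winning frequency of each strategy; the results settle near 1/3 for sticking and 2/3 for switching, in agreement with equation (3.40).

    import random

    def play(switch, trials=100_000):
        """Simulate the three-door game under the normal rules of exercise 3.8."""
        wins = 0
        for _ in range(trials):
            prize = random.randrange(3)
            choice = 0  # the contestant always starts with door 1 (index 0)
            # the host opens a door that is neither the chosen door nor the prize
            opened = random.choice([d for d in range(3) if d not in (choice, prize)])
            if switch:
                choice = next(d for d in range(3) if d not in (choice, opened))
            wins += (choice == prize)
        return wins / trials

    print('stick: ', play(switch=False))   # ≈ 1/3
    print('switch:', play(switch=True))    # ≈ 2/3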

Solution to exercise 3.9 (p.57). If door 3 is opened by an earthquake, the inference comes out differently – even though visually the scene looks the same. The nature of the data, and the probability of the data, are both now different. The possible data outcomes are, firstly, that any number of the doors might have opened. We could label the eight possible outcomes d = (0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0), (0, 1, 1), . . . , (1, 1, 1). Secondly, it might be that the prize is visible after the earthquake has opened one or more doors. So the data D consists of the value of d, and a statement of whether the prize was revealed. It is hard to say what the probabilities of these outcomes are, since they depend on our beliefs about the reliability of the door latches and the properties of earthquakes, but it is possible to extract the desired posterior probability without naming the values of P(d | Hi) for each d. All that matters are the relative values of the quantities P(D | H1), P(D | H2), P(D | H3), for the value of D that actually occurred. [This is the likelihood principle, which we met in section 2.3.] The value of D that actually occurred is 'd = (0, 0, 1), and no prize visible'. First, it is clear that P(D | H3) = 0, since the datum that no prize is visible is incompatible with H3. Now, assuming that the contestant selected door 1, how does the probability P(D | H1) compare with P(D | H2)? Assuming that earthquakes are not sensitive to decisions of game show contestants, these two quantities have to be equal, by symmetry. We don't know how likely it is that door 3 falls off its hinges, but however likely it is, it's just as likely to do so whether the prize is behind door 1 or door 2. So, if P(D | H1) and P(D | H2) are equal, we obtain:

   P(H1 | D) = P(D | H1)(1/3)/P(D) = 1/2
   P(H2 | D) = P(D | H2)(1/3)/P(D) = 1/2
   P(H3 | D) = P(D | H3)(1/3)/P(D) = 0.                                  (3.41)

The two possible hypotheses are now equally likely. If we assume that the host knows where the prize is and might be acting deceptively, then the answer might be further modified, because we have to view the host's words as part of the data.

Confused? It's well worth making sure you understand these two gameshow problems. Don't worry, I slipped up on the second problem, the first time I met it.

There is a general rule which helps immensely when you have a confusing probability problem:

   Always write down the probability of everything. (Steve Gull)

From this joint probability, any desired inference can be mechanically obtained (figure 3.9).

[Figure 3.9 appears here: the probability of everything, for the second three-door problem, assuming an earthquake has just occurred – a joint table over where the prize is (door 1, 2, or 3) and which doors are opened by the earthquake (none, 1, 2, 3, 1,2, 1,3, 2,3, or 1,2,3), with entries such as p_none/3, p_3/3 and p_{1,2,3}/3. Here, p_3 is the probability that door 3 alone is opened by an earthquake.]

Solution to exercise 3.11 (p.58). The statistic quoted by the lawyer indicates the probability that a randomly selected wife-beater will also murder his wife. The probability that the husband was the murderer, given that the wife has been murdered, is a completely different quantity.


To deduce the latter, we need to make further assumptions about the probability that the wife is murdered by someone else. If she lives in a neighbourhood with frequent random murders, then this probability is large and the posterior probability that the husband did it (in the absence of other evidence) may not be very large. But in more peaceful regions, it may well be that the most likely person to have murdered you, if you are found murdered, is one of your closest relatives.

Let's work out some illustrative numbers with the help of the statistics on page 58. Let m = 1 denote the proposition that a woman has been murdered; h = 1, the proposition that the husband did it; and b = 1, the proposition that he beat her in the year preceding the murder. The statement 'someone else did it' is denoted by h = 0. We need to define P(h | m = 1), P(b | h = 1, m = 1), and P(b = 1 | h = 0, m = 1) in order to compute the posterior probability P(h = 1 | b = 1, m = 1).

From the statistics, we can read out P(h = 1 | m = 1) = 0.28. And if two million women out of 100 million are beaten, then P(b = 1 | h = 0, m = 1) = 0.02. Finally, we need a value for P(b | h = 1, m = 1): if a man murders his wife, how likely is it that this is the first time he laid a finger on her? I expect it's pretty unlikely; so maybe P(b = 1 | h = 1, m = 1) is 0.9 or larger. By Bayes' theorem, then,

   P(h = 1 | b = 1, m = 1) = (0.9 × 0.28) / (0.9 × 0.28 + 0.02 × 0.72) ≃ 95%.   (3.42)
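The arithmetic of equation (3.42) is a one-liner; this small Python sketch (mine) makes it easy to see how the posterior changes if you prefer different illustrative numbers.

    def posterior_husband(p_h=0.28, p_b_given_h=0.9, p_b_given_not_h=0.02):
        """P(h=1 | b=1, m=1) by Bayes' theorem, with the assumptions of the text."""
        num = p_b_given_h * p_h
        return num / (num + p_b_given_not_h * (1 - p_h))

    print(posterior_husband())                  # ≈ 0.95
    print(posterior_husband(p_b_given_h=0.5))   # still ≈ 0.9: the conclusion is robust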

One way to make obvious the sliminess of the lawyer on p.58 is to construct arguments, with the same logical structure as his, that are clearly wrong. For example, the lawyer could say 'Not only was Mrs. S murdered, she was murdered between 4.02pm and 4.03pm. Statistically, only one in a million wife-beaters actually goes on to murder his wife between 4.02pm and 4.03pm. So the wife-beating is not strong evidence at all. In fact, given the wife-beating evidence alone, it's extremely unlikely that he would murder his wife in this way – only a 1/1,000,000 chance.'

Solution to exercise 3.13 (p.58). There are two hypotheses. H0: your number is 740511; H1: it is another number. The data, D, are 'when I dialed 740511, I got a busy signal'. What is the probability of D, given each hypothesis? If your number is 740511, then we expect a busy signal with certainty: P(D | H0) = 1. On the other hand, if H1 is true, then the probability that the number dialled returns a busy signal is smaller than 1, since various other outcomes were also possible (a ringing tone, or a number-unobtainable signal, for example). The value of this probability P(D | H1) will depend on the probability α that a random phone number similar to your own phone number would be a valid phone number, and on the probability β that you get a busy signal when you dial a valid phone number.

I estimate from the size of my phone book that Cambridge has about 75 000 valid phone numbers, all of length six digits. The probability that a random six-digit number is valid is therefore about 75 000/10^6 = 0.075. If we exclude numbers beginning with 0, 1, and 9 from the random choice, the probability α is about 75 000/700 000 ≃ 0.1.


If we assume that telephone numbers are clustered then a misremembered number might be more likely to be valid than a randomly chosen number; so the probability, α, that our guessed number would be valid, assuming H1 is true, might be bigger than 0.1. Anyway, α must be somewhere between 0.1 and 1. We can carry forward this uncertainty in the probability and see how much it matters at the end.

The probability β that you get a busy signal when you dial a valid phone number is equal to the fraction of phones you think are in use or off-the-hook when you make your tentative call. This fraction varies from town to town and with the time of day. In Cambridge, during the day, I would guess that about 1% of phones are in use. At 4am, maybe 0.1%, or fewer.

The probability P(D | H1) is the product of α and β, that is, about 0.1 × 0.01 = 10^−3. According to our estimates, there's about a one-in-a-thousand chance of getting a busy signal when you dial a random number; or one-in-a-hundred, if valid numbers are strongly clustered; or one-in-10^4, if you dial in the wee hours.

How do the data affect your beliefs about your phone number? The posterior probability ratio is the likelihood ratio times the prior probability ratio:

   P(H0 | D)/P(H1 | D) = [P(D | H0)/P(D | H1)] [P(H0)/P(H1)].            (3.43)

The likelihood ratio is about 100-to-1 or 1000-to-1, so the posterior probability ratio is swung by a factor of 100 or 1000 in favour of H0. If the prior probability of H0 was 0.5 then the posterior probability is

   P(H0 | D) = 1 / (1 + P(H1 | D)/P(H0 | D)) ≃ 0.99 or 0.999.            (3.44)
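To carry the uncertainty in α and β through to the end, as the text suggests, a tiny Python sketch (mine; the function name is an invention) suffices:

    def posterior_H0(alpha, beta, prior_H0=0.5):
        """P(H0 | busy signal), with P(D|H0) = 1 and P(D|H1) = alpha * beta."""
        like_H0, like_H1 = 1.0, alpha * beta
        evidence = like_H0 * prior_H0 + like_H1 * (1 - prior_H0)
        return like_H0 * prior_H0 / evidence

    print(posterior_H0(alpha=0.1, beta=0.01))    # ≈ 0.999
    print(posterior_H0(alpha=1.0, beta=0.01))    # ≈ 0.99  (valid numbers strongly clustered)
    print(posterior_H0(alpha=0.1, beta=0.001))   # ≈ 0.9999 (dialling in the wee hours)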

Solution to exercise 3.15 (p.59). We compare the models H0 – the coin is fair – and H1 – the coin is biased, with the prior on its bias set to the uniform distribution P(p | H1) = 1. [The use of a uniform prior seems reasonable to me, since I know that some coins, such as American pennies, have severe biases when spun on edge; so the situations p = 0.01 or p = 0.1 or p = 0.95 would not surprise me.] When I mention H0 – the coin is fair – a pedant would say, 'how absurd to even consider that the coin is fair – any coin is surely biased to some extent'. And of course I would agree. So will pedants kindly understand H0 as meaning 'the coin is fair to within one part in a thousand, i.e., p ∈ 0.5 ± 0.001'.

The likelihood ratio is:

   P(D | H1)/P(D | H0) = [140! 110!/251!] / (1/2^250) = 0.48.            (3.45)

Thus the data give scarcely any evidence either way; in fact they give weak evidence (two to one) in favour of H0!

'No, no', objects the believer in bias, 'your silly uniform prior doesn't represent my prior beliefs about the bias of biased coins – I was expecting only a small bias'. To be as generous as possible to H1, let's see how well it could fare if the prior were presciently set. Let us allow a prior of the form

   P(p | H1, α) = (1/Z(α)) p^(α−1) (1 − p)^(α−1),   where Z(α) = Γ(α)^2/Γ(2α)   (3.46)

(a Beta distribution, with the original uniform prior reproduced by setting α = 1).


Figure 3.10. The probability distribution of the number of heads given the two hypotheses, that the coin is fair, and that it is biased, with the prior distribution of the bias being uniform. The outcome (D = 140 heads) gives weak evidence in favour of H0 , the hypothesis that the coin is fair.


By tweaking α, the likelihood ratio for H1 over H0,

   P(D | H1, α)/P(D | H0) = [Γ(140+α) Γ(110+α) Γ(2α) 2^250] / [Γ(250+2α) Γ(α)^2],   (3.47)

can be increased a little. It is shown for several values of α in figure 3.11. Even the most favourable choice of α (α ≃ 50) can yield a likelihood ratio of only two to one in favour of H1.

In conclusion, the data are not 'very suspicious'. They can be construed as giving at most two-to-one evidence in favour of one or other of the two hypotheses.

Are these wimpy likelihood ratios the fault of over-restrictive priors? Is there any way of producing a 'very suspicious' conclusion? The prior that is best-matched to the data, in terms of likelihood, is the prior that sets p to f ≡ 140/250 with probability one. Let's call this model H∗. The likelihood ratio is P(D | H∗)/P(D | H0) = 2^250 f^140 (1 − f)^110 = 6.1. So the strongest evidence that these data can possibly muster against the hypothesis that there is no bias is six-to-one.
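The numbers in this solution are easy to reproduce. The Python sketch below (mine, not the book's, using math.lgamma for the Gamma functions) evaluates equations (3.45) and (3.47) and the point-prior ratio for H∗; it recovers roughly 0.48, the entries of figure 3.11, and roughly 6.1.

    from math import lgamma, log, exp

    F, heads, tails = 250, 140, 110

    # Equation (3.45): uniform prior on the bias p under H1.
    print(exp(lgamma(heads + 1) + lgamma(tails + 1) - lgamma(F + 2) + F * log(2)))  # ≈ 0.48

    # Equation (3.47): symmetric Beta(alpha, alpha) prior under H1.
    def likelihood_ratio(alpha):
        return exp(lgamma(heads + alpha) + lgamma(tails + alpha) + lgamma(2 * alpha)
                   - lgamma(F + 2 * alpha) - 2 * lgamma(alpha) + F * log(2))

    for alpha in (0.37, 1.0, 2.7, 7.4, 20, 55, 148, 403, 1096):
        print(f'alpha = {alpha:6}  ratio = {likelihood_ratio(alpha):.2f}')

    # The best-matched point prior H*, which sets p = f = 140/250:
    f = heads / F
    print(exp(F * log(2) + heads * log(f) + tails * log(1 - f)))   # ≈ 6.1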

While we are noticing the absurdly misleading answers that ‘sampling theory’ statistics produces, such as the p-value of 7% in the exercise we just solved, let’s stick the boot in. If we make a tiny change to the data set, increasing the number of heads in 250 tosses from 140 to 141, we find that the p-value goes below the mystical value of 0.05 (the p-value is 0.0497). The sampling theory statistician would happily squeak ‘the probability of getting a result as extreme as 141 heads is smaller than 0.05 – we thus reject the null hypothesis at a significance level of 5%’. The correct answer is shown for several values of α in figure 3.12. The values worth highlighting from this table are, first, the likelihood ratio when H1 uses the standard uniform prior, which is 1:0.61 in favour of the null hypothesis H0 . Second, the most favourable choice of α, from the point of view of H1 , can only yield a likelihood ratio of about 2.3:1 in favour of H1 . Be warned! A p-value of 0.05 is often interpreted as implying that the odds are stacked about twenty-to-one against the null hypothesis. But the truth in this case is that the evidence either slightly favours the null hypothesis, or disfavours it by at most 2.3 to one, depending on the choice of prior. The p-values and ‘significance levels’ of classical statistics should be treated with extreme caution. Shun them! Here ends the sermon.

Figure 3.11. Likelihood ratio for various choices of the prior distribution's hyperparameter α.

   α       P(D|H1, α)/P(D|H0)
   0.37    0.25
   1.0     0.48
   2.7     0.82
   7.4     1.3
   20      1.8
   55      1.9
   148     1.7
   403     1.3
   1096    1.1

Figure 3.12. Likelihood ratio for various choices of the prior distribution's hyperparameter α, when the data are D′ = 141 heads in 250 trials.

   α       P(D′|H1, α)/P(D′|H0)
   0.37    0.32
   1.0     0.61
   2.7     1.0
   7.4     1.6
   20      2.2
   55      2.3
   148     1.9
   403     1.4
   1096    1.2


Part I

Data Compression


About Chapter 4

In this chapter we discuss how to measure the information content of the outcome of a random experiment.

This chapter has some tough bits. If you find the mathematical details hard, skim through them and keep going – you'll be able to enjoy Chapters 5 and 6 without this chapter's tools.

Before reading Chapter 4, you should have read Chapter 2 and worked on exercises 2.21–2.25 and 2.16 (pp.36–37), and exercise 4.1 below.

The following exercise is intended to help you think about how to measure information content.

Exercise 4.1.[2, p.69] – Please work on this problem before reading Chapter 4.

You are given 12 balls, all equal in weight except for one that is either heavier or lighter. You are also given a two-pan balance to use. In each use of the balance you may put any number of the 12 balls on the left pan, and the same number on the right pan, and push a button to initiate the weighing; there are three possible outcomes: either the weights are equal, or the balls on the left are heavier, or the balls on the left are lighter. Your task is to design a strategy to determine which is the odd ball and whether it is heavier or lighter than the others in as few uses of the balance as possible.

While thinking about this problem, you may find it helpful to consider the following questions:

(a) How can one measure information?

(b) When you have identified the odd ball and whether it is heavy or light, how much information have you gained?

(c) Once you have designed a strategy, draw a tree showing, for each of the possible outcomes of a weighing, what weighing you perform next. At each node in the tree, how much information have the outcomes so far given you, and how much information remains to be gained?

(d) How much information is gained when you learn (i) the state of a flipped coin; (ii) the states of two flipped coins; (iii) the outcome when a four-sided die is rolled?

(e) How much information is gained on the first step of the weighing problem if 6 balls are weighed against the other 6? How much is gained if 4 are weighed against 4 on the first step, leaving out 4 balls?


Notation

   x ∈ A        x is a member of the set A
   S ⊂ A        S is a subset of the set A
   S ⊆ A        S is a subset of, or equal to, the set A
   V = B ∪ A    V is the union of the sets B and A
   V = B ∩ A    V is the intersection of the sets B and A
   |A|          number of elements in set A


4  The Source Coding Theorem

4.1 How to measure the information content of a random variable?

In the next few chapters, we'll be talking about probability distributions and random variables. Most of the time we can get by with sloppy notation, but occasionally, we will need precise notation. Here is the notation that we established in Chapter 2.

An ensemble X is a triple (x, AX, PX), where the outcome x is the value of a random variable, which takes on one of a set of possible values, AX = {a1, a2, . . . , ai, . . . , aI}, having probabilities PX = {p1, p2, . . . , pI}, with P(x = ai) = pi, pi ≥ 0 and Σ_{ai ∈ AX} P(x = ai) = 1.

How can we measure the information content of an outcome x = ai from such an ensemble? In this chapter we examine the assertions

1. that the Shannon information content,

   h(x = ai) ≡ log2 (1/pi),                                              (4.1)

is a sensible measure of the information content of the outcome x = ai, and

2. that the entropy of the ensemble,

   H(X) = Σ_i pi log2 (1/pi),                                            (4.2)

is a sensible measure of the ensemble's average information content.

   p       h(p)    H2(p)
   0.001   10.0    0.011
   0.01     6.6    0.081
   0.1      3.3    0.47
   0.2      2.3    0.72
   0.5      1.0    1.0

Figure 4.1. The Shannon information content h(p) = log2(1/p) and the binary entropy function H2(p) = H(p, 1−p) = p log2(1/p) + (1 − p) log2(1/(1−p)), as a function of p.


Figure 4.1 shows the Shannon information content of an outcome with probability p, as a function of p. The less probable an outcome is, the greater its Shannon information content. Figure 4.1 also shows the binary entropy function,

   H2(p) = H(p, 1−p) = p log2 (1/p) + (1 − p) log2 (1/(1 − p)),          (4.3)

which is the entropy of the ensemble X whose alphabet and probability distribution are AX = {a, b}, PX = {p, (1 − p)}.
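The two functions plotted in figure 4.1 are one-liners; here is a small Python sketch (mine, not from the book) that reproduces the table of values shown alongside the figure.

    from math import log2

    def h(p):
        """Shannon information content of an outcome of probability p, in bits (4.1)."""
        return log2(1 / p)

    def H2(p):
        """Binary entropy function H2(p) = H(p, 1-p), equation (4.3)."""
        return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

    for p in (0.001, 0.01, 0.1, 0.2, 0.5):
        print(f'p = {p:<5}  h(p) = {h(p):4.1f}  H2(p) = {H2(p):5.3f}')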


Information content of independent random variables

Why should log 1/pi have anything to do with the information content? Why not some other function of pi? We'll explore this question in detail shortly, but first, notice a nice property of this particular function h(x) = log 1/p(x). Imagine learning the value of two independent random variables, x and y. The definition of independence is that the probability distribution is separable into a product:

   P(x, y) = P(x)P(y).                                                   (4.4)

Intuitively, we might want any measure of the 'amount of information gained' to have the property of additivity – that is, for independent random variables x and y, the information gained when we learn x and y should equal the sum of the information gained if x alone were learned and the information gained if y alone were learned.

The Shannon information content of the outcome x, y is

   h(x, y) = log [1/P(x, y)] = log [1/(P(x)P(y))] = log [1/P(x)] + log [1/P(y)]   (4.5)

so it does indeed satisfy

   h(x, y) = h(x) + h(y), if x and y are independent.                    (4.6)

Exercise 4.2.[1, p.86] Show that, if x and y are independent, the entropy of the outcome x, y satisfies

   H(X, Y) = H(X) + H(Y).                                                (4.7)

In words, entropy is additive for independent variables. We now explore these ideas with some examples; then, in section 4.4 and in Chapters 5 and 6, we prove that the Shannon information content and the entropy are related to the number of bits needed to describe the outcome of an experiment.

The weighing problem: designing informative experiments

Have you solved the weighing problem (exercise 4.1, p.66) yet? Are you sure? Notice that in three uses of the balance – which reads either 'left heavier', 'right heavier', or 'balanced' – the number of conceivable outcomes is 3^3 = 27, whereas the number of possible states of the world is 24: the odd ball could be any of twelve balls, and it could be heavy or light. So in principle, the problem might be solvable in three weighings – but not in two, since 3^2 < 24.

If you know how you can determine the odd weight and whether it is heavy or light in three weighings, then you may read on. If you haven't found a strategy that always gets there in three weighings, I encourage you to think about exercise 4.1 some more.

Why is your strategy optimal? What is it about your series of weighings that allows useful information to be gained as quickly as possible? The answer is that at each step of an optimal procedure, the three outcomes ('left heavier', 'right heavier', and 'balance') are as close as possible to equiprobable. An optimal solution is shown in figure 4.2.


[Figure 4.2, the weighing-strategy tree, appears here.] Figure 4.2. An optimal solution to the weighing problem. At each step there are two boxes: the left box shows which hypotheses are still possible; the right box shows the balls involved in the next weighing. The 24 hypotheses are written 1+, . . . , 12−, with, e.g., 1+ denoting that 1 is the odd ball and it is heavy. Weighings are written by listing the names of the balls on the two pans, separated by a line; for example, in the first weighing, balls 1, 2, 3, and 4 are put on the left-hand side and 5, 6, 7, and 8 on the right. In each triplet of arrows the upper arrow leads to the situation when the left side is heavier, the middle arrow to the situation when the right side is heavier, and the lower arrow to the situation when the outcome is balanced. The three points labelled ? correspond to impossible outcomes.


Suboptimal strategies, such as weighing balls 1–6 against 7–12 on the first step, do not achieve all outcomes with equal probability: these two sets of balls can never balance, so the only possible outcomes are 'left heavy' and 'right heavy'. Such a binary outcome rules out only half of the possible hypotheses, so a strategy that uses such outcomes must sometimes take longer to find the right answer.

The insight that the outcomes should be as near as possible to equiprobable makes it easier to search for an optimal strategy. The first weighing must divide the 24 possible hypotheses into three groups of eight. Then the second weighing must be chosen so that there is a 3:3:2 split of the hypotheses.

Thus we might conclude: the outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform. This conclusion agrees with the property of the entropy that you proved when you solved exercise 2.25 (p.37): the entropy of an ensemble X is biggest if all the outcomes have equal probability pi = 1/|AX|.
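It is easy to check by brute force that the first weighing of figure 4.2 splits the 24 hypotheses into three groups of eight, whereas weighing six balls against six does not. The Python sketch below (mine, not the book's) counts the hypotheses consistent with each balance reading and computes the entropy of the outcome distribution.

    from collections import Counter
    from math import log2

    def outcome(left, right, odd_ball, heavy):
        """Balance reading when odd_ball is heavier (heavy=True) or lighter."""
        if odd_ball in left:
            return 'left heavy' if heavy else 'right heavy'
        if odd_ball in right:
            return 'right heavy' if heavy else 'left heavy'
        return 'balanced'

    hypotheses = [(b, h) for b in range(1, 13) for h in (True, False)]  # 24 states

    for left, right in [({1, 2, 3, 4}, {5, 6, 7, 8}),          # first weighing of figure 4.2
                        (set(range(1, 7)), set(range(7, 13)))]:  # balls 1-6 against 7-12
        counts = Counter(outcome(left, right, b, h) for b, h in hypotheses)
        entropy = -sum((c / 24) * log2(c / 24) for c in counts.values())
        print(dict(counts), f'entropy = {entropy:.2f} bits')

The first weighing gives an 8/8/8 split and about 1.58 bits per outcome; the 6-versus-6 weighing gives only a 12/12 split and 1 bit.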

Guessing games

In the game of twenty questions, one player thinks of an object, and the other player attempts to guess what the object is by asking questions that have yes/no answers, for example, 'is it alive?', or 'is it human?' The aim is to identify the object with as few questions as possible. What is the best strategy for playing this game? For simplicity, imagine that we are playing the rather dull version of twenty questions called 'sixty-three'.

Example 4.3. The game 'sixty-three'. What's the smallest number of yes/no questions needed to identify an integer x between 0 and 63?

Intuitively, the best questions successively divide the 64 possibilities into equal sized sets. Six questions suffice. One reasonable strategy asks the following questions:

   1: is x ≥ 32?
   2: is x mod 32 ≥ 16?
   3: is x mod 16 ≥ 8?
   4: is x mod 8 ≥ 4?
   5: is x mod 4 ≥ 2?
   6: is x mod 2 = 1?

[The notation x mod 32, pronounced 'x modulo 32', denotes the remainder when x is divided by 32; for example, 35 mod 32 = 3 and 32 mod 32 = 0.]

The answers to these questions, if translated from {yes, no} to {1, 0}, give the binary expansion of x, for example 35 ⇒ 100011.
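The six answers are exactly a binary expansion, which the following Python sketch (mine) makes explicit.

    def sixty_three_answers(x):
        """Answers (1 = yes, 0 = no) to the six questions of example 4.3."""
        assert 0 <= x <= 63
        # for 0 <= x <= 63 the first test, x % 64 >= 32, is the same as x >= 32
        return [1 if x % (2 * k) >= k else 0 for k in (32, 16, 8, 4, 2, 1)]

    print(sixty_three_answers(35))   # [1, 0, 0, 0, 1, 1] -- the binary expansion 100011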

What are the Shannon information contents of the outcomes in this example? If we assume that all values of x are equally likely, then the answers to the questions are independent and each has Shannon information content log2(1/0.5) = 1 bit; the total Shannon information gained is always six bits. Furthermore, the number x that we learn from these questions is a six-bit binary number. Our questioning strategy defines a way of encoding the random variable x as a binary file.

So far, the Shannon information content makes sense: it measures the length of a binary file that encodes x. However, we have not yet studied ensembles where the outcomes have unequal probabilities. Does the Shannon information content make sense there too?


[Figure 4.3's board pictures appear here; the figure's accompanying table reads as follows.]

   move #        1        2        32       48       49
   question      G3       B1       E5       F3       H3
   outcome       x = n    x = n    x = n    x = n    x = y
   P(x)          63/64    62/63    32/33    16/17    1/16
   h(x)          0.0227   0.0230   0.0443   0.0874   4.0
   Total info.   0.0227   0.0458   1.0      2.0      6.0

Figure 4.3. A game of submarine. The submarine is hit on the 49th attempt.

The game of submarine: how many bits can one bit convey?

In the game of battleships, each player hides a fleet of ships in a sea represented by a square grid. On each turn, one player attempts to hit the other's ships by firing at one square in the opponent's sea. The response to a selected square such as 'G3' is either 'miss', 'hit', or 'hit and destroyed'.

In a boring version of battleships called submarine, each player hides just one submarine in one square of an eight-by-eight grid. Figure 4.3 shows a few pictures of this game in progress: the circle represents the square that is being fired at, and the ×s show squares in which the outcome was a miss, x = n; the submarine is hit (outcome x = y) on the 49th attempt.

Each shot made by a player defines an ensemble. The two possible outcomes are {y, n}, corresponding to a hit and a miss, and their probabilities depend on the state of the board. At the beginning, P(y) = 1/64 and P(n) = 63/64. At the second shot, if the first shot missed, P(y) = 1/63 and P(n) = 62/63. At the third shot, if the first two shots missed, P(y) = 1/62 and P(n) = 61/62.

The Shannon information gained from an outcome x is h(x) = log(1/P(x)). If we are lucky, and hit the submarine on the first shot, then

   h(x) = h^(1)(y) = log2 64 = 6 bits.                                   (4.8)

Now, it might seem a little strange that one binary outcome can convey six bits. But we have learnt the hiding place, which could have been any of 64 squares; so we have, by one lucky binary question, indeed learnt six bits.

What if the first shot misses? The Shannon information that we gain from this outcome is

   h(x) = h^(1)(n) = log2 (64/63) = 0.0227 bits.                         (4.9)

Does this make sense? It is not so obvious. Let's keep going. If our second shot also misses, the Shannon information content of the second outcome is

   h^(2)(n) = log2 (63/62) = 0.0230 bits.                                (4.10)

If we miss thirty-two times (firing at a new square each time), the total Shannon information gained is

   log2 (64/63) + log2 (63/62) + · · · + log2 (33/32)
      = 0.0227 + 0.0230 + · · · + 0.0430 = 1.0 bits.                     (4.11)


Why this round number? Well, what have we learnt? We now know that the submarine is not in any of the 32 squares we fired at; learning that fact is just like playing a game of sixty-three (p.70), asking as our first question 'is x one of the thirty-two numbers corresponding to these squares I fired at?', and receiving the answer 'no'. This answer rules out half of the hypotheses, so it gives us one bit.

After 48 unsuccessful shots, the information gained is 2 bits: the unknown location has been narrowed down to one quarter of the original hypothesis space.

What if we hit the submarine on the 49th shot, when there were 16 squares left? The Shannon information content of this outcome is

   h^(49)(y) = log2 16 = 4.0 bits.                                       (4.12)

The total Shannon information content of all the outcomes is

   log2 (64/63) + log2 (63/62) + · · · + log2 (17/16) + log2 (16/1)
      = 0.0227 + 0.0230 + · · · + 0.0874 + 4.0 = 6.0 bits.               (4.13)

So once we know where the submarine is, the total Shannon information content gained is 6 bits. This result holds regardless of when we hit the submarine. If we hit it when there are n squares left to choose from – n was 16 in equation (4.13) – then the total information gained is:

   log2 (64/63) + log2 (63/62) + · · · + log2 ((n+1)/n) + log2 (n/1)
      = log2 [(64/63) × (63/62) × · · · × ((n+1)/n) × (n/1)] = log2 (64/1) = 6 bits.   (4.14)
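Equation (4.14) can be confirmed numerically: however late the hit comes, the Shannon information contents of the misses and the final hit always sum to six bits. A minimal Python sketch (mine, not the book's):

    from math import log2

    def total_information(n_squares=64, hit_on_shot=49):
        """Sum of the Shannon information contents of (hit_on_shot - 1) misses
        followed by a hit, as in the game of submarine."""
        remaining = n_squares
        total = 0.0
        for _ in range(hit_on_shot - 1):                  # each miss
            total += log2(remaining / (remaining - 1))
            remaining -= 1
        return total + log2(remaining)                    # the final hit

    for shot in (1, 33, 49, 64):
        print(shot, round(total_information(hit_on_shot=shot), 4))   # 6.0 every time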

What have we learned from the examples so far?

I think the submarine example makes quite a convincing case for the claim that the Shannon information content is a sensible measure of information content. And the game of sixty-three shows that the Shannon information content can be intimately connected to the size of a file that encodes the outcomes of a random experiment, thus suggesting a possible connection to data compression. In case you're not convinced, let's look at one more example.

The Wenglish language

Wenglish is a language similar to English. Wenglish sentences consist of words drawn at random from the Wenglish dictionary, which contains 2^15 = 32,768 words, all of length 5 characters. Each word in the Wenglish dictionary was constructed at random by picking five letters from the probability distribution over a. . .z depicted in figure 2.1. Some entries from the dictionary are shown in alphabetical order in figure 4.4.

Notice that the number of words in the dictionary (32,768) is much smaller than the total number of possible words of length 5 letters, 26^5 ≃ 12,000,000.

Because the probability of the letter z is about 1/1000, only 32 of the words in the dictionary begin with the letter z. In contrast, the probability of the letter a is about 0.0625, and 2048 of the words begin with the letter a. Of those 2048 words, two start az, and 128 start aa.

Figure 4.4. The Wenglish dictionary (some entries, in alphabetical order: 1 aaail, 2 aaaiu, 3 aaald, . . . , 129 abati, . . . , 2047 azpan, 2048 aztdn, . . . , 16 384 odrcr, . . . , 32 737 zatnt, . . . , 32 768 zxast).


Let's imagine that we are reading a Wenglish document, and let's discuss the Shannon information content of the characters as we acquire them. If we are given the text one word at a time, the Shannon information content of each five-character word is log 32,768 = 15 bits, since Wenglish uses all its words with equal probability. The average information content per character is therefore 3 bits.

Now let's look at the information content if we read the document one character at a time. If, say, the first letter of a word is a, the Shannon information content is log 1/0.0625 ≃ 4 bits. If the first letter is z, the Shannon information content is log 1/0.001 ≃ 10 bits. The information content is thus highly variable at the first character. The total information content of the 5 characters in a word, however, is exactly 15 bits; so the letters that follow an initial z have lower average information content per character than the letters that follow an initial a. A rare initial letter such as z indeed conveys more information about what the word is than a common initial letter.

Similarly, in English, if rare characters occur at the start of the word (e.g. xyl...), then often we can identify the whole word immediately; whereas words that start with common characters (e.g. pro...) require more characters before we can identify them.
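The per-word and per-character figures quoted here follow directly from the dictionary size and the letter frequencies; a few lines of Python (mine) confirm them, taking the approximate probabilities 0.0625 and 0.001 for a and z from the text.

    from math import log2

    dictionary_size = 2 ** 15
    print(log2(dictionary_size))        # 15 bits per five-character Wenglish word
    print(log2(dictionary_size) / 5)    # 3 bits per character on average

    p_a, p_z = 0.0625, 0.001            # approximate first-letter probabilities
    print(log2(1 / p_a))                # ≈ 4 bits conveyed by an initial 'a'
    print(log2(1 / p_z))                # ≈ 10 bits conveyed by an initial 'z'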

4.2 Data compression

The preceding examples justify the idea that the Shannon information content of an outcome is a natural measure of its information content. Improbable outcomes do convey more information than probable outcomes. We now discuss the information content of a source by considering how many bits are needed to describe the outcome of an experiment.

If we can show that we can compress data from a particular source into a file of L bits per source symbol and recover the data reliably, then we will say that the average information content of that source is at most L bits per symbol.

Example: compression of text files

A file is composed of a sequence of bytes. A byte is composed of 8 bits and can have a decimal value between 0 and 255. A typical text file is composed of the ASCII character set (decimal values 0 to 127). This character set uses only seven of the eight bits in a byte.

. Exercise 4.4.[1, p.86] By how much could the size of a file be reduced given that it is an ASCII file? How would you achieve this reduction?

Intuitively, it seems reasonable to assert that an ASCII file contains 7/8 as much information as an arbitrary file of the same size, since we already know one out of every eight bits before we even look at the file. This is a simple example of redundancy. Most sources of data have further redundancy: English text files use the ASCII characters with non-equal frequency; certain pairs of letters are more probable than others; and entire words can be predicted given the context and a semantic understanding of the text.

Some simple data compression methods that define measures of information content

One way of measuring the information content of a random variable is simply to count the number of possible outcomes, |AX|. (The number of elements in a set A is denoted by |A|.)

[Here we use the word 'bit' with its meaning, 'a symbol with two values', not to be confused with the unit of information content.]


If we gave a binary name to each outcome, the length of each name would be log2 |AX| bits, if |AX| happened to be a power of 2. We thus make the following definition.

The raw bit content of X is

   H0(X) = log2 |AX|.                                                    (4.15)

H0(X) is a lower bound for the number of binary questions that are always guaranteed to identify an outcome from the ensemble X. It is an additive quantity: the raw bit content of an ordered pair x, y, having |AX||AY| possible outcomes, satisfies

   H0(X, Y) = H0(X) + H0(Y).                                             (4.16)

This measure of information content does not include any probabilistic element, and the encoding rule it corresponds to does not 'compress' the source data, it simply maps each outcome to a constant-length binary string.

Exercise 4.5.[2, p.86] Could there be a compressor that maps an outcome x to a binary code c(x), and a decompressor that maps c back to x, such that every possible outcome is compressed into a binary code of length shorter than H0(X) bits?

Even though a simple counting argument shows that it is impossible to make a reversible compression program that reduces the size of all files, amateur compression enthusiasts frequently announce that they have invented a program that can do this – indeed that they can further compress compressed files by putting them through their compressor several times. Stranger yet, patents have been granted to these modern-day alchemists. See the comp.compression frequently asked questions for further reading.[1]

There are only two ways in which a 'compressor' can actually compress files:

1. A lossy compressor compresses some files, but maps some files to the same encoding. We'll assume that the user requires perfect recovery of the source file, so the occurrence of one of these confusable files leads to a failure (though in applications such as image compression, lossy compression is viewed as satisfactory). We'll denote by δ the probability that the source string is one of the confusable files, so a lossy compressor has a probability δ of failure. If δ can be made very small then a lossy compressor may be practically useful.

2. A lossless compressor maps all files to different encodings; if it shortens some files, it necessarily makes others longer. We try to design the compressor so that the probability that a file is lengthened is very small, and the probability that it is shortened is large.

In this chapter we discuss a simple lossy compressor. In subsequent chapters we discuss lossless compression methods.

[1] http://sunsite.org.uk/public/usenet/news-faqs/comp.compression/

4.3 Information content defined in terms of lossy compression

Whichever type of compressor we construct, we need somehow to take into account the probabilities of the different outcomes. Imagine comparing the information contents of two text files – one in which all 128 ASCII characters


are used with equal probability, and one in which the characters are used with their frequencies in English text. Can we define a measure of information content that distinguishes between these two files? Intuitively, the latter file contains less information per character because it is more predictable. One simple way to use our knowledge that some symbols have a smaller probability is to imagine recoding the observations into a smaller alphabet – thus losing the ability to encode some of the more improbable symbols – and then measuring the raw bit content of the new alphabet. For example, we might take a risk when compressing English text, guessing that the most infrequent characters won’t occur, and make a reduced ASCII code that omits the characters { !, @, #, %, ^, *, ~, <, >, /, \, _, {, }, [, ], | }, thereby reducing the size of the alphabet by seventeen. The larger the risk we are willing to take, the smaller our final alphabet becomes. We introduce a parameter δ that describes the risk we are taking when using this compression method: δ is the probability that there will be no name for an outcome x. Example 4.6. Let AX = { a, b, c, d, e, f, g, h }, (4.17) 3 1 1 1 1 and PX = { 14 , 14 , 14 , 16 , 64 , 64 , 64 , 64 }. The raw bit content of this ensemble is 3 bits, corresponding to 8 binary names. But notice that P (x ∈ {a, b, c, d}) = 15/16. So if we are willing to run a risk of δ = 1/16 of not having a name for x, then we can get by with four names – half as many names as are needed if every x ∈ A X has a name. Table 4.5 shows binary names that could be given to the different outcomes in the cases δ = 0 and δ = 1/16. When δ = 0 we need 3 bits to encode the outcome; when δ = 1/16 we need only 2 bits. Let us now formalize this idea. To make a compression strategy with risk δ, we make the smallest possible subset S δ such that the probability that x is not in Sδ is less than or equal to δ, i.e., P (x 6∈ S δ ) ≤ δ. For each value of δ we can then define a new measure of information content – the log of the size of this smallest subset Sδ . [In ensembles in which several elements have the same probability, there may be several smallest subsets that contain different elements, but all that matters is their sizes (which are equal), so we will not dwell on this ambiguity.] The smallest δ-sufficient subset Sδ is the smallest subset of AX satisfying P (x ∈ Sδ ) ≥ 1 − δ.

(4.18)

The subset Sδ can be constructed by ranking the elements of A X in order of decreasing probability and adding successive elements starting from the most probable elements until the total probability is ≥ (1−δ). We can make a data compression code by assigning a binary name to each element of the smallest sufficient subset. This compression scheme motivates the following measure of information content: The essential bit content of X is: Hδ (X) = log 2 |Sδ |.

(4.19)

Note that H0 (X) is the special case of Hδ (X) with δ = 0 (if P (x) > 0 for all x ∈ AX ). [Caution: do not confuse H0 (X) and Hδ (X) with the function H2 (p) displayed in figure 4.1.] Figure 4.6 shows Hδ (X) for the ensemble of example 4.6 as a function of δ.
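The greedy construction of Sδ lends itself to a few lines of code. The following Python sketch (an illustrative aside; the function name is not from the text) reproduces H0(X) = 3 bits and H1/16(X) = 2 bits for the ensemble of example 4.6.

```python
from math import log2

def essential_bit_content(probs, delta):
    """H_delta(X) = log2 |S_delta|, where S_delta is the smallest subset
    with probability >= 1 - delta, built by ranking outcomes by probability."""
    ranked = sorted(probs, reverse=True)
    total, size = 0.0, 0
    for p in ranked:
        if total >= 1 - delta:       # S_delta already covers probability 1 - delta
            break
        total += p
        size += 1
    return log2(size)

# the ensemble of example 4.6
p = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]
print(essential_bit_content(p, 0))       # 3.0
print(essential_bit_content(p, 1/16))    # 2.0
```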

Table 4.5. Binary names for the outcomes, for two failure probabilities δ.

      δ = 0            δ = 1/16
    x    c(x)        x    c(x)
    a    000         a    00
    b    001         b    01
    c    010         c    10
    d    011         d    11
    e    100         e    −
    f    101         f    −
    g    110         g    −
    h    111         h    −


Extended ensembles

Is this compression method any more useful if we compress blocks of symbols from a source? We now turn to examples where the outcome x = (x1, x2, . . . , xN) is a string of N independent identically distributed random variables from a single ensemble X. We will denote by X^N the ensemble (X1, X2, . . . , XN). Remember that entropy is additive for independent variables (exercise 4.2 (p.68)), so H(X^N) = N H(X).

Example 4.7. Consider a string of N flips of a bent coin, x = (x1, x2, . . . , xN), where xn ∈ {0, 1}, with probabilities p0 = 0.9, p1 = 0.1. The most probable strings x are those with most 0s. If r(x) is the number of 1s in x then

    P(x) = p0^{N−r(x)} p1^{r(x)}.    (4.20)

To evaluate Hδ(X^N) we must find the smallest sufficient subset Sδ. This subset will contain all x with r(x) = 0, 1, 2, . . . , up to some rmax(δ) − 1, and some of the x with r(x) = rmax(δ). Figures 4.7 and 4.8 show graphs of Hδ(X^N) against δ for the cases N = 4 and N = 10. The steps are the values of δ at which |Sδ| changes by 1, and the cusps where the slope of the staircase changes are the points where rmax changes by 1.

Exercise 4.8.[2, p.86] What are the mathematical shapes of the curves between the cusps?

For the examples shown in figures 4.6–4.8, Hδ(X^N) depends strongly on the value of δ, so it might not seem a fundamental or useful definition of information content. But we will consider what happens as N, the number of independent variables in X^N, increases. We will find the remarkable result that Hδ(X^N) becomes almost independent of δ – and for all δ it is very close to N H(X), where H(X) is the entropy of one of the random variables.
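The staircases of figures 4.7 and 4.8 can be reproduced by brute-force enumeration of the extended ensemble, reusing the same greedy construction of Sδ. A rough Python sketch, practical only for small N (the parameter values follow example 4.7):

```python
from itertools import product
from math import log2

def H_delta_extended(N, p1, delta):
    """Essential bit content H_delta(X^N) for N i.i.d. flips of a bent coin,
    by explicit enumeration of all 2^N strings (feasible only for small N)."""
    probs = sorted((p1 ** sum(x) * (1 - p1) ** (N - sum(x))
                    for x in product((0, 1), repeat=N)), reverse=True)
    total, size = 0.0, 0
    for p in probs:
        if total >= 1 - delta:      # S_delta already covers probability 1 - delta
            break
        total += p
        size += 1
    return log2(size)

N, p1 = 4, 0.1
for delta in (0.0, 0.05, 0.2, 0.5):
    print(delta, round(H_delta_extended(N, p1, delta), 3))
# H_delta(X^4) falls in steps from 4 bits at delta = 0, as in figure 4.7
```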

Figure 4.6. (a) The outcomes of X (from example 4.6 (p.75)), ranked by their probability. (b) The essential bit content Hδ (X). The labels on the graph show the smallest sufficient set as a function of δ. Note H0 (X) = 3 bits and H1/16 (X) = 2 bits.


Figure 4.7. (a) The sixteen outcomes of the ensemble X^4 with p1 = 0.1, ranked by probability. (b) The essential bit content Hδ(X^4). The upper schematic diagram indicates the strings' probabilities by the vertical lines' lengths (not to scale).


Figure 4.8. Hδ (X N ) for N = 10 binary variables with p1 = 0.1.


Figure 4.9. (1/N)Hδ(X^N) for N = 10, 210, . . . , 1010 binary variables with p1 = 0.1.



Figure 4.9 illustrates this asymptotic tendency for the binary ensemble of example 4.7. As N increases, (1/N)Hδ(X^N) becomes an increasingly flat function, except for tails close to δ = 0 and 1. As long as we are allowed a tiny probability of error δ, compression down to N H bits is possible. Even if we are allowed a large probability of error, we still can compress only down to N H bits. This is the source coding theorem.

Theorem 4.1 Shannon's source coding theorem. Let X be an ensemble with entropy H(X) = H bits. Given ε > 0 and 0 < δ < 1, there exists a positive integer N0 such that for N > N0,

    | (1/N)Hδ(X^N) − H | < ε.    (4.21)
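The convergence promised by the theorem can be checked numerically without enumerating all 2^N strings, because all strings with the same number of 1s share the same probability and can be counted shell by shell. The following Python sketch is an illustrative counting routine for the bent-coin ensemble (it assumes p1 < 0.5):

```python
from math import ceil, comb, log2

def H_delta_per_symbol(N, p1, delta):
    """(1/N) H_delta(X^N) for N flips of a bent coin, assuming p1 < 0.5.
    Strings are added to S_delta shell by shell in r, the number of 1s."""
    target = 1.0 - delta
    cum_prob, size = 0.0, 0
    for r in range(N + 1):
        p_single = p1 ** r * (1 - p1) ** (N - r)   # probability of one string with r ones
        shell = comb(N, r)                         # number of such strings
        if cum_prob + shell * p_single >= target:
            size += max(ceil((target - cum_prob) / p_single), 1)   # partial shell
            break
        cum_prob += shell * p_single
        size += shell
    return log2(size) / N

H = -(0.1 * log2(0.1) + 0.9 * log2(0.9))           # entropy of one flip, about 0.469 bits
for N in (10, 100, 1000):
    print(N, round(H_delta_per_symbol(N, 0.1, 0.1), 3))
# the values decrease slowly towards H as N grows, as in figure 4.9
print('H =', round(H, 3))
```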

4.4 Typicality

Why does increasing N help? Let's examine long strings from X^N. Table 4.10 shows fifteen samples from X^N for N = 100 and p1 = 0.1. The probability of a string x that contains r 1s and N − r 0s is

    P(x) = p1^r (1 − p1)^{N−r}.    (4.22)

The number of strings that contain r 1s is

    n(r) = (N choose r).    (4.23)

So the number of 1s, r, has a binomial distribution:

    P(r) = (N choose r) p1^r (1 − p1)^{N−r}.    (4.24)

These functions are shown in figure 4.11. The mean of r is N p1, and its standard deviation is √(N p1(1 − p1)) (p.1). If N is 100 then

    r ∼ N p1 ± √(N p1(1 − p1)) ≃ 10 ± 3.    (4.25)
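As a small check of equation (4.25), the mean and standard deviation of r can be computed from the exact binomial probabilities (an illustrative aside):

```python
from math import comb, sqrt

N, p1 = 100, 0.1
probs = [comb(N, r) * p1**r * (1 - p1)**(N - r) for r in range(N + 1)]
mean = sum(r * p for r, p in enumerate(probs))
std = sqrt(sum((r - mean)**2 * p for r, p in enumerate(probs)))
print(mean, std)     # approximately 10 and 3, i.e. r ~ 10 +/- 3
```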

Figure 4.10. The top 15 strings are samples from X^100, where p1 = 0.1 and p0 = 0.9. The bottom two are the most and least probable strings in this ensemble. The final column shows the log-probabilities of the random strings, which may be compared with the entropy H(X^100) = 46.9 bits.


Figure 4.11. Anatomy of the typical set T. For p1 = 0.1 and N = 100 and N = 1000, these graphs show n(r), the number of strings containing r 1s; the probability P(x) of a single string that contains r 1s; the same probability on a log scale; and the total probability n(r)P(x) of all strings that contain r 1s. The number r is on the horizontal axis. The plot of log2 P(x) also shows by a dotted line the mean value of log2 P(x) = −N H2(p1), which equals −46.9 when N = 100 and −469 when N = 1000. The typical set includes only the strings that have log2 P(x) close to this value. The range marked T shows the set TNβ (as defined in section 4.4) for N = 100 and β = 0.29 (left) and N = 1000, β = 0.09 (right).


If N = 1000 then

    r ∼ 100 ± 10.    (4.26)

Notice that as N gets bigger, the probability distribution of r becomes more concentrated, in the sense that while the range of possible values of r grows as N, the standard deviation of r grows only as √N. That r is most likely to fall in a small range of values implies that the outcome x is also most likely to fall in a corresponding small subset of outcomes that we will call the typical set.

Definition of the typical set

Let us define typicality for an arbitrary ensemble X with alphabet AX. Our definition of a typical string will involve the string's probability. A long string of N symbols will usually contain about p1N occurrences of the first symbol, p2N occurrences of the second, etc. Hence the probability of this string is roughly

    P(x)typ = P(x1)P(x2)P(x3) . . . P(xN) ≃ p1^{p1N} p2^{p2N} . . . pI^{pIN},    (4.27)

so that the information content of a typical string is

    log2 1/P(x) ≃ N Σ_i pi log2 1/pi = N H.    (4.28)

So the random variable log2 1/P(x), which is the information content of x, is very likely to be close in value to N H. We build our definition of typicality on this observation.

We define the typical elements of AX^N to be those elements that have probability close to 2^{−NH}. (Note that the typical set, unlike the smallest sufficient subset, does not include the most probable elements of AX^N, but we will show that these most probable elements contribute negligible probability.)

We introduce a parameter β that defines how close the probability has to be to 2^{−NH} for an element to be 'typical'. We call the set of typical elements the typical set, TNβ:

    TNβ ≡ { x ∈ AX^N : | (1/N) log2 1/P(x) − H | < β }.    (4.29)

We will show that whatever value of β we choose, the typical set contains almost all the probability as N increases. This important result is sometimes called the 'asymptotic equipartition' principle.

'Asymptotic equipartition' principle. For an ensemble of N independent identically distributed (i.i.d.) random variables X^N ≡ (X1, X2, . . . , XN), with N sufficiently large, the outcome x = (x1, x2, . . . , xN) is almost certain to belong to a subset of AX^N having only 2^{NH(X)} members, each having probability 'close to' 2^{−NH(X)}.

Notice that if H(X) < H0(X) then 2^{NH(X)} is a tiny fraction of the number of possible outcomes |AX^N| = |AX|^N = 2^{NH0(X)}.

The term equipartition is chosen to describe the idea that the members of the typical set have roughly equal probability. [This should not be taken too literally, hence my use of quotes around 'asymptotic equipartition'; see page 83.] A second meaning for equipartition, in thermal physics, is the idea that each degree of freedom of a classical system has equal average energy, (1/2)kT. This second meaning is not intended here.
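The claim that the typical set captures almost all of the probability is easy to probe by simulation. The rough Python sketch below (the value of β and the trial counts are arbitrary illustrative choices) draws bent-coin strings and estimates P(x ∈ TNβ) for increasing N:

```python
import random
from math import log2

p1, beta, trials = 0.1, 0.05, 1000
H = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1))      # entropy of one flip

def is_typical(r, N):
    """Is a string with r ones typical, i.e. |(1/N) log2(1/P(x)) - H| < beta?"""
    info_per_symbol = -(r * log2(p1) + (N - r) * log2(1 - p1)) / N
    return abs(info_per_symbol - H) < beta

random.seed(0)
for N in (100, 1000, 10000):
    hits = 0
    for _ in range(trials):
        r = sum(random.random() < p1 for _ in range(N))   # number of 1s in a sampled string
        hits += is_typical(r, N)
    print(N, hits / trials)   # estimated P(x in T_{N beta}); approaches 1 as N grows
```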


The 'asymptotic equipartition' principle is equivalent to:

Shannon's source coding theorem (verbal statement). N i.i.d. random variables each with entropy H(X) can be compressed into more than N H(X) bits with negligible risk of information loss, as N → ∞; conversely if they are compressed into fewer than N H(X) bits it is virtually certain that information will be lost.

These two theorems are equivalent because we can define a compression algorithm that gives a distinct name of length N H(X) bits to each x in the typical set.

4.5 Proofs

This section may be skipped if found tough going.

The law of large numbers

Our proof of the source coding theorem uses the law of large numbers.

Mean and variance of a real random variable are E[u] = ū = Σ_u P(u) u and var(u) = σ_u² = E[(u − ū)²] = Σ_u P(u) (u − ū)².

Technical note: strictly I am assuming here that u is a function u(x) of a sample x from a finite discrete ensemble X. Then the summations Σ_u P(u) f(u) should be written Σ_x P(x) f(u(x)). This means that P(u) is a finite sum of delta functions. This restriction guarantees that the mean and variance of u do exist, which is not necessarily the case for general P(u).

Chebyshev's inequality 1. Let t be a non-negative real random variable, and let α be a positive real number. Then

    P(t ≥ α) ≤ t̄/α.    (4.30)

Proof: P(t ≥ α) = Σ_{t≥α} P(t). We multiply each term by t/α ≥ 1 and obtain: P(t ≥ α) ≤ Σ_{t≥α} P(t) t/α. We add the (non-negative) missing terms and obtain: P(t ≥ α) ≤ Σ_t P(t) t/α = t̄/α. ✷

Figure 4.12. Schematic diagram showing all strings in the ensemble X^N ranked by their probability, and the typical set TNβ.


Chebyshev's inequality 2. Let x be a random variable, and let α be a positive real number. Then

    P((x − x̄)² ≥ α) ≤ σ_x²/α.    (4.31)

Proof: take t = (x − x̄)² and apply the previous proposition. ✷

Weak law of large numbers. Take x to be the average of N independent random variables h1, . . . , hN, having common mean h̄ and common variance σ_h²: x = (1/N) Σ_{n=1}^{N} h_n. Then

    P((x − h̄)² ≥ α) ≤ σ_h²/(αN).    (4.32)

Proof: obtained by showing that x̄ = h̄ and that σ_x² = σ_h²/N. ✷

We are interested in x being very close to the mean (α very small). No matter how large σ_h² is, and no matter how small the required α is, and no matter how small the desired probability that (x − h̄)² ≥ α, we can always achieve it by taking N large enough.
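A quick empirical check of (4.32): for a simple discrete variable – here uniform on {0, . . . , 9}, an arbitrary choice for illustration – the observed frequency of large deviations of the sample mean can be compared with the Chebyshev bound.

```python
import random

# h uniform on {0,...,9}: mean 4.5, variance (10**2 - 1)/12 = 8.25
mean_h, var_h = 4.5, 8.25
alpha, trials = 0.25, 5000

random.seed(1)
for N in (10, 100, 1000):
    exceed = sum((sum(random.randrange(10) for _ in range(N)) / N - mean_h) ** 2 >= alpha
                 for _ in range(trials))
    bound = var_h / (alpha * N)
    print(N, exceed / trials, '<=', round(bound, 4))
# the observed frequency of large deviations respects the bound sigma_h^2/(alpha N)
```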

Proof of theorem 4.1 (p.78)

We apply the law of large numbers to the random variable (1/N) log2 1/P(x) defined for x drawn from the ensemble X^N. This random variable can be written as the average of N information contents h_n = log2(1/P(x_n)), each of which is a random variable with mean H = H(X) and variance σ² ≡ var[log2(1/P(x_n))]. (Each term h_n is the Shannon information content of the nth outcome.)

We again define the typical set with parameters N and β thus:

    TNβ = { x ∈ AX^N : [ (1/N) log2 1/P(x) − H ]² < β² }.    (4.33)

For all x ∈ TNβ, the probability of x satisfies

    2^{−N(H+β)} < P(x) < 2^{−N(H−β)}.    (4.34)

And by the law of large numbers,

    P(x ∈ TNβ) ≥ 1 − σ²/(β²N).    (4.35)

We have thus proved the 'asymptotic equipartition' principle. As N increases, the probability that x falls in TNβ approaches 1, for any β. How does this result relate to source coding? We must relate TNβ to Hδ(X^N). We will show that for any given δ there is a sufficiently big N such that Hδ(X^N) ≃ N H.

Part 1: (1/N)Hδ(X^N) < H + ε.

The set TNβ is not the best subset for compression. So the size of TNβ gives an upper bound on Hδ. We show how small Hδ(X^N) must be by calculating how big TNβ could possibly be. We are free to set β to any convenient value. The smallest possible probability that a member of TNβ can have is 2^{−N(H+β)}, and the total probability contained by TNβ can't be any bigger than 1. So

    |TNβ| 2^{−N(H+β)} < 1,    (4.36)

that is, the size of the typical set is bounded by

    |TNβ| < 2^{N(H+β)}.    (4.37)

If we set β = ε and N0 such that σ²/(ε²N0) ≤ δ, then P(TNβ) ≥ 1 − δ, and the set TNβ becomes a witness to the fact that Hδ(X^N) ≤ log2 |TNβ| < N(H + ε).

Figure 4.13. Schematic illustration of the two parts of the theorem. Given any δ and ε, we show that for large enough N, (1/N)Hδ(X^N) lies (1) below the line H + ε and (2) above the line H − ε.


Part 2: (1/N)Hδ(X^N) > H − ε.

Imagine that someone claims this second part is not so – that, for any N, the smallest δ-sufficient subset Sδ is smaller than the above inequality would allow. We can make use of our typical set to show that they must be mistaken. Remember that we are free to set β to any value we choose. We will set β = ε/2, so that our task is to prove that a subset S′ having |S′| ≤ 2^{N(H−2β)} and achieving P(x ∈ S′) ≥ 1 − δ cannot exist (for N greater than an N0 that we will specify).

So, let us consider the probability of falling in this rival smaller subset S′. The probability of the subset S′ is

    P(x ∈ S′) = P(x ∈ S′ ∩ TNβ) + P(x ∈ S′ ∩ T̄Nβ),    (4.38)

where T̄Nβ denotes the complement {x ∉ TNβ}. The maximum value of the first term is found if S′ ∩ TNβ contains 2^{N(H−2β)} outcomes all with the maximum probability, 2^{−N(H−β)}. The maximum value the second term can have is P(x ∉ TNβ). So:

    P(x ∈ S′) ≤ 2^{N(H−2β)} 2^{−N(H−β)} + σ²/(β²N) = 2^{−Nβ} + σ²/(β²N).    (4.39)

We can now set β = ε/2 and N0 such that P(x ∈ S′) < 1 − δ, which shows that S′ cannot satisfy the definition of a sufficient subset Sδ. Thus any subset S′ with size |S′| ≤ 2^{N(H−ε)} has probability less than 1 − δ, so by the definition of Hδ, Hδ(X^N) > N(H − ε).

Thus for large enough N, the function (1/N)Hδ(X^N) is essentially a constant function of δ, for 0 < δ < 1, as illustrated in figures 4.9 and 4.13. ✷

4.6 Comments

The source coding theorem (p.78) has two parts, (1/N)Hδ(X^N) < H + ε, and (1/N)Hδ(X^N) > H − ε. Both results are interesting.

The first part tells us that even if the probability of error δ is extremely small, the number of bits per symbol (1/N)Hδ(X^N) needed to specify a long N-symbol string x with vanishingly small error probability does not have to exceed H + ε bits. We need to have only a tiny tolerance for error, and the number of bits required drops significantly from H0(X) to (H + ε).

What happens if we are yet more tolerant to compression errors? Part 2 tells us that even if δ is very close to 1, so that errors are made most of the time, the average number of bits per symbol needed to specify x must still be at least H − ε bits.

These two extremes tell us that regardless of our specific allowance for error, the number of bits per symbol needed to specify x is H bits; no more and no less.

Caveat regarding 'asymptotic equipartition'

I put the words 'asymptotic equipartition' in quotes because it is important not to think that the elements of the typical set TNβ really do have roughly the same probability as each other. They are similar in probability only in the sense that their values of log2 1/P(x) are within 2Nβ of each other. Now, as β is decreased, how does N have to increase, if we are to keep our bound on the mass of the typical set, P(x ∈ TNβ) ≥ 1 − σ²/(β²N), constant? N must grow as 1/β², so, if we write β in terms of N as α/√N, for some constant α, then the most probable string in the typical set will be of order 2^{α√N} times greater than the least probable string in the typical set.



As β decreases, N increases, and this ratio 2^{α√N} grows exponentially. Thus we have 'equipartition' only in a weak sense!

Why did we introduce the typical set?

The best choice of subset for block compression is (by definition) Sδ, not a typical set. So why did we bother introducing the typical set? The answer is, we can count the typical set. We know that all its elements have 'almost identical' probability (2^{−NH}), and we know the whole set has probability almost 1, so the typical set must have roughly 2^{NH} elements. Without the help of the typical set (which is very similar to Sδ) it would have been hard to count how many elements there are in Sδ.

4.7 Exercises

Weighing problems

. Exercise 4.9.[1] While some people, when they first encounter the weighing problem with 12 balls and the three-outcome balance (exercise 4.1 (p.66)), think that weighing six balls against six balls is a good first weighing, others say 'no, weighing six against six conveys no information at all'. Explain to the second group why they are both right and wrong. Compute the information gained about which is the odd ball, and the information gained about which is the odd ball and whether it is heavy or light.

. Exercise 4.10.[2] Solve the weighing problem for the case where there are 39 balls of which one is known to be odd.

. Exercise 4.11.[2] You are given 16 balls, all of which are equal in weight except for one that is either heavier or lighter. You are also given a bizarre two-pan balance that can report only two outcomes: 'the two sides balance' or 'the two sides do not balance'. Design a strategy to determine which is the odd ball in as few uses of the balance as possible.

. Exercise 4.12.[2] You have a two-pan balance; your job is to weigh out bags of flour with integer weights 1 to 40 pounds inclusive. How many weights do you need? [You are allowed to put weights on either pan. You're only allowed to put one flour bag on the balance at a time.]

Exercise 4.13.[4, p.86] (a) Is it possible to solve exercise 4.1 (p.66) (the weighing problem with 12 balls and the three-outcome balance) using a sequence of three fixed weighings, such that the balls chosen for the second weighing do not depend on the outcome of the first, and the third weighing does not depend on the first or second?

(b) Find a solution to the general N-ball weighing problem in which exactly one of N balls is odd. Show that in W weighings, an odd ball can be identified from among N = (3^W − 3)/2 balls.

Exercise 4.14.[3] You are given 12 balls and the three-outcome balance of exercise 4.1; this time, two of the balls are odd; each odd ball may be heavy or light, and we don't know which. We want to identify the odd balls and in which direction they are odd.


(a) Estimate how many weighings are required by the optimal strategy. And what if there are three odd balls?

(b) How do your answers change if it is known that all the regular balls weigh 100 g, that light balls weigh 99 g, and heavy ones weigh 110 g?

Source coding with a lossy compressor, with loss δ

. Exercise 4.15.[2, p.87] Let PX = {0.2, 0.8}. Sketch (1/N)Hδ(X^N) as a function of δ for N = 1, 2 and 1000.

. Exercise 4.16.[2] Let PY = {0.5, 0.5}. Sketch (1/N)Hδ(Y^N) as a function of δ for N = 1, 2, 3 and 100.

. Exercise 4.17.[2, p.87] (For physics students.) Discuss the relationship between the proof of the ‘asymptotic equipartition’ principle and the equivalence (for large systems) of the Boltzmann entropy and the Gibbs entropy.

Distributions that don't obey the law of large numbers

The law of large numbers, which we used in this chapter, shows that the mean of a set of N i.i.d. random variables has a probability distribution that becomes narrower, with width ∝ 1/√N, as N increases. However, we have proved this property only for discrete random variables, that is, for real numbers taking on a finite set of possible values. While many random variables with continuous probability distributions also satisfy the law of large numbers, there are important distributions that do not. Some continuous distributions do not have a mean or variance.

. Exercise 4.18.[3, p.88] Sketch the Cauchy distribution

    P(x) = (1/Z) · 1/(x² + 1),    x ∈ (−∞, ∞).    (4.40)

What is its normalizing constant Z? Can you evaluate its mean or variance? Consider the sum z = x1 + x2, where x1 and x2 are independent random variables from a Cauchy distribution. What is P(z)? What is the probability distribution of the mean of x1 and x2, x̄ = (x1 + x2)/2? What is the probability distribution of the mean of N samples from this Cauchy distribution?

Other asymptotic properties

Exercise 4.19.[3] Chernoff bound. We derived the weak law of large numbers from Chebyshev's inequality (4.30) by letting the random variable t in the inequality P(t ≥ α) ≤ t̄/α be a function, t = (x − x̄)², of the random variable x we were interested in. Other useful inequalities can be obtained by using other functions. The Chernoff bound, which is useful for bounding the tails of a distribution, is obtained by letting t = exp(sx). Show that

    P(x ≥ a) ≤ e^{−sa} g(s),   for any s > 0    (4.41)

and

    P(x ≤ a) ≤ e^{−sa} g(s),   for any s < 0    (4.42)

where g(s) is the moment-generating function of x,

    g(s) = Σ_x P(x) e^{sx}.    (4.43)
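To get a concrete feel for the Chernoff bound, one can compare the optimized bound with an exact binomial tail. The sketch below is an illustrative setup (x is the number of 1s in 100 flips with p1 = 0.1, an arbitrary choice); it evaluates e^{−sa} g(s) over a grid of s > 0 and keeps the tightest value.

```python
from math import comb, exp

n, p, a = 100, 0.1, 20            # x ~ Binomial(n, p); bound the tail P(x >= a)

def g(s):
    """Moment-generating function of a Binomial(n, p) random variable."""
    return (1 - p + p * exp(s)) ** n

exact = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(a, n + 1))
chernoff = min(exp(-s * a) * g(s) for s in (i / 1000 for i in range(1, 3001)))

print(exact, chernoff)            # exact tail <= optimized Chernoff bound
```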

Curious functions related to p log 1/p

Exercise 4.20.[4, p.89] This exercise has no purpose at all; it's included for the enjoyment of those who like mathematical curiosities. Sketch the function

    f(x) = x^{x^{x^{x^{...}}}}    (4.44)

for x ≥ 0. Hint: Work out the inverse function to f – that is, the function g(y) such that if x = g(y) then y = f (x) – it’s closely related to p log 1/p.

4.8 Solutions

Solution to exercise 4.2 (p.68). Let P(x, y) = P(x)P(y). Then

    H(X, Y) = Σ_{xy} P(x)P(y) log 1/(P(x)P(y))    (4.45)
            = Σ_{xy} P(x)P(y) log 1/P(x) + Σ_{xy} P(x)P(y) log 1/P(y)    (4.46)
            = Σ_x P(x) log 1/P(x) + Σ_y P(y) log 1/P(y)    (4.47)
            = H(X) + H(Y).    (4.48)
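The algebra above can be spot-checked numerically for any pair of independent distributions (the two distributions below are arbitrary illustrative choices):

```python
from math import log2

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

PX = [0.5, 0.25, 0.25]
PY = [0.9, 0.1]
PXY = [px * py for px in PX for py in PY]        # joint distribution of independent X, Y

print(entropy(PXY), entropy(PX) + entropy(PY))   # both equal about 1.969 bits
assert abs(entropy(PXY) - (entropy(PX) + entropy(PY))) < 1e-9
```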

Solution to exercise 4.4 (p.73). An ASCII file can be reduced in size by a factor of 7/8. This reduction could be achieved by a block code that maps 8-byte blocks into 7-byte blocks by copying the 56 information-carrying bits into 7 bytes, and ignoring the last bit of every character.

Solution to exercise 4.5 (p.74). The pigeon-hole principle states: you can't put 16 pigeons into 15 holes without using one of the holes twice. Similarly, you can't give |AX| outcomes unique binary names of some length l shorter than log2 |AX| bits, because there are only 2^l such binary names, and l < log2 |AX| implies 2^l < |AX|, so at least two different inputs to the compressor would compress to the same output file.

Solution to exercise 4.8 (p.76). Between the cusps, all the changes in probability are equal, and the number of elements in T changes by one at each step. So Hδ varies logarithmically with (−δ).

Solution to exercise 4.13 (p.84). This solution was found by Dyson and Lyness in 1946 and presented in the following elegant form by John Conway in 1999. Be warned: the symbols A, B, and C are used to name the balls, to name the pans of the balance, to name the outcomes, and to name the possible states of the odd ball!

(a) Label the 12 balls by the sequences

    AAB ABA ABB ABC BBC BCA BCB BCC CAA CAB CAC CCA

and in the 1st, 2nd, and 3rd weighings put


    1st:  AAB ABA ABB ABC  in pan A,   BBC BCA BCB BCC  in pan B;
    2nd:  AAB CAA CAB CAC  in pan A,   ABA ABB ABC BBC  in pan B;
    3rd:  ABA BCA CAA CCA  in pan A,   AAB ABB BCB CAB  in pan B.

Now in a given weighing, a pan will either end up in the

• Canonical position (C) that it assumes when the pans are balanced, or
• Above that position (A), or
• Below it (B),

so the three weighings determine for each pan a sequence of three of these letters. If both sequences are CCC, then there's no odd ball. Otherwise, for just one of the two pans, the sequence is among the 12 above, and names the odd ball, whose weight is Above or Below the proper one according as the pan is A or B.

(b) In W weighings the odd ball can be identified from among

    N = (3^W − 3)/2    (4.49)

balls in the same way, by labelling them with all the non-constant sequences of W letters from A, B, C whose first change is A-to-B or B-to-C or C-to-A, and at the wth weighing putting those whose wth letter is A in pan A and those whose wth letter is B in pan B.
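The labelling rule in (b) is easy to implement and check. A small Python sketch (an illustrative aside, following the rule stated above) generates the labels for a given W, confirms that there are (3^W − 3)/2 of them, and checks that each weighing puts equal numbers of balls in pan A and pan B:

```python
from itertools import product

def conway_labels(W):
    """Non-constant length-W sequences over {A,B,C} whose first change of letter
    is A-to-B, B-to-C or C-to-A."""
    allowed_first_change = {('A', 'B'), ('B', 'C'), ('C', 'A')}
    labels = []
    for seq in product('ABC', repeat=W):
        changes = [(a, b) for a, b in zip(seq, seq[1:]) if a != b]
        if changes and changes[0] in allowed_first_change:
            labels.append(''.join(seq))
    return labels

for W in (3, 4, 5):
    labels = conway_labels(W)
    assert len(labels) == (3**W - 3) // 2
    for w in range(W):          # each weighing: as many balls in pan A as in pan B
        assert (sum(lab[w] == 'A' for lab in labels)
                == sum(lab[w] == 'B' for lab in labels))
    print(W, len(labels), labels[:4] if W == 3 else '...')
```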

Solution to exercise 4.15 (p.85). The curves (1/N)Hδ(X^N) as a function of δ for N = 1, 2 and 1000 are shown in figure 4.14. Note that H2(0.2) = 0.72 bits.

    N = 1
    δ           (1/N)Hδ(X)   2^{Hδ(X)}
    0–0.2       1            2
    0.2–1       0            1

    N = 2
    δ           (1/N)Hδ(X)   2^{Hδ(X)}
    0–0.04      1            4
    0.04–0.2    0.79         3
    0.2–0.36    0.5          2
    0.36–1      0            1
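The step positions and heights in this table can be recomputed with a few lines of Python, reusing the greedy construction of Sδ from section 4.3 (an illustrative check, with rounding chosen to match the table):

```python
from itertools import product
from math import log2, prod

def H_delta_table(PX, N):
    """Ranges of delta over which the smallest delta-sufficient subset of X^N
    has a given size k, with (1/N) H_delta and 2^{H_delta} = k."""
    probs = sorted((prod(c) for c in product(PX, repeat=N)), reverse=True)
    rows = []
    for k in range(len(probs), 0, -1):
        lo = sum(probs[k:])        # smallest delta for which size k suffices
        hi = sum(probs[k - 1:])    # at this delta a subset of size k-1 already suffices
        if hi > lo:
            rows.append((round(lo, 3), round(hi, 3), round(log2(k) / N, 2), k))
    return rows

for N in (1, 2):
    print('N =', N)
    for lo, hi, h, k in H_delta_table([0.2, 0.8], N):
        print(f'  delta in [{lo}, {hi}):  (1/N) H_delta = {h},  2^H_delta = {k}')
```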

Solution to exercise 4.17 (p.85). The Gibbs entropy is kB Σ_i pi ln 1/pi, where i runs over all states of the system. This entropy is equivalent (apart from the factor of kB) to the Shannon entropy of the ensemble. Whereas the Gibbs entropy can be defined for any ensemble, the Boltzmann entropy is only defined for microcanonical ensembles, which have a probability distribution that is uniform over a set of accessible states. The Boltzmann entropy is defined to be SB = kB ln Ω where Ω is the number of accessible states of the microcanonical ensemble. This is equivalent (apart from the factor of kB) to the perfect information content H0 of that constrained ensemble. The Gibbs entropy of a microcanonical ensemble is trivially equal to the Boltzmann entropy.

Figure 4.14. (1/N)Hδ(X) (vertical axis) against δ (horizontal), for N = 1, 2, 100 binary variables with p1 = 0.4.


We now consider a thermal distribution (the canonical ensemble), where the probability of a state x is

    P(x) = (1/Z) exp(−E(x)/(kB T)).    (4.50)

With this canonical ensemble we can associate a corresponding microcanonical ensemble, an ensemble with total energy fixed to the mean energy of the canonical ensemble (fixed to within some precision ε). Now, fixing the total energy to a precision ε is equivalent to fixing the value of ln 1/P(x) to within a corresponding precision ε/(kB T). Our definition of the typical set TNβ was precisely that it consisted of all elements that have a value of log P(x) very close to the mean value of log P(x) under the canonical ensemble, −N H(X). Thus the microcanonical ensemble is equivalent to a uniform distribution over the typical set of the canonical ensemble.

Our proof of the 'asymptotic equipartition' principle thus proves – for the case of a system whose energy is separable into a sum of independent terms – that the Boltzmann entropy of the microcanonical ensemble is very close (for large N) to the Gibbs entropy of the canonical ensemble, if the energy of the microcanonical ensemble is constrained to equal the mean energy of the canonical ensemble.

Solution to exercise 4.18 (p.85). The normalizing constant of the Cauchy distribution

    P(x) = (1/Z) · 1/(x² + 1)

is

    Z = ∫_{−∞}^{∞} dx 1/(x² + 1) = [tan⁻¹ x]_{−∞}^{∞} = π/2 − (−π/2) = π.    (4.51)

The mean and variance of this distribution are both undefined. (The distribution is symmetrical about zero, but this does not imply that its mean is zero. The mean is the value of a divergent integral.) The sum z = x1 + x2, where x1 and x2 both have Cauchy distributions, has probability density given by the convolution

    P(z) = (1/π²) ∫_{−∞}^{∞} dx1 [1/(x1² + 1)] [1/((z − x1)² + 1)],    (4.52)

which after a considerable labour using standard methods gives

    P(z) = (1/π²) · 2π/(z² + 4) = (2/π) · 1/(z² + 2²),    (4.53)

which we recognize as a Cauchy distribution with width parameter 2 (where the original distribution has width parameter 1). This implies that the mean of the two points, x̄ = (x1 + x2)/2 = z/2, has a Cauchy distribution with width parameter 1. Generalizing, the mean of N samples from a Cauchy distribution is Cauchy-distributed with the same parameters as the individual samples. The probability distribution of the mean does not become narrower as 1/√N.

The central-limit theorem does not apply to the Cauchy distribution, because it does not have a finite variance.

An alternative neat method for getting to equation (4.53) makes use of the Fourier transform of the Cauchy distribution, which is a biexponential e^{−|ω|}. Convolution in real space corresponds to multiplication in Fourier space, so the Fourier transform of z is simply e^{−|2ω|}. Reversing the transform, we obtain equation (4.53).
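This behaviour is easy to see by simulation. The sketch below (an illustrative aside; sample sizes are arbitrary) draws Cauchy samples by the inverse transform x = tan(π(u − 1/2)) and tracks the interquartile range of the sample mean, which refuses to shrink as N grows:

```python
import random
from math import tan, pi

def cauchy_sample():
    """Standard Cauchy via inverse transform sampling."""
    return tan(pi * (random.random() - 0.5))

def iqr(values):
    """Interquartile range (used because the mean and variance are undefined)."""
    v = sorted(values)
    return v[3 * len(v) // 4] - v[len(v) // 4]

random.seed(0)
trials = 2000
for N in (1, 10, 100, 1000):
    means = [sum(cauchy_sample() for _ in range(N)) / N for _ in range(trials)]
    print(N, round(iqr(means), 2))   # stays near 2, the IQR of a unit Cauchy, for all N
```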


Solution to exercise 4.20 (p.86). The function f(x) has inverse function

    g(y) = y^{1/y}.    (4.54)

Note

    log g(y) = (1/y) log y.    (4.55)

I obtained a tentative graph of f(x) by plotting g(y) with y along the vertical axis and g(y) along the horizontal axis. The resulting graph suggests that f(x) is single valued for x ∈ (0, 1), and looks surprisingly well-behaved and ordinary; for x ∈ (1, e^{1/e}), f(x) is two-valued. f(√2) is equal both to 2 and 4. For x > e^{1/e} (which is about 1.44), f(x) is infinite. However, it might be argued that this approach to sketching f(x) is only partly valid, if we define f(x) as the limit of the sequence of functions x, x^x, x^{x^x}, . . .; this sequence does not have a limit for 0 ≤ x ≤ (1/e)^e ≃ 0.07 on account of a pitchfork bifurcation at x = (1/e)^e; and for x ∈ (1, e^{1/e}), the sequence's limit is single-valued – the lower of the two values sketched in the figure.
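The two views of f described here – iterating the tower directly, and inverting g(y) = y^{1/y} – can be compared numerically with a short sketch (the iteration count is an arbitrary choice):

```python
def tower(x, iterations=2000):
    """Iterate y <- x**y; converges to f(x) when the limit exists."""
    y = x
    for _ in range(iterations):
        y = x ** y
    return y

def g(y):
    """Inverse function, equation (4.54)."""
    return y ** (1.0 / y)

for x in (0.5, 0.9, 1.2, 2 ** 0.5):
    y = tower(x)
    print(x, y, g(y))   # g(f(x)) reproduces x within the convergent range of the iteration
```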

Figure 4.15. f(x) = x^{x^{x^{...}}}, shown at three different scales.


About Chapte