# connection of entropy, minimum bits and coding length

Hello :)

I have two questions concerning "Digitaltechnik & Rechnerarchitektur".

1. On Slide 36 we define the maximal entropy. Which information does it exactly give us? Is it always the entropy rounded up so that we get our minimal number of bits to encode the alphabet?

2. Is there a conncection between the entropy and the average coding length of a letter in our alphabet? I understand that there is a conncetion to the weighted code size average but can we conclude something for our alphabet?

I may have a problem understanding the difference between the number of bits and the coding length. I hope someone can help me :)

Thank you,

Paula

retagged
Please tag question about topics of that course with "DiRa" which will forward the question to our tutors. That may help you to get quickly answers. See https://q2a.cs.uni-kl.de/477/questions-regarding-digitaltechnik-rechnerarchitektur.

Maybe my answer here is too long, but let me first clarify the context (since I think that it is required for understanding the final answer below): If we have any code that maps the letters σ_1,...,σ_n of an alphabet to code words (σ_1),...,(σ_n), then the length of a word α=a_1...a_k of length k where letter σ_i occurs n_i many times, i.e., n_i many of the a_1...a_k are letter σ_i, is

```   |(α)| = sum(j=1..k) |(a_j)| = sum(i=1..n) n_i * |(σ_i)|
```

On average, we have P(σ_i) = n_i/k, since that is the meaning of average probability of a letter σ_i. Hence, we have

```   |(α)| = k * sum(i=1..n) P(σ_i) * |(σ_i)|
```

To be independent of the length k of the word α, we therefore define the weighted code size average Size(P,) as

```   Size(P,) := sum(i=1..n) P(σ_i) * |(σ_i)|
```

To have minimal size code words (in average), we obviously have to minimize Size(P,), i.e., we have to find an encoding where Size(P,) is as small as possible.

Shannon's source coding theorem gives us now a lower bound for Size(P,) with the entropy H(P,Sigma) defined as H(P,Sigma) := - sum(i=1..n) P(σ_i) * log_2(P(σ_i)). In particular, it states that

```    (1) H(P,Sigma) <= Size(P,)
(2) H(P,Sigma) = Size(P,) iff |(σ_i)| = -log_2(P(σ_i))
```

You find the proof of this theorem on the slides. So, the weighted code size average Size(P,) cannot become less than the entropy H(P,Sigma), and the minimum H(P,Sigma) of Size(P,) is obtained with an encoding where |(σ_i)| = -log_2(P(σ_i)) holds.

Unfortunately, -log_2(P(σ_i)) is typically not an integer, so that we have to choose an integer nearby that real value. In that case, it can additionally be proved that

```   H(P,Sigma) <= Size(P,) < H(P,Sigma) + 1
```

This was already one of your questions. With a good encoding of code words of integer size, we may not be able to reach the minimum H(P,Sigma), but at least, we can achieve a weighted code size average Size(P,) less than H(P,Sigma) + 1.

Based on this theorem, we define the "information Inf(σ_i)" of letter σ_i as Inf(σ_i) := -log_2(P(σ_i)) which is measured in bits.

Now, we come to your question: What do we learn from slides 34-36? The question is here, which are worst-case alphabets for encoding, i.e., when do we have to choose longest encodings (σ_i)? By Shannon's source coding theorem, that means that these alphabets will have a maximal entropy. So, how big can the entropy be for an alphabet Sigma with n letters? To that end, we considered the function

```    h(x_1,...,x_n) := - sum(i=1..n) x_i * log_2(x_i)
```

and proved that this function has a maximum when x_i = 1/n holds. That means that alphabets are worst cases for encoding when the probabilities are the same, i.e., P(σ_i) = 1/n for each letter σ_i. In that worst case, we have

```    Hmax = - sum(i=1..n) P(σ_i) * log_2(P(σ_i))
= - sum(i=1..n) 1/n * log_2(1/n)
= sum(i=1..n) 1/n * log_2(n)
= log_2(n)
```

Hence, the larger the entropy, i.e., the closer the P(σ_i)'s are to each other, the greater is the "surprise" to see a particular letter σ_i. In other words, the smaller the entropy, i.e., the more different the P(σ_i)'s are to each other, the smaller is the "surprise" to see the one or the other letter. This is clear, since in the latter case, we have letters σ_i of high probability, and we are not surprised to see them. So, the entropy (the information) is a measure for the surprise to see a particular message.

So, lesson learned: Shannon showed us that there is a minimum for encoding and that minimum is the content of information in the data encoded. No compression algorithm can reduce that further, and therefore it does not make much sense to compress an already compressed file. There is a minimum size given by its information content that we have to accept.

The worst case is for example, where you need to encode all numbers from 0,...,2^{32}-1 where each number has the same probability 2^{-32}. In that case, every code word requires -log_2(2^{-32}) = 32 bits and that is what you typically find in computers. Such encodings are discussed in the next chapter by radix-2 and 2-complement numbers.

by (166k points)