Welcome! Here’s what I want you to take away from this article.
To calculate the entropy of a word, use this simple formula:
\(H = -\sum_{i=1}^{n} p_i \log_2 p_i\)
The p_i is the ratio of character occurrences to the word length and n is the count of unique letters. Sample implementation here.
Non-secret text hovers around 4; truly random strings (compressed, encrypted, etc.) start at 5 and top out at 8 for ASCII texts.
If you are working with source code, there’s some overlap between secrets and non-secrets, leading to false positives. Tokenize the text properly according to the grammar of the language.
Definitely go read the seminal paper A Mathematical Theory of Communication (C. E. Shannon, 1948) as well! It's a piece of history, basically the cornerstone of information theory.
The rest of this article explains the facts above and provides more examples and intuition.
In the previous article, we explored secret detection in breadth, covering various topics and approaches. But I’d like to do a deeper dive on entropy and, of course, the seminal paper A Mathematical Theory of Communication (C. E. Shannon, 1948). Time to put my mathematician hat back on.
Dissecting The Paper
Let’s recall the core concept - entropy. Entropy measures the overall uncertainty within the data: the higher the entropy, the harder it is to predict the next character in the sequence. Another way of looking at this is - if we had the best possible encoding, what is the smallest number of bits we would still need to send?
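To make the "best possible encoding" intuition concrete, here is a small worked example (my own, not from the paper): take a three-letter alphabet where a appears half the time and b and c a quarter each. An optimal prefix code assigns a→0, b→10, c→11, and the average code length matches the entropy exactly:
\[
H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{4}\log_2\tfrac{1}{4} + \tfrac{1}{4}\log_2\tfrac{1}{4}\right) = \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5 \text{ bits per character}
\]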
High-entropy strings in text are likely to be secrets. Of course, not all high-entropy strings are secrets - this approach is not immune to false positives. Be careful with the other direction too: not all secrets are high entropy. That was a lesson learned from the CTF, where you’ll be detecting placeholders and flags.
Definitions
Please refer to section 6 in the paper. Let's agree that a word is a sequence of characters from an alphabet. Let's also agree that we don't care about letters that don't appear and can never appear in the word, so the alphabet is simply the set of letters present in the word we analyze.
Given a word, we can measure how often each character appears. This gives us a distribution of probabilities for that word and alphabet.
Now what we want to know is how hard it is to predict the next character in the sequence. Examples (with the worked numbers right after the list):
word aaaaa: the probability of a is equal to 1. Trivially, the next character is a. We'd like our function to return some small value for this (zero, even).
word aaaabaabaaabaaa: the probability of a is 4/5 and the probability of b is 1/5. Here we want a higher value, but not too high.
word 137e5a7da48c5c7aac6a8cb8959e63b5: each hex digit (0-9 and a-f) appears with probability close to 1/16, approaching the theoretical maximum over this small alphabet, so the entropy is high here. Yes, it's an MD5 hash.
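Plugging these into the formula from the takeaways gives a feel for the scale (the numbers below are my own back-of-the-envelope calculations):
\[
H(\texttt{aaaaa}) = -(1 \cdot \log_2 1) = 0
\]
\[
H(\texttt{aaaabaabaaabaaa}) = -\left(\tfrac{4}{5}\log_2\tfrac{4}{5} + \tfrac{1}{5}\log_2\tfrac{1}{5}\right) \approx 0.72
\]
\[
H(\text{perfectly uniform hex string}) = -16 \cdot \tfrac{1}{16}\log_2\tfrac{1}{16} = \log_2 16 = 4
\]
(The hash above comes in a touch lower, around 3.5 bits, because its sixteen symbols are not perfectly uniformly distributed.)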
If this reminds you of LLMs, great catch. With LLMs we've managed to do so much statistics on natural language and its texts that we can "predict" the next character given the context we have. And really, by predict I mean that the human eye doesn't find the choice of the next characters weird. There is no truth embedded in the language, only words. I wonder if the entropy of a prompt correlates with the entropy of the output?
Let’s say that H is the function calculating entropy for the given text. What’s weird about it is that it takes as many parameters as necessary. Formally it takes the character distribution (the set of probabilities of the given characters appearing). Trivially, extending the alphabet extends the input list for this function. Going back to the alphabet remark - nothing really happens if the probability of a character appearing is zero, since by convention \(0 \cdot \log 0 = 0\) and the term simply drops out.
Properties of Entropy
Formalizing the requirements of this measure H we're after:
Continuous function - duh, we don’t want any holes or skips.
For uniform distributions with n characters, each with probability 1/n (like the 3rd example), H should monotonically increase with n. This matches the observation that a larger alphabet can encode more information in a word of a given length; in fact, for the uniform case H works out to exactly \(\log_2 n\).
I don’t have witty remarks here. To me it resembles a transitive property, but I don’t dare call it that. Let's say we have p1 = 1/2, p2 = 1/3 and p3 = 1/6. There's a 1/2 probability that the next character will be one of those associated with p2 or p3. If that happens, we have new probabilities in the smaller alphabet - 2/3 for p2 and 1/3 for p3. So we need that:
\(H(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}) = H(\tfrac{1}{2}, \tfrac{1}{2}) + \tfrac{1}{2} H(\tfrac{2}{3}, \tfrac{1}{3})\)
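We can sanity-check this equality numerically (my own arithmetic, not from the paper):
\[
H(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}) = \tfrac{1}{2}\log_2 2 + \tfrac{1}{3}\log_2 3 + \tfrac{1}{6}\log_2 6 \approx 1.459
\]
\[
H(\tfrac{1}{2}, \tfrac{1}{2}) + \tfrac{1}{2}H(\tfrac{2}{3}, \tfrac{1}{3}) = 1 + \tfrac{1}{2}\left(\tfrac{2}{3}\log_2\tfrac{3}{2} + \tfrac{1}{3}\log_2 3\right) \approx 1.459
\]
Both sides come out at about 1.459 bits, as required.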
Shannon proves that the only function H with these desired properties is the one he found (up to the choice of a positive constant K, which just picks the unit):
\(H = -K\sum_{i=1}^{n} p_i \log p_i\)
With K = 1 and base-2 logarithms we measure in bits and get exactly the formula from the takeaways.
Moreover, in the paper he deduced the following properties:
If we have a single character with probability 1, we are certain, so entropy is zero (example 1).
The maximum entropy for a given alphabet of size n is achieved with uniform probabilities equal to 1/n (example 3).
A triangle-style inequality (more precisely, subadditivity of the joint entropy):
\(H(x, y) \le H(x) + H(y)\), with equality if x and y are independent.
There are a couple more properties that didn’t strike me as particularly intuitive, so if you are interested in more detail, go read the paper, as I would only be repeating its arguments at best and butchering them at worst.
Implementation
Using the formula above, we can build this entropy-based detector in Python:
#!/usr/bin/env python3
from math import log

SECRET_THRESHOLD = 4
FILENAME = "test.py"


def word_entropy(s: str):
    """Calculates Shannon entropy for the word *s*"""
    counter = dict()
    for c in s:
        if c not in counter:
            counter[c] = 0
        counter[c] += 1
    # frequency of each character as a ratio of the word length
    freqs = [counter[i] / float(len(s)) for i in counter]
    return -1 * sum([f * log(f, 2) for f in freqs])


def find_secrets_in_text(text, secret_threshold):
    """Returns words from *text* that have Shannon entropy above the *secret_threshold*."""
    words = " ".join(text.split("\n")).split(" ")
    secrets = [w for w in words if word_entropy(w) > secret_threshold]
    return secrets


if __name__ == "__main__":
    filename = FILENAME
    with open(filename, "r") as f:
        data = f.read()
    for s in find_secrets_in_text(data, SECRET_THRESHOLD):
        print(s)
Yes, I know that much more can be done with regard to actual file scanning. But here I focus on the entropy aspect, so a single file will do.
Threshold value tuning
We can compute the entropy of a word, but where do we draw the line to differentiate secrets? You’ve surely noticed the SECRET_THRESHOLD constant at the top of the file.
Recall that the theoretical maximum is achieved when the probability of each character appearing is uniform. With that, we can take a look at some of the common character sets we encounter (a quick way to compute these maxima is sketched below):
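Since the maximum for an alphabet of size n is just \(\log_2 n\), we can compute the ceiling for a few common character sets ourselves. A minimal sketch (the particular charset selection and labels are my own, picked to match the examples discussed below):

from math import log2
from string import ascii_letters, digits, printable

# theoretical maximum entropy per character = log2(alphabet size)
charsets = {
    "digits (PINs)": digits,                  # 10 symbols
    "hex (MD5, SHA-*)": digits + "abcdef",    # 16 symbols
    "base64": ascii_letters + digits + "+/",  # 64 symbols (ignoring '=' padding)
    "ASCII printable": printable.strip(),     # 94 symbols, whitespace stripped
}

for name, chars in charsets.items():
    print(f"{name:18} {log2(len(set(chars))):.2f} bits/char")

That comes out to roughly 3.3 bits per character for digit-only strings, 4 for hex, 6 for base64 and about 6.6 for the full printable ASCII range.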
Our text will be somewhere in the ASCII printable charset, so we have an upper bound. Now we need a baseline - how much entropy is in normal English text or a normal source code file? Fortunately, Claude Shannon delivers once again in the Prediction and Entropy of Printed English paper, where the entropy of the English language gets an upper bound estimate of about 4 bits per character.
We are interested in source code though, and here I wasn’t so lucky. This notebook by Tomek Korbak seemed like the best source to me - let’s check his table of example entropies for various languages:
Eyeballing the table gives us 5.25 as the upper bound estimate for source code, as opposed to approximately 4 for plain English.
You probably see the issue if you put these numbers side by side. Even the most random string composed only of digits (e.g. a hard-coded PIN) can never exceed \(\log_2 10 \approx 3.32\) bits, so we might not catch it against the background entropy of source code. The case is similar with base64 - a drop of randomness plus code complexity and we’re sad.
Given this data, what we can do is set the threshold to 5.3 and pray. Or we can try a “curve” approach - returning, say, the top 1% of strings with the highest entropy (but that guarantees false positives - every file would report secrets, as sketched below).
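A minimal sketch of that percentile idea, reusing the word_entropy function from the detector above (the 1% cut-off is arbitrary and illustrates the problem: the result is never empty for a non-trivial file):

def top_entropy_words(text: str, fraction: float = 0.01):
    """Return the *fraction* of words with the highest entropy, regardless of any threshold."""
    words = [w for w in text.split() if w]
    ranked = sorted(words, key=word_entropy, reverse=True)
    keep = max(1, int(len(ranked) * fraction))  # always reports at least one "secret"
    return ranked[:keep]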
A better practical approach works with tokenization - source code differs from written text, and you need to parse it into words that make sense. As a simple example, consider this function call that you might find in Java code:
System.out.println("=== "+keyTitle +", "+key.getAlgorithm()+", "+key.getEncoded().length+" length,"+key.getFormat()+" ===")
This string scores a respectable entropy of 4.59 and might cause issues when fine-tuning the secret threshold. But if we break it into tokens by splitting at the pluses and dots, the maximum entropy of any token in the sequence drops to around 3.25.
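Here is a rough sketch of that idea, again reusing word_entropy. Splitting on a handful of punctuation characters is my own simplification - a real implementation would use a proper lexer for the language at hand:

import re

def max_token_entropy(line: str) -> float:
    """Split *line* on common code punctuation and return the highest per-token entropy."""
    # crude tokenizer: break on whitespace, dots, pluses, quotes, parentheses and commas
    tokens = [t for t in re.split(r'[\s.+"(),]+', line) if t]
    if not tokens:
        return 0.0
    return max(word_entropy(t) for t in tokens)

java_line = 'System.out.println("=== "+keyTitle +", "+key.getAlgorithm()+", "+key.getEncoded().length+" length,"+key.getFormat()+" ===")'
print(word_entropy(java_line))       # entropy of the whole line: high
print(max_token_entropy(java_line))  # entropy of the worst token: much lower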
What I am saying is, this is no silver bullet and there’s still work to be done.
Running the Script
So, let’s try our script with a simple test file, shall we?
#!/usr/bin/env python3
from api_client import Client

if __name__ == "__main__":
    username = "name.surname@whatever.com"
    password = "I7Xr+HMvS9fjii5GTa8IZLH4yaP4B2UDIPAg9lgh3u4="
    pin = "336351"

    client = Client()
    client.login(username, password)
    data = client.get_data()
I’ve generated the password using this bash one-liner:
cat /dev/urandom | head -c 32 | base64
Our secret in the test file has an entropy of 5.012, so the 5.3 threshold didn’t detect it. Going from the other direction and setting the threshold to 4 worked much better - this secret was the only string returned, so no false positives. I’d say that 5 is a good start and then tune up or down depending on your data.
As expected, we weren’t so lucky with the PIN. With an entropy of 2.15, only 5 words in the file have lower values (“if”, “pin”, “data”, “from” and “__name__”), so to catch it we would have to accept pretty much the whole file as false positives.
Since I did the OWASP WrongSecrets challenge, I’ve calculated the entropy of each secret I found:
As you can see, my simple algorithm would be pretty SOL on this dataset. Only 6 secrets score above 4.5 - those I am confident I could detect (well, if they were actually in the sources and not in clouds and vaults).
Key Takeaways
I’ll let you be the judge of whether I managed to explain these takeaways. Starting with the formula to calculate entropy:
\(H = -\sum_{i=1}^{n} p_i \log_2 p_i\)
The p_i is the ratio of character occurrences to the word length and n is the count of unique letters. Sample implementation here.
Non-secret text hovers around 4; truly random strings (compressed, encrypted, etc.) start at 5 and top out at 8 for ASCII texts.
If you are working with source code, there’s some overlap between secrets and non-secrets, leading to false positives. Tokenize the text properly according to the grammar of the language.
Definitely go read the seminal paper A Mathematical Theory of Communication (C. E. Shannon, 1948) as well! It's a piece of history, basically the cornerstone of information theory.