Miloslav Homer




JA3/JA4 - TLS Client Fingerprinting

Part 1: Principles and Timeline


Are bots crawling, scraping or attacking your website? Bots today rotate IP addresses and spoof User-Agents, so let’s add TLS Fingerprinting using JA4 to identify and restrict them.

Both JA4 and its predecessor JA3 (broken since Chrome randomized the order of TLS extensions) inspect TLS ClientHello (extensions, ciphers, SNI, ALPN) to calculate a hopefully unique hash.

JA4 TLS fingerprint strengths:

The JA4+ suite extends fingerprinting to SSH, ServerHello, HTTP, QUIC, and more. These additional methods are patent pending and licensed under the FoxIO license, restricting some commercial uses.

It's not a silver bullet though:

  • Collisions in the DB: the top 10 hashes account for roughly half of the DB,
  • Attackers can mimic valid clients,
  • Encrypted Client Hello (ECH) breaks passive PCAP analysis,
  • New RFCs might demand JA4 modification.

Think of JA4 as a supplement to the classic analysis of IPs, ASNs, User-Agents.

Let's dive into the details, starting with setting the stage.

Identifying/Fingerprinting HTTPS Clients

There’s a cat and mouse game being played between the servers and the clients. Servers need to identify clients to block malicious probes, mitigate scraping, or enforce policies, while clients (from privacy-conscious users to malware) often prefer to blend in. Both sides have merit. Obviously something has to give.

In this article, we’ll side with the servers and look for techniques to identify/fingerprint clients who might not want to be identified.

If you’d like a short list of reasons for motivation:

  • attacking probes usually want to avoid detection,
  • scrapers/scalpers can overwhelm your site,
  • some actions, like registration, should require a human touch (although a captcha or phone number check is the better tool here).

SSL/TLS Handshake

HTTPS, or secure HTTP, is so common right now that it’s barely worth a mention. It’s an encrypted channel that protects most of the traffic on the internet. But first you have to establish this channel, and to do that you use a so-called handshake.

You probably know the three-way TCP handshake or the 7 layer networking model. This one is similar in spirit, but it runs on top of TCP, above the transport layer (no. 4) and below the HTTP traffic it protects.

[Figure: SSL/TLS handshake diagram. Yes, taken directly from my BSides presentation on this topic.]

Several things happen. The client reaches out to the server with the ClientHello message. Note that the server doesn’t influence the ClientHello in any way, as this is the first message.

The server then provides info about its capabilities and its certificates. This is done because clients need to verify they’re connecting where they’re supposed to without any men/women/other in the middle. Note that the server doesn’t verify the client [1].

Then they run a key exchange protocol. The classic basis is the Diffie-Hellman key exchange, but we’re past the textbook version: there are other crypto manoeuvres you can perform, like DH with elliptic curves, DH combined with post-quantum algorithms, or simply reusing previously established keys.

The core idea is then simple - different clients could have different Client Hello payloads. And since the server doesn’t influence them, they should be consistent across connections.

JA3 (2017)

The idea of using Client Hello data to fingerprint clients is somewhat recent. I’ve found this blogpost from 2015 which seems to be the oldest reference. I suspect this is because you’d need ubiquitous HTTPS to make this work, so that had to happen first.

JA3 has its origins at Salesforce. We have John Althouse, Jeff Atkinson, and Josh Atkins to thank for it (count the initials of the names: three times JA).

Simply put (quoting the original blog post):

JA3 gathers the decimal values of the bytes for the following fields in the Client Hello packet; SSL Version, Accepted Ciphers, List of Extensions, Elliptic Curves, and Elliptic Curve Formats. It then concatenates those values together in order, using a "," to delimit each field and a "-" to delimit each value in each field.

Or, in another language:

import hashlib

def ja3(c_hello):
    # c_hello is an already-parsed ClientHello (GREASE handling comes later)
    t = str(c_hello.tls_version)
    c = "-".join(str(x) for x in c_hello.ciphers)
    ex = "-".join(str(x.id) for x in c_hello.extensions)
    ec = "-".join(str(x) for x in c_hello.elliptic_curves)
    pf = "-".join(str(x) for x in c_hello.ec_point_formats)
    # "," delimits the fields, "-" delimits the values within a field
    in_str = f"{t},{c},{ex},{ec},{pf}"
    return hashlib.md5(in_str.encode()).hexdigest()

Sounds simple, right? It’s not. There are many RFCs, like RFC8446, defining these structures and protocols. Parsing all of the needed extensions while supporting the full historical capabilities takes some time to get right.

The problem then becomes that of getting these data from the packets to that algorithm. I’ve re-implemented the JA3 from spec and I’ve also implemented PCAP parsing (even with libraries it’s still work). Let me tell you, it’s not pretty. But it works.
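To give a feel for that parsing work, here is a minimal sketch (not the implementation mentioned above) that pulls the JA3-relevant fields out of a single, unfragmented TLS record. The name `parse_client_hello` and the dictionary keys are made up for this example. It assumes the bytes start at the TLS record header, and it ignores TCP reassembly, fragmented records and SSLv2-style hellos, which is where most of the real pain lives.

```python
import struct

def parse_client_hello(record):
    """Return the JA3-relevant fields of one unfragmented ClientHello record."""
    pos = 0

    def take(n):
        # Consume n bytes, failing loudly on truncated input.
        nonlocal pos
        if pos + n > len(record):
            raise ValueError("truncated ClientHello")
        chunk = record[pos:pos + n]
        pos += n
        return chunk

    def take_vector(len_bytes):
        # TLS vectors carry a big-endian length prefix of len_bytes bytes.
        return take(int.from_bytes(take(len_bytes), "big"))

    def u16_list(blob):
        # Split a byte string into big-endian 16-bit integers.
        return [int.from_bytes(blob[i:i + 2], "big")
                for i in range(0, len(blob) - 1, 2)]

    content_type = take(1)[0]
    take(4)                          # record-layer version + record length
    handshake_type = take(1)[0]
    take(3)                          # handshake message length
    if content_type != 22 or handshake_type != 1:
        raise ValueError("not a ClientHello handshake record")

    version = int.from_bytes(take(2), "big")
    take(32)                         # client random
    take_vector(1)                   # legacy session id
    ciphers = u16_list(take_vector(2))
    take_vector(1)                   # compression methods

    extensions, curves, point_formats = [], [], []
    ext_block = take_vector(2) if pos < len(record) else b""
    i = 0
    while i + 4 <= len(ext_block):
        ext_type, ext_len = struct.unpack_from("!HH", ext_block, i)
        data = ext_block[i + 4:i + 4 + ext_len]
        i += 4 + ext_len
        extensions.append(ext_type)
        if ext_type == 10:           # supported_groups: u16 length, then u16 values
            curves = u16_list(data[2:])
        elif ext_type == 11:         # ec_point_formats: u8 length, then u8 values
            point_formats = list(data[1:])

    return {
        "tls_version": version,
        "ciphers": ciphers,
        "extensions": extensions,
        "elliptic_curves": curves,
        "ec_point_formats": point_formats,
    }
```

The returned lists keep the order seen on the wire, which is exactly what JA3 hashes.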

Problems with JA3

Over time, internet standards evolved which interfered with the JA3 algorithm.

GREASE

The first wrinkle was the so-called GREASE from RFC8701. It expands to “Generate Random Extensions And Sustain Extensibility” and it’s basically a way for clients to troll the servers/routers with useless values. The added value of this approach is that of a more-resilient ecosystem as servers/routers are forced to comply with standards properly.

In a nutshell, there are some reserved values that could be randomly inserted, but should be ignored. Here’s the list:

GREASE_LIST = [
    0x0A0A, 0x1A1A, 0x2A2A, 0x3A3A,
    0x4A4A, 0x5A5A, 0x6A6A, 0x7A7A,
    0x8A8A, 0x9A9A, 0xAAAA, 0xBABA,
    0xCACA, 0xDADA, 0xEAEA, 0xFAFA,
]

So JA3 implemented a simple fix to ignore these values. Fixed, yay!
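A minimal sketch of that fix, assuming the values have already been parsed into integers. The helper names `is_grease` and `strip_grease` are mine, not from any JA3 implementation. The set is generated from the pattern visible in the list above: both bytes are equal and end in the nibble `A`.

```python
# Same 16 values as the list above, generated from the pattern 0x?A?A.
GREASE_VALUES = frozenset(0x0A0A + 0x1010 * i for i in range(16))

def is_grease(value):
    """True if the value is one of the reserved GREASE code points."""
    return value in GREASE_VALUES

def strip_grease(values):
    """Drop GREASE entries while keeping the original order of the rest."""
    return [v for v in values if v not in GREASE_VALUES]

# A cipher list as a GREASE-enabled client might send it.
ciphers = [0x0A0A, 0x1301, 0x1302, 0xFAFA]
clean = strip_grease(ciphers)
```

The same filter applies to every list that feeds the hash: ciphers, extensions and elliptic curves.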

Chrome TLS Extension Permutation (2022)

Keen eyes might have noticed that there is no sorting performed on the TLS extensions. There’s a natural sorting order as each extension has a numerical ID.

Keener eyes at Google have realized that the RFC8446 standard doesn’t require any particular order (see page 38). So, just like that they’ve added a feature to Chrome that randomizes that order. The claimed benefit is that of a more robust ecosystem.
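To see why this hurts, here is a toy demonstration. It is a simplified, JA3-like hash over made-up cipher and extension IDs, not the real JA3 or JA4 algorithm. The same set of extensions in a different order gives a different hash unless you sort first.

```python
import hashlib

def fingerprint(ciphers, extensions, sort_extensions=False):
    """Toy JA3-like hash; optionally sorts extension IDs before hashing."""
    if sort_extensions:
        extensions = sorted(extensions)
    text = "-".join(map(str, ciphers)) + "," + "-".join(map(str, extensions))
    return hashlib.md5(text.encode()).hexdigest()

ciphers = [4865, 4866, 4867]
first_connection = [0, 23, 65281, 10, 11, 35, 16]
# Same client, same extensions, different order on the wire.
second_connection = list(reversed(first_connection))

unstable = (fingerprint(ciphers, first_connection),
            fingerprint(ciphers, second_connection))
stable = (fingerprint(ciphers, first_connection, sort_extensions=True),
          fingerprint(ciphers, second_connection, sort_extensions=True))

print("order-sensitive hashes match:", unstable[0] == unstable[1])
print("sorted hashes match:", stable[0] == stable[1])
```

Sorting throws away the ordering information, so it trades some uniqueness for stability.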

Look, I need to say that Google has an interest in keeping some bots unfingerprintable - without crawling the whole internet, there’s no Google.

Other Issues

At this point, the authors were aware of other minor drawbacks to the scheme, namely:

  • MD5 hashes don’t show the full information,
    • You can’t compare partial hashes,
  • Inability to evolve the hashes with new TLS extensions,
  • A resumed session has a different hash than a new session for the same client, as per RFC5077,
  • Probably plenty of other minor issues.

So instead of breaking backwards compatibility with JA3 by sorting the extensions in response to the Chrome feature, they’ve decided to just publish a new major version, abandoning JA3.

JA4 (2023)

If you’re asking about a fourth person with the JA initials, I have good news. We now have John Althouse, Jeff Atkinson, Josh Atkins and Joshua Alexander. There are plenty of additional contributors listed, but I’ve found this detail funny.

This is a complete rework of the scheme. Pretty much the only thing they have in common is that they use client hello to fingerprint the client. The summary is best shown in the picture they’ve provided:

[Figure: JA4 Overview (by FoxIO)]
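In text form, and as I read the published JA4 description, a fingerprint has three parts joined by underscores: a readable 10-character prefix, a truncated SHA-256 over the sorted cipher list, and a truncated SHA-256 over the sorted extensions plus signature algorithms. The sketch below only splits and decodes the readable prefix. The names `Ja4Parts` and `split_ja4` are made up for this example, and the input is one of the hashes from the DB listing later in this article.

```python
from dataclasses import dataclass

TLS_VERSIONS = {"13": "TLS 1.3", "12": "TLS 1.2", "11": "TLS 1.1",
                "10": "TLS 1.0", "s3": "SSL 3.0", "s2": "SSL 2.0"}

@dataclass(frozen=True)
class Ja4Parts:
    transport: str         # "TCP" or "QUIC"
    tls_version: str
    sni_present: bool      # "d" = SNI (domain) present, "i" = no SNI
    cipher_count: int
    extension_count: int
    alpn: str              # first and last character of the first ALPN value
    cipher_hash: str
    extension_hash: str

def split_ja4(fingerprint):
    """Split a JA4 string into its readable prefix fields and two hashes."""
    prefix, cipher_hash, extension_hash = fingerprint.split("_")
    if len(prefix) != 10:
        raise ValueError("unexpected JA4 prefix length")
    return Ja4Parts(
        transport={"t": "TCP", "q": "QUIC"}.get(prefix[0], prefix[0]),
        tls_version=TLS_VERSIONS.get(prefix[1:3], prefix[1:3]),
        sni_present=prefix[3] == "d",
        cipher_count=int(prefix[4:6]),
        extension_count=int(prefix[6:8]),
        alpn=prefix[8:10],
        cipher_hash=cipher_hash,
        extension_hash=extension_hash,
    )

parts = split_ja4("t13d1516h2_8daaf6152771_d8a2da3f94cd")
print(parts.transport, parts.tls_version, parts.cipher_count, parts.alpn)
```

Because the prefix and the two hashes are separate, you can compare fingerprints partially, which was not possible with a single MD5.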

It’s actually a suite of fingerprinting methods. We can try to identify servers, SSH connections, HTTP requests by their headers, and plenty of other things that are not the focus of this article.

Nowadays, JA4 (and to a lesser extent JA4+) is widely supported by many vendors and open source technologies (check the official list).

John is very motivated to keep this list up-to-date, because of the licensing. The full license details can be found in the repo. The JA4 that is the focus of this article is licensed under BSD-3 while the rest of the suite is licensed under FoxIO License 1.1.

I’m not a lawyer (if you are, or need to be, you have to read the licenses above, sorry), so here’s my quick’n’dirty summary. You can use JA4 to defend your own assets, but you can’t use the rest of the suite to sell protection (e.g. Cloudflare should be paying).

And as a cherry on top, there is a JA4 database that was recently updated. This is a huge upgrade: the DB grew by 325%, making it so much more useful.

Issues With JA4

Of course, JA4 is not a silver bullet that would solve all of our fingerprinting needs straight away.

JA4 Collisions Recorded in The DB

Keeping the focus on the DB, the increased volume is also a weakness. I ran a sophisticated data analysis pipeline:

cat ja4db_new.json | jq | grep '"ja4_fingerprint"' | \
cut -f2 -d':' | tr -d ',' | tr -d '"' | \
sort | uniq -c | sort -n | \
sed 's/  */ /g' | sed 's/^ //' | sed 's/  *$//' | tr ' ' ',' \
 > ja4db_counts.csv
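If you prefer something less shell-shaped, the same count can be sketched in Python. This assumes the JSON file is a top-level array of objects carrying a `ja4_fingerprint` key, which is what the grep above implies but is not something I have checked against a schema. The function name `count_fingerprints` is made up. Unlike the pipeline, it skips records where the fingerprint is missing or null.

```python
from collections import Counter

def count_fingerprints(records, key="ja4_fingerprint"):
    """Count how often each JA4 fingerprint occurs in a list of DB records."""
    return Counter(r[key] for r in records if r.get(key))

# Tiny made-up sample standing in for the real DB export.
sample = [
    {"ja4_fingerprint": "t13d1516h2_8daaf6152771_d8a2da3f94cd"},
    {"ja4_fingerprint": "t13d1516h2_8daaf6152771_d8a2da3f94cd"},
    {"ja4_fingerprint": "t13d3012h2_1d37bd780c83_b26ce05bbdd6"},
    {"ja4_fingerprint": None},
    {},
]
counts = count_fingerprints(sample)
for fingerprint, count in counts.most_common(10):
    print(count, fingerprint, sep=",")
```

For the real file you would load the records with `json.load` first and pass the resulting list in.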

Yes, it counts occurrences of each unique JA4 TLS hash. Let me just paste the top 10 hashes:

count,ja4
35043,t13d1516h2_8daaf6152771_d8a2da3f94cd
28085,t13d3012h2_1d37bd780c83_b26ce05bbdd6
17830,t13d171300_5b57614c22b0_43ade6aba3df
17153,t13d1516h2_8daaf6152771_02713d6af862
10892,t13d1712h2_5b57614c22b0_ef7df7f74e48
9809,t13i170800_5b57614c22b0_97f8aa674fd9
8568,t13d131000_f57a46bbacb6_e7c285222651
8543,t13d170900_5b57614c22b0_97f8aa674fd9
8149,t13d0911h2_f91f431d341e_3fcd1a44f3e3
7448,t13d2212h2_231e334592e8_36bf25f296df

That means out of 314783 entries, approximately half (151520) are captured in the top 10 hashes. There is, of course, more to it, so I’m working on a follow-up where I work through this properly.
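The arithmetic behind "approximately half", using only the counts listed above and the total quoted in the text:

```python
# Counts of the top 10 hashes, copied from the listing above.
top10 = [35043, 28085, 17830, 17153, 10892, 9809, 8568, 8543, 8149, 7448]
total_entries = 314783

covered = sum(top10)
share = covered / total_entries

print(covered, f"{share:.1%}")  # prints: 151520 48.1%
```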

Evasion and Mimicry

Another issue (this time unsolvable) is that attackers have full control over the client hello properties. That means they can mimic browsers or strip down the extensions to the bare minimum.

There are several such solutions available on the internet, I’m sure you can find and deploy one. Before you start complaining, I think attackers should work for their attacks and today I'm not helping them.

Combined with the DB collisions, the usual tricks (ASNs, User-Agents, behavioral analytics) are still required to properly fingerprint clients.
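As a rough illustration of using JA4 as one signal among several, here is a sketch that cross-checks a fingerprint against the claimed User-Agent and the source ASN. Everything in it is hypothetical: the `Request` shape, the `suspicion_reasons` helper, the placeholder fingerprint, and the lookup tables, which you would have to build from your own traffic.

```python
from dataclasses import dataclass

@dataclass
class Request:
    ja4: str
    user_agent: str
    asn: int

def suspicion_reasons(req, expected_ua_by_ja4, flagged_asns):
    """Return human-readable reasons why a request looks inconsistent."""
    reasons = []
    expected = expected_ua_by_ja4.get(req.ja4)
    if expected is None:
        reasons.append("unknown JA4")
    elif expected.lower() not in req.user_agent.lower():
        reasons.append("JA4 does not match User-Agent")
    if req.asn in flagged_asns:
        reasons.append("flagged ASN")
    return reasons

# Made-up lookup data: a placeholder fingerprint and a private-use ASN.
expected_ua_by_ja4 = {"t13d1516h2_aaaaaaaaaaaa_bbbbbbbbbbbb": "Chrome"}
flagged_asns = {64512}

req = Request(ja4="t13d1516h2_aaaaaaaaaaaa_bbbbbbbbbbbb",
              user_agent="curl/8.0", asn=64512)
print(suspicion_reasons(req, expected_ua_by_ja4, flagged_asns))
```

Returning reasons rather than a single score keeps the decision (block, challenge, log) with whoever consumes the result.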

PCAPs and Encrypted Client Hello

A word of caution. If you’d like to deploy this by passively reading traffic (e.g. by storing it into PCAPs), beware that Encrypted Client Hello (ECH) is now released and widely supported. With ECH, the interesting part of the client hello is encrypted, so a passive observer no longer sees it and you have to compute JA4 where TLS is terminated. Implementing that is annoying. In a sense it’s straightforward, but the amount of technical detail you have to get right to insert yourself between the TLS handshake and the HTTP server at scale is a lot.

The Next Round of Cat and Mouse

I think we can expect more modifications to JA4 as the standards evolve in time. In fact, they might already exist, but are not published for the general public. And you know what? I respect it. This is a hard cat and mouse game to play so keeping some cards hidden is an advantage.

See you in the logs.


1. Actually, servers can also use certificates to verify the clients - this is called mutual TLS (or mTLS). I think this is one of the better ways to secure service-to-service communication, but it seems that’s just me.