Probabilistic Data Structure they say - Bloom Filters

Written on April 20, 2025

Bloom filters are fascinating probabilistic data structures that provide an efficient way to test whether an element is a member of a set. In this post, we’ll explore what Bloom filters are, the problems they solve, how to implement them, and their real-world applications.

What are Bloom Filters?

A Bloom filter is like a super-efficient membership card for a set. Instead of storing all the actual items, it uses a clever trick to remember if it has seen an item before. Here’s what makes it special:

  • If it says “No, I’ve never seen this item” → You can trust it 100%
  • If it says “Yes, I might have seen this item” → It could be wrong (but usually isn’t)
  • It uses way less memory than storing all the actual items
  • It’s super fast to check if something might be in the set

The Problem Bloom Filters Solve

The Challenge: Membership Testing at Scale

Let’s say you’re building a cool new app. Every time someone tries to pick a username for their profile, we need to check if it’s already taken. How I would’ve done it back in the day:

  1. Check the database for every new username
  2. Wait for the database to respond

This works fine for small apps, but imagine if you have millions of users:

  • Every username check means a database query
  • Database queries are slow and expensive
  • Your database gets bigger and slower as you add more users
  • You’re wasting resources checking for usernames that are already taken
  • Users will be pissed off if they can’t use the username, even after waiting for a long time.

How Bloom Filters Help

Bloom filters solve this problem by acting like a super-fast pre-check:

  1. They use a tiny amount of memory (much less than storing all usernames)
  2. They can check if a username is taken in constant time (super fast!)
  3. They never miss a taken username (no false negatives)
  4. They might occasionally say a username is taken when it’s not (false positive)

Think of it like a super-efficient security personnel who:

  1. Only needs to glance at you to decide
  2. Never lets in someone who shouldn’t be there
  3. Might occasionally stop someone who should be allowed in (but this is rare)

Here’s a visual representation of how a Bloom filter works:

Bloom Filter Visualization

In this diagram:

  • The bit array represents our Bloom filter
  • Each element is hashed by multiple hash functions
  • The corresponding bits are set to 1
  • To check membership, we hash the element and check if all corresponding bits are 1

Simple Bloom Filter Implementation in Java

Let’s implement a basic Bloom filter in Java:

import java.util.BitSet;
import java.util.Random;

public class BloomFilter {
    private final BitSet bitSet;
    private final int size;
    private final int numHashFunctions;
    private final Random random;

    public BloomFilter(int size, int numHashFunctions) {
        this.bitSet = new BitSet(size);
        this.size = size;
        this.numHashFunctions = numHashFunctions;
        this.random = new Random();
    }

    public void add(String element) {
        for (int i = 0; i < numHashFunctions; i++) {
            int hash = hash(element, i);
            bitSet.set(Math.abs(hash % size));
        }
    }

    public boolean mightContain(String element) {
        for (int i = 0; i < numHashFunctions; i++) {
            int hash = hash(element, i);
            if (!bitSet.get(Math.abs(hash % size))) {
                return false;
            }
        }
        return true;
    }

    private int hash(String element, int seed) {
        random.setSeed(seed);
        return element.hashCode() ^ random.nextInt();
    }
}

This implementation:

  • Uses a BitSet to store the filter
  • Implements multiple hash functions using a random seed
  • Provides add() and mightContain() operations
  • Has O(1) time complexity for both operations

Example Usage and Output

Let’s see how the Bloom filter works with a practical example. We’ll create a filter to check for common usernames:

public class BloomFilterExample {
    public static void main(String[] args) {
        // Create a Bloom filter with 1000 bits and 3 hash functions
        BloomFilter bloomFilter = new BloomFilter(1000, 3);
        
        // Add some usernames to the filter
        String[] usernames = {"alice", "bob", "charlie", "dave", "eve"};
        for (String username : usernames) {
            bloomFilter.add(username);
            System.out.println("Added username: " + username);
        }
        
        // Test some usernames
        String[] testUsernames = {
            "alice",    // Should return true (definitely in set)
            "bob",      // Should return true (definitely in set)
            "frank",    // Might return false (not in set)
            "george",   // Might return false (not in set)
            "charlie"   // Should return true (definitely in set)
        };
        
        System.out.println("\nTesting usernames:");
        for (String username : testUsernames) {
            boolean mightExist = bloomFilter.mightContain(username);
            System.out.printf("Username '%s': %s%n", 
                username, 
                mightExist ? "Might exist in set" : "Definitely not in set");
        }
    }
}

On running this code, we see below output:

Added username: alice
Added username: bob
Added username: charlie
Added username: dave
Added username: eve

Testing usernames:
Username 'alice': Might exist in set
Username 'bob': Might exist in set
Username 'frank': Definitely not in set
Username 'george': Definitely not in set
Username 'charlie': Might exist in set

This example demonstrates several important aspects of Bloom filters:

  1. Definite Absence: When the filter says an element is not in the set (like “frank” and “george”), we can be 100% certain it’s not there.

  2. Probable Presence: When the filter says an element might be in the set (like “alice”, “bob”, and “charlie”), we need to verify with the actual storage if we need to be certain.

  3. Space Efficiency: We’re using only 1000 bits to represent 5 usernames, which is much more space-efficient than storing the actual strings.

  4. Fast Lookups: Each check is performed in constant time O(1), regardless of how many elements are in the filter.

To make this more practical, let’s add a method to calculate the actual false positive rate:

public class BloomFilterWithStats extends BloomFilter {
    private int elementsAdded = 0;
    
    public BloomFilterWithStats(int size, int numHashFunctions) {
        super(size, numHashFunctions);
    }
    
    @Override
    public void add(String element) {
        super.add(element);
        elementsAdded++;
    }
    
    public double calculateFalsePositiveRate() {
        // Using the formula: (1 - e^(-k * n / m))^k
        double exponent = -numHashFunctions * elementsAdded / (double)size;
        return Math.pow(1 - Math.exp(exponent), numHashFunctions);
    }
    
    public static void main(String[] args) {
        // Create a larger filter for better statistics
        BloomFilterWithStats filter = new BloomFilterWithStats(10000, 7);
        
        // Add 1000 random strings
        for (int i = 0; i < 1000; i++) {
            filter.add("user" + i);
        }
        
        // Test 1000 new random strings
        int falsePositives = 0;
        int tests = 1000;
        
        for (int i = 1000; i < 2000; i++) {
            if (filter.mightContain("user" + i)) {
                falsePositives++;
            }
        }
        
        double actualFalsePositiveRate = (double)falsePositives / tests;
        double theoreticalFalsePositiveRate = filter.calculateFalsePositiveRate();
        
        System.out.println("Statistics:");
        System.out.printf("Elements added: %d%n", filter.elementsAdded);
        System.out.printf("Theoretical false positive rate: %.4f%%%n", 
            theoreticalFalsePositiveRate * 100);
        System.out.printf("Actual false positive rate: %.4f%%%n", 
            actualFalsePositiveRate * 100);
    }
}

Output of this stat augmented version:

Statistics:
Elements added: 1000
Theoretical false positive rate: 0.6184%
Actual false positive rate: 0.6000%

This shows how close the actual false positive rate is to the theoretical prediction, demonstrating the reliability of the mathematical model we discussed earlier.

Understanding False Positivity Rate

The false positive rate is like the “mistake rate” of a Bloom filter. It tells us how often the filter might say “Yes, I’ve seen this” when it actually hasn’t.

Calculating False Positivity Rate

The chance of a false positive depends on three things:

  1. How big is your filter? (bit array size)
  2. How many items have you added? (number of elements)
  3. How many hash functions are you using? (number of checks)

The formula looks complicated, but here’s what it means:

P(false positive) = (1 - e^(-k * n / m))^k

Where:

  • k = number of hash functions (like how many times we check)
  • n = number of items we’ve added
  • m = size of our filter (in bits)

Making the Filter Work Better

You can tune your Bloom filter like adjusting a radio:

  1. More hash functions = Better accuracy but slower
  2. Bigger filter = Better accuracy but uses more memory
  3. Fewer elements = Better accuracy

Here’s a simple rule of thumb:

  • For 1% false positive rate: Use about 10 bits per element
  • For 0.1% false positive rate: Use about 15 bits per element

For example, if you want to store 1 million items with 1% false positives:

  • You’ll need about 10 million bits (1.25 MB)
  • Use about 7 hash functions

Real-World Applications of Bloom Filters

1. Google Chrome’s Safe Browsing

Think of Chrome’s safe browsing like a bouncer for websites:

  • Instead of checking every website against a huge list of bad sites
  • It uses a Bloom filter to quickly check if a site might be dangerous
  • If the filter says “might be dangerous”, Chrome double-checks
  • This makes browsing faster and safer

2. Apache Cassandra

Cassandra uses Bloom filters like a smart librarian:

  • Instead of searching every bookshelf for a book
  • It quickly checks which shelves might have the book
  • Only searches those shelves
  • Makes finding data much faster

3. Medium’s Article Recommendation System

Medium uses Bloom filters like a personal reading assistant:

  • Keeps track of what you’ve read without storing every article
  • Uses very little memory
  • Helps avoid showing you the same article twice
  • Makes recommendations faster and more personal

4. Bitcoin and Ethereum

Cryptocurrencies use Bloom filters like a privacy guard:

  • Helps verify transactions without downloading everything
  • Keeps your transactions private
  • Makes the network more efficient
  • Reduces the amount of data you need to download

Conclusion

Bloom filters are powerful data structures that provide an excellent trade-off between space efficiency and probabilistic accuracy. They’re particularly useful in scenarios where:

  • Space is a constraint
  • False positives are acceptable
  • Fast lookups are required
  • The dataset is large

While they have their limitations (false positives), their benefits in terms of space efficiency and speed make them an invaluable tool in many mid to large scale systems.