From History to Practice: How We Find Privileged Communications
In our last post, we traced the history of AI — from handcrafted rules to today’s large language models — and concluded that while GenAI is powerful, older approaches are still highly relevant. Now, let’s look at how that principle shows up in a real tool we’ve built at Lineal, and why choosing the right approach leads to solutions that are faster, cheaper, and more transparent — while laying the groundwork for future AI enhancements.
Privilege Review: The Challenge
Privilege review is one of the most critical — and expensive — parts of eDiscovery. Our goal with Amplify™ PrivFinder is to help reviewers quickly identify potentially privileged communications so they can prioritize their review efforts.
At first glance, this is a classic binary classification problem: a document is either privileged or not. Today’s large language models could certainly take a shot at solving this problem. You could imagine sending the entire text of an email into a model and asking, “Is this privileged?” And in many cases, it might get the answer right.
But classifying documents this way would be slow, expensive, and ultimately brittle — the model would need to reason over a lot of irrelevant text, and we’d have little transparency into why it made its decisions. Instead of rushing to the newest tool in the toolbox, we stepped back and asked: what’s the right way to frame this problem?
Framing the Problem the Right Way
Privilege status is less about what was said and more about who was communicating. A purely document-level approach, even with sophisticated ML, risks being noisy — language in privileged documents can look extremely similar to non-privileged ones. Training a model to recognize privilege from content alone would require expensive labeling and might still fail to generalize well across matters.
Since much of eDiscovery involves email communication, we decided to focus here: who sent an email, who received it, and what roles those people play. This reframing shifts us from thinking about documents as standalone text to thinking about communications as interactions between actors — and it naturally leads to one of the most information-rich parts of an email: the signature block.
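To make this reframing concrete, here is a minimal Python sketch of the kind of structure it implies. The class and field names are our own illustration, not PrivFinder’s internal data model:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of the reframing: represent a communication as
# actors with roles rather than as a standalone block of text.
@dataclass
class Actor:
    name: str
    title: str = ""  # e.g. "Senior Legal Counsel", filled in later from the signature

@dataclass
class Communication:
    sender: Actor
    recipients: List[Actor] = field(default_factory=list)
    signature_lines: List[str] = field(default_factory=list)
```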
Why Signatures Are the Key
Identifying the actors isn’t as simple as extracting a name — we also need to understand the role that person plays. Knowing that “Jordan Smith” wrote an email is less helpful than knowing that Jordan Smith is “Senior Legal Counsel.” That’s what turns a random name into a strong privilege signal.
This is why signatures are so important: they are the part of the email where people explicitly tell you who they are and what they do. A well-structured signature block often contains a name, title, department, and company all in one place. In other words, it’s the most information-dense part of the email for determining an actor’s identity and role.
Step 1: Identifying Signature Blocks
The first step, then, is to reliably find those signatures. This step alone could be treated as a machine learning problem, and we did experiment with ML-based classifiers. What we found was striking: our carefully constructed heuristics performed just as well, with far less compute cost and far greater interpretability.
Our rules are intuitive: signatures usually appear at the end of emails, contain names and titles, and have shorter line lengths. To make these rules more robust, we also incorporate corpus-level statistics — a lesson drawn from the “statistical turn” we discussed in our first post. If a line of text appears identically across hundreds of emails, it’s probably a signature line. By blending heuristics with these statistical measures, we’ve made our rules-based system stronger and more generalizable.
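As a simplified illustration, a blend of these heuristics and corpus statistics might look like the sketch below. The keyword hints, thresholds, and function names are assumptions made for the example, not the actual rules in PrivFinder:

```python
import re
from collections import Counter

# Illustrative only: keyword hints, thresholds, and scoring are assumptions,
# not Lineal's production rules.
TITLE_HINTS = re.compile(r"\b(counsel|attorney|director|manager|paralegal|esq)\b", re.I)

def build_line_counts(email_bodies):
    """Corpus-level statistic: how often each exact line appears across all emails."""
    counts = Counter()
    for body in email_bodies:
        for line in body.splitlines():
            stripped = line.strip()
            if stripped:
                counts[stripped] += 1
    return counts

def find_signature_block(body, line_counts, repeat_threshold=100):
    """Return the trailing lines of an email that look like a signature block."""
    lines = [l.strip() for l in body.splitlines() if l.strip()]
    signature = []
    # Walk upward from the end of the email, the most likely signature region.
    for line in reversed(lines[-10:]):
        looks_short = len(line) < 60                            # signature lines tend to be short
        has_title = bool(TITLE_HINTS.search(line))              # mentions a role or title word
        repeats = line_counts.get(line, 0) >= repeat_threshold  # identical line recurs across the corpus
        if looks_short and (has_title or repeats):
            signature.insert(0, line)
        elif signature:
            break  # we have left the contiguous signature region
    return signature
```

Because every signal here is a readable rule, a reviewer or engineer can see exactly why a given line was or wasn’t flagged.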
Step 2: Finding Privileged Actors
Once we know where the signature is, we run a keyword search for job titles and legal terms that indicate privileged roles. Here, we intentionally keep the process transparent: the list of keywords is fully visible and editable by reviewers.
This transparency is a major advantage over opaque machine learning models. If a reviewer notices that a term like “legal counsel” should be added, they can simply add it — no retraining required. This not only builds trust but also allows users to continuously refine the system for their specific matter.
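As an illustration, the keyword pass over an extracted signature block can be as simple as the sketch below. The PRIVILEGE_TERMS list and helper names are hypothetical stand-ins for the reviewer-editable list described above:

```python
# Illustrative only: the real reviewer-managed list will differ by matter.
PRIVILEGE_TERMS = [
    "legal counsel",
    "general counsel",
    "attorney",
    "paralegal",
    "law department",
]

def find_privileged_actors(signature_lines, terms=PRIVILEGE_TERMS):
    """Return the signature lines that mention a privileged role or legal term."""
    hits = []
    for line in signature_lines:
        lowered = line.lower()
        if any(term in lowered for term in terms):
            hits.append(line)
    return hits

# A reviewer spots a missing term and adds it; the change applies on the
# very next pass, with no model retraining.
PRIVILEGE_TERMS.append("outside counsel")
```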
Step 3: Producing More Accurate Results
The combination of these steps dramatically improves accuracy. By narrowing our search to signature blocks rather than scanning the entire email body, we avoid irrelevant hits and cut down substantially on false positives.
For reviewers, this means something tangible: more accurate results, fewer wasted clicks, and less time spent filtering out irrelevant documents. Instead of wading through false positives, they spend their time on what actually matters — the communications most likely to require legal review.
Why This Matters
This approach embodies the philosophy we outlined in our first post:
- Decompose the problem into clear, solvable subproblems.
- Apply the simplest effective technique to each subproblem.
- Use statistical insights to make rule-based methods stronger, not obsolete.
The result is a system that is scalable, explainable, and cost-effective — one that performs as well as more computationally expensive machine learning solutions without their downsides.
Laying the Groundwork for the Future
Just because we didn’t start with LLMs doesn’t mean we aren’t thinking about them. In fact, this approach sets us up to use them more effectively. A signature-finding system gives us a cleaner signal to feed into more advanced models. Instead of throwing an entire noisy email at a large language model, we can target just the key parts — signatures, sender metadata, or specific communication clusters — making future applications of small or large language models faster, cheaper, and more accurate.
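As a rough sketch of what that might look like, the snippet below assembles only the high-signal pieces of an email into a compact model input. The field names and prompt wording are assumptions for illustration, not a production prompt:

```python
def build_model_input(sender, recipients, signature_lines):
    """Assemble a compact, high-signal model input instead of the full email body.
    Field names and prompt wording here are illustrative assumptions."""
    signature = "\n".join(signature_lines)
    prompt = (
        f"Sender: {sender}\n"
        f"Recipients: {', '.join(recipients)}\n"
        f"Signature block:\n{signature}\n\n"
        "Based only on the actors above, is this communication likely to "
        "involve a privileged legal role?"
    )
    return prompt
```

A focused input like this is far smaller than a raw email thread, which is what makes downstream language models cheaper to run and easier to evaluate.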
Takeaway
The lesson here is clear: newer isn’t always better — but thoughtful use of simpler techniques can set the stage for next-generation solutions. By combining heuristics, corpus-level statistics, and user-editable keyword search, we’ve built a tool that delivers accurate, transparent, and scalable privilege review today, while opening the door to exciting GenAI-powered features tomorrow.
Missed Part 1? Read “A Short History of AI: Why LLMs Aren’t the Whole Story” to see how the evolution from rules to GenAI shaped our thinking — and why the smartest tools aren’t always the newest.
_
About the Author
Matthew Heston is Lead Data Scientist at Lineal, where he leads the design and implementation of AI, machine learning, and scalable data systems that transform how legal teams work with complex information. He received a PhD in Technology and Social Behavior from Northwestern University.
_
About Lineal
Lineal is an innovative eDiscovery and legal technology solutions company that empowers law firms and corporations with modern data management and review strategies. Established in 2009, Lineal specializes in comprehensive eDiscovery services, leveraging its proprietary technology suite, Amplify™, to enhance efficiency and accuracy in handling large volumes of electronic data. With a global presence and a team of experienced professionals, Lineal is dedicated to delivering custom-tailored solutions that drive optimal legal outcomes for its clients. For more information, visit lineal.com.