Demystifying Analytics in eDiscovery
The facts you need to succeed and thrive in an electronic world
By Steven Toole, Vice President of Marketing at Content Analyst / 2014
A recent eDiscovery Journal blog entry by Greg Buckles pointed out that few companies provide the analytics technology used in early case assessment (ECA) and review platforms, and that specific analytics capabilities can vary wildly from one platform to the next. While this may not be breaking news to those closest to analytics in eDiscovery, it did reveal some mystery about how the market uses analytics in the early stages of eDiscovery. It also raised questions about how analytics fit into the eDiscovery workflow and, ultimately, about the return on investment that analytics deliver for eDiscovery and information governance in general.
This overview is designed to dispel the mystery surrounding analytics in eDiscovery and information governance, and to provide insight into the return on investment analytics can deliver for those who embrace these capabilities. Corporate counsel who get ahead of the curve today with forward-thinking strategies such as these will be the ultimate heroes and beneficiaries of eDiscovery analytics, leading their field with a much more proactive and cost-effective approach to information governance and legal technology.
The Analytics Land Grab
While there are plenty of stakes in the ground across the eDiscovery landscape, law firms and service providers are looking “to the West” for unclaimed territory. The gigabyte gold rush is all about applying analytics to the data further upstream in order to “own” the data long before it’s needed in a matter. ECA was the first frontier law firms and service providers looked toward in order to move upstream, but the greenest pastures are still further upstream, in what some call “pre-discovery.” The Golden Rule in eDiscovery is simple: Those who “rule” the content get the gold. Translation: the vendor that can apply analytics earliest – before the data is needed in a matter – provides the most value to corporations, and therefore holds a great competitive advantage.
Recipe for Success
This all sounds good, but what does this really mean? You can’t bake a cake until you know what the ingredients are, what they do, and how they can affect the output. And since each eDiscovery solution has a different set of analytics capabilities, here’s a brief tutorial on the key ingredients of analytics for eDiscovery and information governance.
Dynamic Clustering – This is a good place to start, especially if you know little to nothing about the content. Clustering “buckets” the content (documents, emails, etc.) into natural groupings of conceptually related materials. One major benefit of clustering is that it provides a very fast map of the document landscape in a highly objective, consistent, concept-aware fashion. As a result, the reviewer can jump straight to the clusters of most interest (those that are conceptually relevant), and avoid spending time in clusters of no conceptual relevance. In terms of information governance, it can also help identify and weed out the ROT (redundant, obsolete and trivial) content quickly and easily, thus reducing costs and helping to increase ROI.
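To make the “bucketing” idea concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular eDiscovery product works: it compares documents by plain word overlap (cosine similarity on word counts) rather than by concept, and the function names, sample documents and 0.3 threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: join the closest cluster or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [doc indices]}
    for i, doc in enumerate(docs):
        vec = vectorize(doc)
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["members"].append(i)
            best["centroid"].update(vec)  # running sum serves as an unnormalized centroid
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return [c["members"] for c in clusters]


# Hypothetical sample documents, invented for illustration only.
docs = [
    "pipeline safety inspection report for the gas pipeline",
    "gas pipeline inspection schedule and safety checklist",
    "holiday party catering menu and venue",
    "catering order for the holiday party venue",
]
print(cluster(docs))  # groups of document indices
```

A reviewer interested in the pipeline matter could open the first group and skip the second entirely, which is the time saving described above.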
Term Expansion – You have a keyword, name, technical term, abbreviation, or acronym, and you want a list of all similar or highly related terms so you can expand your search to documents containing those terms as well. Term expansion identifies conceptually related terms, customized to your content, and ranked in order of relevance. For example, Barack Obama might produce a list such as President Obama, Commander-in-Chief, Senator Obama, Michelle Obama, the Oval Office, Office of the President, POTUS, etc. In a matter, that means finding more conceptually related content faster and more easily, saving time and money. In information governance, it helps identify content related to corporate records, intellectual property, and compliance, as well as, of course, more ROT for defensible deletion.
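One simple way to picture term expansion is to rank terms by how often they appear in the same documents as the seed term. The sketch below does exactly that with Jaccard overlap of document sets. It is a simplified stand-in: the ranking method, the function names and the sample documents are assumptions for illustration, and a commercial engine would draw related terms from its own concept model instead.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each term to the set of document indices that contain it."""
    index = defaultdict(set)
    for i, doc in enumerate(docs):
        for term in set(re.findall(r"[a-z]+", doc.lower())):
            index[term].add(i)
    return index


def expand(seed, index, top_n=3):
    """Rank terms by Jaccard overlap between their document sets and the seed's."""
    seed_docs = index.get(seed, set())
    scores = {}
    for term, term_docs in index.items():
        if term == seed:
            continue
        shared = len(seed_docs & term_docs)
        if shared:
            scores[term] = shared / len(seed_docs | term_docs)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Hypothetical sample documents, invented for illustration only.
docs = [
    "potus spoke from the oval office",
    "potus returned to the oval office",
    "the quarterly budget was approved",
    "the budget committee met today",
]
index = build_index(docs)
print(expand("potus", index, top_n=2))
```

Because the ranking is built from the collection itself, the suggestions are “customized to your content”: a different set of documents would yield a different list for the same seed term.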
Conceptual Search – You’ve identified a key document or paragraph, and now you want to find similar ones. Keyword search returns only documents that contain the specific keywords you chose, and only as well as you can write the Boolean search string. Writing Boolean search strings can be time-consuming, and the results may still miss key documents containing the “unknown” terms not included in your search string. To find the documents you’d otherwise miss with keyword searches, you’ll need to use conceptual search. Applying mathematical algorithms to your example document or text selection, conceptual search looks for matching patterns in the “map” of the data called the conceptual space. The benefit is that conceptual search can find similar results even if the matching document doesn’t contain any of the same terms as the example text. Think abbreviations, misspellings, acronyms, code words, and related terms you hadn’t heard of. It also cuts the false positives that keyword search produces from polysemes, words with more than one meaning. Translation: uncover highly relevant yet latent relationships, saving time and costs.
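The sketch below shows, in a deliberately simplified form, how a search can match a document that shares no words with the query. Each term is represented by the documents it occurs in, and any text is placed in that space by summing the vectors of its terms. This co-occurrence space is an assumption chosen for brevity; production engines typically build the conceptual space with heavier techniques such as latent semantic indexing. The stopword list, function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "in", "of", "to", "and", "on", "for"}


def tokens(text):
    """Lowercase words with a few common stopwords removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def build_term_vectors(docs):
    """Represent each term by the documents it occurs in (its context)."""
    vectors = defaultdict(Counter)
    for i, doc in enumerate(docs):
        for term in tokens(doc):
            vectors[term][i] += 1
    return vectors


def embed(text, term_vectors):
    """Place a text in the space by summing the vectors of its known terms."""
    total = Counter()
    for term in tokens(text):
        if term in term_vectors:
            total.update(term_vectors[term])
    return total


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def search(example, docs, term_vectors):
    """Return (score, doc index) pairs ranked by similarity to the example text."""
    query = embed(example, term_vectors)
    scored = [(cosine(query, embed(doc, term_vectors)), i) for i, doc in enumerate(docs)]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical sample documents, invented for illustration only.
docs = [
    "memo: potus, the president, will sign the bill",
    "president scheduled a press conference",
    "cafeteria lunch menu updated",
]
term_vectors = build_term_vectors(docs)
for score, i in search("potus", docs, term_vectors):
    print(round(score, 3), docs[i])
```

The second document never mentions “potus”, yet it still receives a score above zero because “president” appears alongside “potus” elsewhere in the collection. The cafeteria document scores zero. A keyword search for “potus” would have returned only the first document.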
Auto-Categorization – Predictive coding is one area that’s gained a lot of attention in eDiscovery over the past two years. Auto-categorization is what makes predictive coding possible. Predictive coding applies machine learning to a corpus of documents to intelligently categorize them in any number of ways, such as privileged, responsive, or nonresponsive. For example, users can categorize documents as responsive, then categorize the responsive documents even further into relevant issue sub-categories. Auto-categorization uses the same conceptual space and sample document exemplars to find conceptually similar documents and label them as appropriate. Again, the big benefit here is a tremendous savings in time (and cost) by letting the technology bring the most relevant documents to the forefront, and into the hands of the domain experts, as quickly and easily as possible.
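The exemplar idea can be sketched as a nearest-exemplar classifier: each new document takes the label of the most similar sample document, or no label at all if nothing is close enough. This is a minimal illustration, assuming plain word-count similarity in place of a conceptual space; the exemplar texts, the `min_score` cutoff and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Turn text into a bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def categorize(doc, exemplars, min_score=0.25):
    """Return (label, score) of the closest exemplar, or (None, score) if too weak."""
    vec = vectorize(doc)
    label, score = max(
        ((lbl, cosine(vec, vectorize(text))) for lbl, text in exemplars),
        key=lambda pair: pair[1],
    )
    return (label, score) if score >= min_score else (None, score)


# Hypothetical exemplars a domain expert might have coded by hand.
exemplars = [
    ("privileged", "attorney client advice regarding the pending litigation"),
    ("responsive", "pricing agreement with the distributor for the northeast region"),
]

for doc in [
    "legal advice from our attorney on the litigation hold",
    "revised distributor pricing for the region",
    "lunch order",
]:
    label, score = categorize(doc, exemplars)
    print(label, round(score, 2), "|", doc)
```

Documents that fall below the cutoff stay uncategorized, which leaves the borderline calls with the human experts rather than forcing a label.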
Email Threading – The concept of email threading is fairly simple – find the subset of emails at the end of each branch of a conversation thread. Rather than requiring a reviewer to read 30 emails back and forth – as well as sideways among forwarded branches of the conversation – email threading finds the subset of emails, perhaps six, that include all of the previous replies (called “inclusive” emails because together they contain the whole history). Time and cost savings of using email threading are self-evident, but it’s also important to note that threading reveals exactly who knew what, when – pretty critical in piecing together the course of events that unfolded surrounding a matter.
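Finding the end of each branch can be sketched as follows. This is a minimal illustration under two assumptions: every email records which message it replied to or forwarded, and every reply quotes the full history above it. Real threading tools also have to compare message content, since quoted text can be edited or trimmed. The `Email` fields and sample thread are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Email:
    msg_id: str
    parent_id: Optional[str]  # None for the message that started the thread
    sender: str


def inclusive_emails(emails):
    """Return emails that nobody replied to or forwarded: the end of each branch."""
    parents = {e.parent_id for e in emails if e.parent_id is not None}
    return [e for e in emails if e.msg_id not in parents]


def history(email, emails):
    """Walk parent links back to the root to list the chain an email carries."""
    by_id = {e.msg_id: e for e in emails}
    chain = []
    current = email
    while current is not None:
        chain.append(current.msg_id)
        current = by_id.get(current.parent_id)
    return list(reversed(chain))


# Hypothetical thread: m1 starts it, then the conversation splits into two branches.
thread = [
    Email("m1", None, "alice"),
    Email("m2", "m1", "bob"),
    Email("m3", "m2", "alice"),
    Email("m4", "m1", "carol"),  # forwarded branch
    Email("m5", "m4", "dave"),
]

for email in inclusive_emails(thread):
    print(email.msg_id, "carries", history(email, thread))
```

Here a reviewer reads two emails instead of five, and the chain behind each one shows who had seen which message by that point.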
Near-Duplicate Identification – Near-duplicate identification offers a benefit similar to that of email threading. While conceptual search, clustering or categorization can identify documents that are relevant to the case, many could be various versions of the same document. Knowing that they’re near duplicates of each other can save the time of having to review each one. If it’s important to know what changed from one version to the next, when, and by whom, difference highlighting shows these changes, again saving time and reducing cost. Batching near duplicates together from the outset of a matter also provides reviewers with a more focused set of documents.
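A common textbook approach to spotting near duplicates is to break each document into overlapping word sequences (“shingles”) and measure how many the two documents share. The sketch below uses that approach, plus Python’s standard `difflib` module for a rough form of difference highlighting. It is an illustration only: the shingle size, function names and sample contract sentences are assumptions, and nothing here describes a specific vendor’s method.

```python
import difflib
import re


def shingles(text, k=3):
    """Set of overlapping k-word sequences in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def similarity(a, b, k=3):
    """Jaccard similarity of the two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)


def changed_words(a, b):
    """Words removed from `a` ('- ') and added in `b` ('+ '), via a word-level diff."""
    diff = difflib.ndiff(a.split(), b.split())
    return [d for d in diff if d.startswith(("- ", "+ "))]


# Hypothetical sample texts, invented for illustration only.
v1 = "The supplier shall deliver the goods within 30 days of the order date"
v2 = "The supplier shall deliver the goods within 45 days of the order date"
other = "Quarterly sales figures exceeded the forecast for the region"

print(round(similarity(v1, v2), 2))     # high: two versions of one clause
print(round(similarity(v1, other), 2))  # no shared shingles
print(changed_words(v1, v2))            # what changed between the versions
```

Documents scoring above a chosen cutoff can be batched together so that a reviewer reads one version in full and only the differences in the rest.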
Putting Your Data on a Diet
Putting these analytics capabilities to work for you may cause serious weight loss in your data. In eDiscovery, that means fewer documents for expensive reviewers to review. It also means that the documents they do review are the ones most conceptually relevant to the case. Moreover, reviewers are presented with documents that are batched by content rather than haphazardly, allowing for a more focused review that raises accuracy and further lowers cost and time.
But Wait – There’s More!
Remember the gigabyte gold rush from above? The Golden Rule of data? This is where analytics really are the key to unlocking all those hidden insights in a company’s data – long before they’re needed in eDiscovery. Applying text analytics to a company’s electronic records proactively through pre-discovery means that data is already organized, reduced, and ready to be presented if and when a matter arises. Corporate counsel love this idea because it keeps litigation costs as low as possible, decreases the crucial time it takes to investigate or decide whether to settle a case, and helps them present their side in the very best possible light.
The Bottom Line: ROI for eDiscovery Analytics
Measuring the ROI of using analytics in eDiscovery comes down to this: Review is the greatest cost factor in eDiscovery. Expert reviewers don’t come cheap – their expertise is clearly of utmost value in a case. But if a large percentage of their time is spent reviewing documents not relevant to a case, then you’re not getting the most value out of them in the first place. Their hours cost the same whether they’re looking at the smoking gun document in a case or something completely unrelated. You wouldn’t wear gloves during a palm reading – they’d just get in the way of the psychic doing her job. Presenting a corpus of documents to expensive reviewers without applying machine learning first makes no more sense. Further, missing documents that analytics would have surfaced can hinder the experts’ ability to formulate your case strategy.
But the ROI of applying analytics to your documents and email pre-discovery goes even further. The cost benefits of organizing and analyzing your content proactively are huge, helping to drive decision making and information governance practices for compliance, risk mitigation and cost avoidance. Applying these strategies early can provide a tremendous advantage long term through a much more proactive and cost-effective approach to information governance and eDiscovery.