By Tyler Wells Lynch
The Institute for Experiential AI welcomed Dave Lewis, Ph.D., Executive Vice President for AI Research, Development, and Ethics at Reveal, to discuss advances in AI-assisted legal discovery, including prospects for other industries given the 20-year history of machine learning in law. The lecture is part of EAI’s Distinguished Lecturer Series.
“Electronic discovery” refers to the legal process of gathering evidence in the form of electronic documents, including emails, scanned papers, and digital archives. While the technology presents an interesting example of human-AI teaming, there are open questions regarding ethics, cost-efficiency, and research trajectories. Dave Lewis offered some of his thoughts on these issues in a question-and-answer session following his lecture with EAI. To learn more, you can read a written recap of the lecture or watch the video replay.
Q: Are there any studies on labeling errors in e-discovery?
Dave Lewis: There have definitely been studies that show that you get large amounts of disagreement about what’s a responsive document, and that’s unsurprising. People don’t tend to worry about that very much in the sense that, at least intuitively, the belief is that the disagreements are on marginally relevant documents.
If you have a really important “smoking gun” document, everybody’s going to agree it’s a smoking gun document — assuming that they don’t miss it! Maybe one reviewer spotted that and one reviewer didn’t, but the usual assumption is that, if the appropriate material in a document was called to attention, everyone would agree that it was relevant.
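Reviewer disagreement of this kind is commonly quantified with chance-corrected agreement statistics. As a hypothetical illustration (mine, not from the lecture), here is Cohen's kappa computed over two reviewers' binary responsiveness labels on the same documents:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers' labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each reviewer labeled at their own base rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum((counts_a[c] / n) * (counts_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Two reviewers labeling the same 10 documents (1 = responsive).
a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
print(round(cohens_kappa(a, b), 2))  # prints 0.6
```

A kappa of 1.0 would mean perfect agreement; values in the 0.4 to 0.6 range, common in relevance-labeling studies, reflect exactly the marginal-document disagreement described above.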
This does implicitly have a big impact, because it calls into question the whole use of recall, for instance. People compute recall over everything that's defined to be relevant, because there are too few smoking guns to support any kind of statistical measure; you can't statistically characterize finding a small number of unique documents. But that means that if you had two different people labeling the same random sample, one of them might come back saying you have 57 percent recall and the other that you have 89 percent recall. That's not at all impossible. The place where this has particularly affected evaluation is in one-phase workflows, where a human being iterates with the machine until they find what they think is most of the relevant documents. There, you don't have a classifier to evaluate; what you're really evaluating is the human being. Evaluating one human being against another becomes very dicey, and that has actually led to the use of elusion, where you never evaluate the review decisions that were made; instead you sample the documents that got skipped, so you never actually look at how good your review decisions were.
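To make the recall-variance point concrete, here is a hypothetical sketch (the numbers and reviewer labels are invented for illustration): recall is estimated from a labeled random sample as the fraction of a reviewer's relevant documents that the production captured, so two reviewers who disagree on marginal documents yield different estimates from the same sample.

```python
def estimated_recall(sample, produced_ids, reviewer_labels):
    """Of the sample documents this reviewer calls relevant,
    what fraction did the production capture?"""
    relevant = [doc for doc in sample if reviewer_labels[doc]]
    if not relevant:
        return None
    found = sum(doc in produced_ids for doc in relevant)
    return found / len(relevant)

# A 12-document random sample, a produced set, and two reviewers
# who agree on the clear documents but split on the marginal ones.
sample = list(range(12))
produced = {0, 1, 2, 3, 4, 5}
reviewer_1 = {0: 1, 1: 1, 2: 1, 3: 1, 4: 0, 5: 0,
              6: 1, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0}
reviewer_2 = {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 1,
              6: 1, 7: 1, 8: 1, 9: 0, 10: 0, 11: 0}
print(estimated_recall(sample, produced, reviewer_1))  # 4 of 5 relevant found: 0.8
print(estimated_recall(sample, produced, reviewer_2))  # 6 of 9 relevant found: ~0.67
```

Same sample, same production, two defensible recall numbers, which is exactly why the choice of reviewer matters so much in these evaluations.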
Q: What are the open research issues where you think academia can help the discovery industry?
DL: My favorite issue is the interaction of evaluation and active learning. There are a bunch of unsolved questions in evaluation, and one of the things that comes up is sequential bias. Let's say you're doing an iterative workflow and deciding after each batch whether to stop. The fact that you're conditioning your stopping decision on a statistic from a random sample introduces sequential bias into the estimate from that sample. There are several problems there that have not been solved. I would say, though, that the sexy problem is active learning with transformers. Fine-tuning transformers is already a pretty tricky process, and if you're doing it in an active learning context: one, it's computationally expensive; two, you're now choosing data by some sort of leverage measure, and there's not a lot of guidance on how you tune for that when your transformer was trained on 500 million web documents and you have 200 email messages. There's just a bunch of experimental work to do to sort out the tuning issues and to make it fast enough to run without needing a closet full of GPUs at your legal service provider.
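The iterative workflow described here is typically driven by uncertainty sampling, the technique Lewis introduced with Gale. A toy sketch of the loop (the keyword scorer and documents are my own stand-ins for a real TAR classifier and review set):

```python
def train_keyword_scorer(labeled):
    """Toy 'model': relevance score is the fraction of a document's words
    that appeared in relevant training documents. A stand-in for a real
    classifier that gets retrained each round."""
    relevant_words = set()
    for text, label in labeled:
        if label:
            relevant_words |= set(text.split())
    def score(text):
        words = set(text.split())
        return len(words & relevant_words) / max(len(words), 1)
    return score

def uncertainty_sample(pool, score, batch_size):
    """Pick the documents whose scores are closest to 0.5, i.e. the ones
    the current model is least sure about."""
    return sorted(pool, key=lambda d: abs(score(d) - 0.5))[:batch_size]

# Iterative TAR loop: retrain, then ask a (simulated) reviewer to
# label the most uncertain remaining documents.
pool = ["merger pricing memo", "pricing fix agreement", "lunch menu",
        "quarterly pricing report", "holiday party invite"]
labeled = [("pricing fix agreement", 1), ("lunch menu", 0)]
for _ in range(2):
    score = train_keyword_scorer(labeled)
    unseen = [d for d in pool if d not in {t for t, _ in labeled}]
    for doc in uncertainty_sample(unseen, score, 1):
        labeled.append((doc, 1 if "pricing" in doc else 0))  # simulated reviewer
```

The expensive step this elides is `train_keyword_scorer`: with a fine-tuned transformer in that slot, every round of the loop is a fine-tuning run, which is the computational problem described above.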
Q: Is it necessary to add the capability to unlearn past classifications to accommodate new rulings or court decisions? Do we need to forget part of the past because of the future?
DL: That itself doesn’t tend to be an issue, because it’s important to recognize that these classifiers have a very short lifetime. If you’re training for responsiveness in a particular legal case, when the discovery portion of that case is done, you don’t use that classifier anymore. If you’re, say, a pharmaceutical company, you could easily have 200 lawsuits on very closely related issues. There is, then, an interest in training models for a particular topic, which would be reusable across multiple cases, but even there it’s not new legal rulings that are affecting that; it would be a new conception of the case, or it would be some new criteria you’ve negotiated with opposing parties.
I would say probably the stronger effect here is just from the beginning of the project to the end. As you’re labeling documents and as the case progresses, you’re learning about the case, and that may itself change your definition of what a relevant document is. It’s not unusual for somebody to do an iterative active learning process and then go back and double-check the earlier decisions, or the ones that disagree with the classifier. It’s not because anything in the world changed but because your conception of the case changed from your own experience.
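The double-checking step mentioned here can be sketched as a simple disagreement filter; the function and document names are hypothetical, invented for illustration:

```python
def flag_for_rereview(labels, predict):
    """QC pass: surface documents where the reviewer's earlier label
    disagrees with the current model, so a human takes a second look."""
    return [doc for doc, label in labels.items() if predict(doc) != label]

# Earlier human labels versus the current model's predictions.
labels = {"memo_a": 1, "memo_b": 0, "memo_c": 1}
model = {"memo_a": 1, "memo_b": 1, "memo_c": 0}.get
print(flag_for_rereview(labels, model))  # prints ['memo_b', 'memo_c']
```

Only the disagreements go back to a reviewer, which keeps the re-review cost proportional to how much the conception of the case has drifted.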
Q: For discovery involving audio, do we need more Natural Language Processing and Machine Learning than Information Retrieval and text-mining techniques?
DL: Currently, audio is a pretty small proportion of the data. You’ll occasionally get cases where they’ve collected voicemail. I suppose you could imagine that, as things go forward, you’ll have more video with audio components, but at least right now audio is not a big issue. If I were raising that kind of point, what I would point to is machine translation (MT). One of the biggest issues in large cases involving international firms is that you’ve got data in multiple languages. Maybe only 1,000 documents out of a million are in Chinese, but you can’t ignore those; there might be something important there. So there’s increasing interest in MT. MT’s not perfect, but it’s gotten a heck of a lot better in the past 10 years.
You can have a very interesting discussion when you’ve got a multilingual collection about what the most cost-effective and efficient way is to deal with that. Do you break it up? Do you train it together? Do you use MT?
Q: What role does the user interface play in this active learning system? Have you seen any interesting advances in this area?
DL: It’s a huge issue. One of the reasons I went to Brainspace was that they had a real commitment to UI design and to co-designing active learning methods and the UI. That said, it is an underexplored area, particularly in the academic research community. To the extent that there’s academic research on TAR, it tends to focus on machine learning, active learning, statistical evaluation, and things like that. There’s been very little UI work, but there are fascinating questions there. For example, with email threads, in different legal circumstances the entire thread might be considered relevant if it contains one relevant message, or it might not. You’ve got contextual information: you may have a 500-message email thread, and you really need to see messages 47, 53, and 60 at the same time to know that it’s relevant. You’ve also got issues with duplication and near-duplication, where lots of copies of documents are minor variants of each other. We could use better interfaces for managing that more effectively and understanding when the differences are relevant.
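Near-duplicate detection of the kind mentioned here is often done by comparing documents as sets of word shingles under Jaccard similarity; a minimal sketch, with documents invented for illustration:

```python
def shingles(text, k=3):
    """k-word shingles: overlapping word windows used as a document sketch."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Set overlap in [0, 1]; near-duplicates score close to 1."""
    return len(a & b) / len(a | b)

draft_1 = "please review the attached pricing memo before the call today"
draft_2 = "please review the attached pricing memo before the call today thanks"
unrelated = "the holiday party starts at six in the main lobby"

print(round(jaccard(shingles(draft_1), shingles(draft_2)), 2))   # prints 0.89
print(round(jaccard(shingles(draft_1), shingles(unrelated)), 2))  # prints 0.0
```

A review interface can cluster documents above some similarity threshold and show only the differences between variants, rather than making a reviewer reread each copy.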
About Dave Lewis:
Dave Lewis, Ph.D., is executive vice president for AI research, development, and ethics at Reveal. Over his career, he has variously been a corporate researcher, startup development team leader, research professor, freelance consultant, expert witness, and software company co-founder. In 2006, he was elected a Fellow of the American Association for the Advancement of Science for foundational work in text analytics. In 2017, he received an ACM SIGIR Test of Time Award for his paper with Gale introducing uncertainty sampling.