Why Are Lawyers Afraid of TAR?

How I’ll Spend My Summer with ACEDS

ACEDS has been busy all summer keeping us trained with a series of webinars on breaking eDiscovery topics and deep dives. Many experts have been contributing their time and knowledge in their areas of specialty. It’s been a great turnout so far, so I thought I would focus on what’s to come for the rest of the summer. Print this out as a handy reference so you don’t miss one of the upcoming classes that can help you grow, learn and earn valuable CEDS credits.

Wednesday, July 25th, 2018 – 1-2 PM ET – Demystifying EMRs and eDiscovery: A Day in the Life of…

Tiffany McLauchlin-Kingsley, CEDS, a Certified eDiscovery Specialist with over 15 years of experience in the IT industry, including over 8 years of sales consulting in the Patient Privacy Monitoring, eDiscovery and Information Governance space, will be talking about electronic medical records and eDiscovery. Tiffany is certified in forensics, HIPAA security, HIPAA privacy, Digital Forensics and Electronic Evidence and Cerner Millennium Fundamentals. Register for the webinar here.

In today’s healthcare world, Legal, eDiscovery, Privacy and Compliance professionals are burdened with the painstaking process of building a narrative from EMR (electronic medical record) access reports in preparation for litigation. The entire process can be daunting: collecting the reports from your HIM, Data Security or IT team(s), reformatting the thousands of lines in a CSV or Excel document, and then making sense of the data to tell Counsel exactly what did or did not happen. The limited clinical context included in EMR access reports leaves major gaps unexplained, and those gaps call the integrity of the data into question with the Judge or opposing Counsel. As a result, an exorbitant amount of money could be paid out on cases where it could not be proven, beyond a reasonable doubt, that an access did or didn’t happen for a valid reason affecting a patient’s outcome.

Learn what it’s like from the seat of a Certified eDiscovery Professional who has lived through this process in a large healthcare organization with over 80,000 employees and a caseload of over 1,500 concurrent cases. This discussion is relevant to a number of case types, including Medical Malpractice, Wrongful Death, Wrongful Termination, Harassment, Employees as Patients and Identity Theft. Learn how you can change the landscape with the latest patented, machine-learning technology in the market today.

Thursday, July 26th, 2018 – 1-2 PM ET – Artificial Intelligence and Machine Learning: The Future of eDiscovery

Amie Taal, consultant with the UK’s Strategem Tech Solutions and former VP of Deutsche Bank is an internationally regarded expert in Forensic Investigations, Cyber Security, eDiscovery, Data Analytics and Artificial Intelligence and will be sharing her expertise with us. Register for the webinar here: Artificial Intelligence and Machine Learning: The Future of eDiscovery


This webinar will cover:

  • What is AI and Machine Learning?
  • The benefits and risks
  • How AI and Machine Learning will change the practice of eDiscovery in the near future.

Tuesday, August 7th, 2018 – 1-2 PM ET – School’s out – Let’s stay connected – the Internet of Things

David Greetham, expert witness, patent holder and Vice President of eDiscovery Sales and Operations at Ricoh USA, Inc., will discuss the Internet of Things (IoT). David is responsible for driving Ricoh’s computer forensic and electronic discovery services strategies and growth in the U.S. He joined Ricoh from HSSK Forensics, which Ricoh acquired in February 2012 and renamed Ricoh Forensics. Greetham oversees that division, which operates as the nation’s first private computer forensics lab accredited by the American Society of Crime Laboratory Directors/Laboratory Accreditation Board (ASCLD/LAB).

Throughout his career, Greetham has been retained by law firms and corporations as a Testifying and Consulting Expert in the area of Computer Forensics and is a Certified Forensic Litigation Consultant through the Forensic Expert Witness Association (FEWA). Additionally, he has been responsible for forensic examinations of computer systems involved in theft of trade secrets, internet misuse, harassment, murder, fraud and alleged spoliation.

This session will review the challenges of collecting and processing ESI from connected devices (The Internet of Things) and how traditional methods won’t work. Learn more and register for the webinar here: School’s out – Let’s stay connected – the IoT

Wednesday, August 8th, 2018 – 1-2 PM ET – Importance of Good Cyber Hygiene

Scott Schober, cybersecurity expert, author of Hacked Again, and President and CEO of Berkeley Varitronics Systems (BVS), a 45-year-old, New Jersey-based, privately held provider of advanced, world-class wireless test and security solutions, takes the audience through best security practices without overwhelming technical minutiae. Scott is a highly sought-after author and expert for live security events, media appearances and commentary on the topics of ransomware, wireless threats, drone surveillance and hacking, cybersecurity for consumers and small business, and emerging blockchain technologies. He is often seen on ABC News, Bloomberg TV, Al Jazeera America, CBS This Morning News, CGTN America, CNN, Fox Business and many more networks. His security advice can be heard on national radio networks including NPR, Sirius XM and Bloomberg Radio. Scott regularly presents at tech and security conferences, discussing wireless technology and its role in breaches along with his vision for best practices to stay safe in the future. Scott is an advisor to BlockSafe Technologies.

Scott prepares small business owners, employees and general consumers for any cyberthreat by detailing simple but effective strategies including password maintenance, phishing awareness, ransomware defenses, social engineering tricks and much more. Learn more and register for the webinar here: Importance of Good Cyber Hygiene.

Thursday, August 9th, 2018 – 1-2 PM ET – Ethics & Social Media: Walking the Fine Line

Legal ethics in advertising have long been an intrinsic part of the legal environment.  It has always been clear that advertisements by attorneys cannot contain a material misrepresentation of fact or law and cannot be false, misleading or deceptive.  But with the onslaught of social media, the landscape has changed and the risks have increased.

How do attorneys address this new medium, particularly when trying to market their practice?  What is the line between business and personal use of social media?  How can you keep your personal Facebook page ethically sound? What rules must be met when a lawyer has his own blog? What can you really put on your website?

Tom O’Connor, author of eDiscovery for the Rest of Us, an extensive article written with the smaller firm in mind, and Electronic Discovery for Small Cases, published by the ABA, will be your host along with yours truly. We’ll discuss the minefield of ethical concerns across the marketing landscape that you need to be aware of, and technically competent to address. Register for the webinar here: Ethics & Social Media: Walking the Fine Line.


Wednesday, August 15th, 2018 – 1-2 PM ET – Managing the Minutiae of Document Productions

Kelly Griffith will discuss the resources needed to practice more efficiently, improve client service, and add more value. Kelly is a Senior Legal Editor responsible for eDiscovery resources available through Thomson Reuters’ Practical Law service. Practical Law has become a leader in providing comprehensive and concise resources to attorneys in all areas of legal practice, and Kelly focuses her time on developing and maintaining resources related to the handling of electronically-stored information.

Prior to joining Practical Law, Kelly spent 10 years as a general civil litigator and two years as eDiscovery Counsel and Director of Litigation Support for a regional law firm. In those roles, Kelly managed eDiscovery in a variety of cases, from small state court actions to large, multi-state federal class actions. Learn more and register for the webinar here: Managing the Minutiae of Document Productions.

Tuesday, August 28th, 2018 – 1-2 PM ET – Negotiating a Forensic Examination Protocol

During this webinar, attorney, forensics expert and special master Craig Ball will talk about the goals and needed protections attendant to a computer forensic examination and some of the crucial terms and conditions that should be resolved by a thoughtful examination protocol. Craig is a special master, computer forensic examiner, law professor and noted authority on electronic evidence and eDiscovery. He limits his practice to serving as a court-appointed special master and consultant in computer forensics and electronic discovery and has served as the Special Master or testifying expert in computer forensics and electronic discovery in some of the most challenging and celebrated cases in the U.S. A founder of the Georgetown University Law Center E-Discovery Training Academy, Craig serves on the Academy’s faculty and teaches Electronic Discovery and Digital Evidence at the University of Texas School of Law. For nine years, Craig has penned the award-winning Ball in Your Court column on electronic discovery. Register for the webinar here: Negotiating a Forensic Examination Protocol.

Wednesday, August 29, 2018 – 1-2 PM ET – What Can Happen when ESI is Lost and How to Minimize or Avoid Sanctions should ESI be Lost

Join Ronald J. Hedges, J.D., Senior Counsel with Dentons US LLP, and Mary Mack, ACEDS Executive Director, as they cover the following topics:

  • The 2015 amendments: An overview
  • Cooperation and proportionality
  • The duty to preserve and the litigation hold
  • Spoliation of physical evidence and spoliation of ESI under amended Rule 37(e)

Ronald J. Hedges, J.D. served as a United States Magistrate Judge in the District of New Jersey from 1986 to 2017. He is the chair of the Advisory Board of Digital Discovery & e-Evidence, a Bloomberg BNA publication, and is the principal author of the just-released third edition of Managing Discovery of Electronic Information: A Pocket Guide for Judges (Federal Judicial Center: 2017).


Mary Mack is a long-time industry expert with over 25 years of experience and leadership to her credit. Under her leadership, ACEDS furthers its commitment to building an international community of eDiscovery practitioners for the exchange of training, certification, professional development and networking. Mack is known for her strength in relationship and community building, as well as for the depth of her eDiscovery knowledge. Before joining ACEDS, Mary was the Enterprise Technology Counsel for ZyLAB, a global eDiscovery and Intelligent Information Governance software company focused on helping organizations insource eDiscovery to reduce legal spend and prevent privacy breaches and IP leakage. Prior to eDiscovery, Mary designed, coded, tested and maintained mission-critical enterprise systems for banks, insurers and pharmaceutical companies. Certified in eDiscovery, security, access and identity management, forensics and computer telephony, Mary is admitted to the Illinois bar and a graduate of Northwestern University School of Law. Mary is the author of A Process of Illumination: The Practical Guide to Electronic Discovery and the co-editor of Thomson Reuters West’s eDiscovery for Corporate Counsel.

Thursday, August 30th, 2018 – 1-2 PM ET – Ask the Expert – Mike Quartararo on the eDiscovery Project Management Landscape – Tools Overview

The author of the first and only book on eDiscovery project management, Project Management in Electronic Discovery: An Introduction to Core Principles of Legal Project Management and Leadership In eDiscovery, which merges project management principles and best practices in electronic discovery, Mike Quartararo will be in the ACEDS webinar studio for an Ask the Expert. Mike is the founder and managing director of eDPM Advisory Services, a consulting firm meeting the emerging eDiscovery, project management and legal technology needs of law firms, corporate legal departments and service provider organizations. Mike has been solving client problems using technology for 20 years. He has built his career upon strategic and innovative thinking, leadership and operational skills he honed while working at the best legal organizations in the world. A former law firm director, project manager, database specialist and paralegal, Mike has decades of experience delivering eDiscovery, project management and legal technology services to law firms and Fortune 500 corporations across the globe. He has worked in legal technology and litigation support at large law firms, including ten years at Skadden Arps Slate Meagher & Flom LLP, and more recently as the firm-wide director of litigation support at Stroock & Stroock & Lavan LLP. As an adjunct professor at Arizona-based Bryan University, Mike co-designed and taught a graduate program on eDiscovery project management. He is a graduate of the State University of New York and studied law for one year at the University of London. He is a certified Project Management Professional (PMP) and a Certified eDiscovery Specialist (CEDS). He sits on the national board of the Association of eDiscovery Specialists and is the ACEDS liaison member to the Advisory Committee of Duke-EDRM. Mike frequently writes and speaks on issues related to project management, eDiscovery and litigation support.

As usual, Kaylee Walstad and Mary Mack will serve up some poll questions to see what is on the participants’ minds.  Mike will survey the eDiscovery Project Management tool landscape prior to our session. If you want your favorite tool represented, please reach out to Mike or Kaylee. Learn more and register for the webinar here: Ask the Expert: Mike Quartararo on the eDiscovery Project Management Landscape – Tools Overview.

All sessions qualify for CEDS credits and are reciprocal with AIIM’s continuing education credits to maintain the CIP certification. Many of the sessions above will qualify for ISC2 CPE credits, with the non-security sessions eligible as professional development. Please share these sessions with people who may be interested.

Gargantuan AI fail

Imagine this…a long day at a conference, your back hurts, your feet hurt and your mood is just a wee bit surly after networking all day. You wander through the hotel and see the light…a place to get a drink. You stop, unload your coat and computer bag, have a seat and look up to call the bartender over only to be greeted with a tablet screen. That’s right…no bartender.

You have found yourself at Robo Bar, a robotic bar system designed by the Italian company MAKR SHAKR.  What the heck, let’s try it out. To create and order drinks, you access a customized app on the pre-set tablet. All the orders currently in queue and the ingredients being added are displayed on four 92-inch LED screens, along with real-time infographics and videos.

You can then sit back and watch your “bartender,” whose name, by the way, is Jengo, muddle, stir, shake, strain and serve drinks in a highly social, exciting and interactive environment. These robotic arms mimic the actions of a real bartender, from the shaking of a cocktail to the slicing of a lemon to the muddling of a Cuba Libre. And, true story, because I read it on the Internet, in order to create an even more engaging bar experience, all the robotic movements were modeled on the gestures of the Italian dancer and choreographer Marco Pelle from the New York Theatre Ballet.

So, last week, to celebrate my wedding anniversary and birthday, my husband and I took a trip to Biloxi, MS. At some point in my shopping frenzy, I found myself in front of one of the only Robo Bars in the city, at the Beau Rivage Hotel and Casino. Hey, they have one at the Biloxi Hard Rock Casino as well. Robo Bar must be big on the Gulf Coast.

I pulled out my Samsung Note 9+ and started videotaping this robotic arm making drinks. The place was crowded, people were excited to try it and it was all quite novel…well, until IT happened. Jengo decided to go a little wild as the video will show. He? It? She? Well, whatever, that’s right, the robot decided to load up on glasses “just because.” It was an epic fail as glasses started falling over as Jengo tried to stack them one on top of the other and just kept stacking. No clue that glasses were falling all over, since, well, Jengo has no eyes. Everyone was laughing hysterically as the glass situation got out of hand. In rushed a human to turn Jengo off, push some buttons and set him straight. Back to business as usual.

I tell this story to point out the fact that AI is definitely not ready for prime time. All the technology in the world we’re using today, especially in the security field, still has its glitches. Things like:

  • Facial recognition
  • License plate recognition
  • Fingerprint recognition

Imagine the length of time and amount of data needed to put together a true AI application. There are so many “what if this, then that” factors to consider. The bottom line is this: if you don’t have the data, you most definitely don’t even want to touch AI. According to asmag.com, China seems a bit ahead of the United States, which is still in the beginning stages of good, reliable AI. Andrew Elvish, VP of Marketing at Genetec, a global provider of IP video surveillance, access control and license plate recognition solutions, stated, “I think the computing power has a long way to go to catch up, and it’s not just a question of computing, it’s a question of do we have the algorithms and the ability to build true artificial intelligence.”

As far as eDiscovery and AI go, will robots be replacing lawyers? “Not anytime soon,” according to David Lat, editor at large and founding editor of Above the Law. He also said that “certain parts of a lawyer’s job, especially those aspects that are more rote or mechanical, will be outsourced to technology.” We’re all familiar with TAR, but is it failproof? You make the call. Or you could swing by the closest Robo Bar and ask Jengo. Oh, that’s right, Jengo doesn’t talk to its customers.

IMNSHO, no matter how good the technology, there will always be mistakes and failures, and unless there is a live person to circumvent the possible mistakes and failures of AI, it is simply too early in the game to rely on it 100%. Open the pod bay doors, HAL. Scary stuff!

For a good laugh and an example of an epic AI fail, watch the 3-minute Jengo epic fail video on the YouTube eDiscovery Channel. See you next week.


CEDS Spotlight – South Africa

Danny Myburgh, CEDS
Litigation Support, Cyanre – The Digital Forensic Lab (Pty) Ltd

Please share your thoughts on the certification training, how long it took you to prepare for the exam, thoughts on the exam and how it has benefitted you, both the knowledge gained from training and certification and being part of the ACEDS […]

CEDS Spotlight – Canada

Kathy Dallaire, CEDS

Please share your thoughts on the certification training, how long it took you to prepare for the exam, thoughts on the exam and how it has benefitted you, both the knowledge gained from training and certification and being part of the ACEDS community as a whole. (Whole experience)

I first became aware of ACEDS a number of years ago while preparing a business case for implementing litigation support services within our firm to centralize ediscovery services and support in-house as well as to standardize procedures and policies.

Since I still had a full schedule of paralegal work with trials and hearings, it took more time than anticipated to actually get to setting up our litigation support capabilities and eventually to the CEDS training and certification.

In light of my caseload, it made sense for me to split the training in two parts.  I took and successfully completed ACEDS’ ediscovery Essentials training in early 2017.  I found it to be just the right measure to encourage and push me to want to actually get certified.

My actual review of the materials for the exam began slowly in April of this year.  The materials are extremely well prepared and well written.  I registered for the live preparation seminars in June although I had not yet gotten through all of the materials.  The live sessions were given by Helen Bergman Moure and I was able to participate in all three.  To say the sessions were good and of value would be an understatement.  Helen was able to tie all of the information together in a logical and meaningful manner, with examples and real-life experience.

Finally, I decided to use a week of vacation time to review what remained of the materials and study for the exam.  I scheduled the exam for the Friday of that week and I planned to be completely ready.  I definitely underestimated the time it would take for a meaningful review but I was determined to get through it and make it to the exam as scheduled.  It made for long days and long nights but I actually enjoyed it and was totally immersed in the materials.  The legal framework was completely new for me as we have a different legal system, but the information was clear and concise.

As further preparation for the exam and following Helen’s advice during the preparation sessions, I printed and read through the relevant sections of the FRCP, including the Committee Notes on the Rules – this was essential advice.  I also took the practice exam as well as the “Test Your ediscovery IQ” questions on the ACEDS website which I found extremely useful in terms of the type of questions.  I read through other CEDS’ experience with the exam, including Mary Mack’s and I watched the prerecorded videos of the preparations sessions, again.

I was nervous, to say the least, but I got through the exam.  I agree with the advice given by others:  answer all questions, and time is of the essence.  The four hours passed quickly and when I reached the end, including time to review those questions I had marked for further review, I had only 10 minutes to spare.

Why did you decide to get certified? Do you have any other certifications?

Having provided support and services internally in an unofficial capacity for many years, and then officially as Litigation Support Coordinator, it became important to be certified in order to have credibility. More and more, clients are requiring specific ediscovery capabilities and credentials from the attorneys and firms they engage with. It is a fast-paced environment, and providing clients with the assurance that the work is performed in the most efficient and state-of-the-art manner is essential.

On the personal side, actually going through the learning portion and successfully completing the exam gave me a definite boost of confidence.

I am very fortunate that the firm I have worked for for decades fully supports my efforts for certification and recognizes my efforts to that end.  I am very grateful for this support.

The other certifications I have obtained are as an AD Summation Administrator, AD Summation Case Manager and AD Summation Reviewer (consolidating into the AD Summation Specialist certification).

Please share your background of eDiscovery experience:

Although our jurisdiction does not have the same discovery obligations and we are not required to produce all potentially relevant information to adversaries in the discovery phase of litigation to the same extent, ediscovery principles do still apply in other phases of litigation for us.  Having the ability to process data and efficiently reduce the volume, knowing where to look for the information in clients’ environment, searching for responsive information within a collection to use as evidence, preparing productions, redactions and other branding, among others, are also essential in our legal context as well as being able to provide quality assurance for all of these processes.

Would you recommend our CEDS training/certification to others?


Advice to others looking to take the exam?

Give yourself enough time to go through the materials.  Follow the study guide.   Read the relevant sections of the FRCP and the comments.

I would recommend doing the Essentials training at the same time as the other materials; either way, take copious notes. I was certainly glad I had.

Any other thoughts to share?

I am grateful for the support and encouragement received along the way, from other CEDS and everyone at ACEDS. I am very proud to have finally become a part of this community.

Sedona Conference Releases Primer on Social Media, 2nd Edition for Comments – July 2018

The Sedona Conference Working Group on Electronic Document Retention and Production has released a Primer on Social Media, 2nd Edition, and it is now available and open for public comment through September 10, 2018. Additionally, there will be a 90-minute webinar on the public comment version on Wednesday, August 8th at 1:00 P.M. EDT. There is a $99 fee for the public. The webinar will be hosted by none other than Ken Withers of the Sedona Conference.

For those who aren’t familiar with the Sedona Conference, it is a 501(c)(3) research and educational institute dedicated to the advanced study of law and policy in the areas of antitrust law, complex litigation, and intellectual property rights. Their stated mission is to “move the law forward in a reasoned and just way” and, to that end, they have been very active in the eDiscovery community over the years, most notably for the Sedona Conference Cooperation Proclamation (https://thesedonaconference.org/cooperation-proclamation).

I’m glad to see this updated version of their 2012 Primer since “the times, they are a changin’.” This version prominently reflects the new issues that social media has brought to the litigation table, particularly in two areas: digital messaging and the 2015 amendments to the FRCP, with the subsequent effect those changes had on using social media as evidence.

The primer from 2012 refers to the “hundreds of millions” of people on social media. It’s now billions! That version started off with primarily organizational issues whereas the new Primer wastes no time diving into the heart of new technology issues in discovery and the challenges that the various platforms, particularly messaging applications, bring to the table.

The second major focus is the changes to the FRCP that took effect in December of 2015. The Primer specifically addresses the issues surrounding social media disputes and discusses in detail an area that I often address in CLEs, the authentication of social media evidence.

Specifically, it discusses assessing whether to preserve, how to request with specificity, how to search for, and how to produce social media evidence, e.g.:

  • which social media sources are likely to contain relevant information;
  • who has possession, custody, or control of the social media data;
  • the date range of discoverable social media content;
  • what information is likely to be relevant;
  • the value of that information relative to the needs of the case;
  • the dynamic nature of the social media and user-generated content;
  • reasonable preservation and production formats; and
  • confidentiality and privacy concerns related to parties and non-parties.

Many model rules are cited throughout the Primer, as well as relevant caselaw. It covers everything you want to know about proportionality, privacy, requesting social media evidence, and who has possession, custody and control (and just what “control” means). It also delves into areas such as:

  • the details of preserving and collecting social media;
  • the role cooperation plays;
  • reasonable steps in dealing with social media;
  • capturing static images; and
  • self-collection as opposed to using an application interface.

I also learned from this Primer (at page 325) that it is possible to get an exact native file of collected content from a social media site. And of course, prominent considering the recent Carpenter decision, the Primer discusses the Stored Communications Act and its various restrictions. Even though a third party may store your data, that doesn’t mean a person litigating against you cannot gain access to it.

Finally, sections on Cross Border Discovery Issues and Ethical Considerations ensure that this Primer is completely up to date with the most recent topics in eDiscovery.  And to further guarantee that it stays current, the primer is open for public comment through September 10, 2018 by email to comments@sedonaconference.org and suggestions for improvement are very welcome. After the deadline for public comment has passed, the drafting team will review the public comments and determine what edits are appropriate for the final version.


57 Ways to Leave Your (Linear) Lover – A Case Study on Using Insight Predict to Find Relevant Documents Without SME Training

A Big Four accounting firm with offices in Tokyo recently asked Catalyst to demonstrate the effectiveness of Insight Predict, a technology assisted review (TAR) platform based on continuous active learning (CAL), on a Japanese language investigation. They gave us a test population of about 5,000 documents which had already been tagged for relevance. In fact, their linear review had found only 55 relevant documents.

We offered to run a free simulation designed to show how quickly Predict would have found those same relevant documents. The simulation would be blind: Predict would not know how the documents were tagged until it presented its ranked list. That way we could simulate an actual Predict review using CAL.

We structured the simulated Predict review to be as realistic as possible, looking at the investigation from every conceivable angle. The results were outstanding; we couldn’t believe what we saw. So we ran it again, using a different starting seed. And again. And again. In all, we ran 57 different simulations: one starting from each of the 55 relevant documents as a single seed, one from a non-relevant seed, and one from a synthetic seed.

Regardless of the starting point, Predict was able to locate 100% of the relevant documents after reviewing only a fraction of the collection. You won’t believe your eyes either.

Complicating Factors

Everything about this investigation would normally be challenging for a TAR project.

To begin with, the entire collection was in Japanese. Like other Asian languages, Japanese documents require special attention for proper indexing, which is the first step in feature extraction for a technology assisted review. At Catalyst, we incorporate semantic tokenization of the CJK languages directly into our indexing and feature extraction process. The value of that approach for a TAR project cannot be overstated.

To complicate matters further, the collection itself was relatively small, and sparse. There were only 4,662 coded documents in the collection and, of those, only 55 total documents were considered responsive to the investigation. That puts overall richness at only 1.2%.

The following example illustrates why richness and collection size together compound the difficulty of a project. Imagine a collection of 100,000 documents that is 10% rich. That means that there are 10,000 responsive documents. That’s a large enough set that a machine learning-based TAR engine will likely do a good job finding most of those 10,000 documents.

Next, imagine another collection of one million documents that is 1% rich.  That means that there are also 10,000 responsive documents. That is still a sizeable enough set of responsive documents to be able to train and use TAR machinery, even though richness is only 1%.

Now, however, imagine a collection of only 100 documents that is 1% rich. That means only one document is responsive: either you’ve found it, or you haven’t. There are no other responsive documents that, through training of a machine learning algorithm, can lead you to the one responsive document. So a 1% rich million-document collection is a very different creature than a 1% rich 100-document collection. These are extreme examples, but they illustrate the point: small collections are difficult, low-richness collections are difficult, and small, low-richness collections are extremely difficult.
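The arithmetic behind these three hypothetical scenarios is simple enough to check directly:

```python
# Responsive-document counts for the three hypothetical collections above.
scenarios = [
    (100_000, 0.10),    # 100,000 docs, 10% rich
    (1_000_000, 0.01),  # one million docs, 1% rich
    (100, 0.01),        # 100 docs, 1% rich
]

responsive_counts = [round(size * richness) for size, richness in scenarios]
for (size, richness), count in zip(scenarios, responsive_counts):
    print(f"{size:>9,} docs at {richness:.0%} richness -> {count:,} responsive")
```

The first two collections each yield 10,000 responsive documents to learn from; the third yields exactly one.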

Small collections like these are nearly impossible for traditional TAR systems because it is difficult to find seed documents for training. In contrast, Predict can start the training with the very first coded document. This means that Predict can quickly locate and prioritize responsive documents for review, even in small document sets with low richness.

Compounding these constraints, nearly 20% (10 out of 55) of the responsive documents were hard copy Japanese documents that had to be OCR’d. As a general matter, it can be somewhat difficult to effectively OCR Japanese script because of the size of the character set, the complexity of individual characters, and the similarities between the Kanji character structures. Poor OCR will impair feature extraction which will, in turn, diminish the value of a document for training purposes, making it much more difficult to find responsive documents, let alone find them all.

Simulation Protocol

To test Predict, we implemented a fairly standard simulation protocol—one that we used for NIST’s TREC program and often use to let prospective clients see how well Predict might work on their own projects. After making the text of the documents available for ingestion into Predict, we simulate a Predict prioritized review, applying the existing coding judgments in a just-in-time manner, and we prepare a gain curve to show how quickly responsive documents are located.

Since this collection was already loaded into our discovery platform, Insight Discovery, we had everything we needed to get the simulation underway: document identification numbers (Bates numbers); extracted text and images for the OCR’d documents; and responsiveness judgments. Otherwise, the client simply could have provided that same information in a load file.

With the data loaded, we simulated different Predict reviews of the entire collection to see how quickly responsive documents would be located using different starting seeds. To be sure, we didn’t need to do this just to convince the client that Predict is effective; we also wanted to do a little scientific experimentation of our own.

Here is how the simulation worked:

  1. In each experiment, we began by choosing a single seed document to initiate the Predict ranking, to which we applied the client’s responsiveness judgment. We then ranked the documents based on that single seed.[1]
  2. Once the initial ranking was complete, we selected the top twenty documents for coding in ranked order (with their actual relevance judgments hidden from Predict).[2]
  3. We next applied the proper responsiveness judgments to those twenty documents to simulate the review of a batch of documents, and then we submitted all of those coded documents to initiate another Predict ranking.

We continued this process until we had found all the responsive documents in the course of each review.
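The three steps above can be sketched as a small simulation loop. This is a toy illustration only: the `score` function is a naive token-overlap stand-in, not Catalyst’s actual Predict ranking algorithm, and the documents and labels are hypothetical.

```python
# Toy simulation of the CAL protocol described above. The scoring function is a
# naive token-overlap stand-in, NOT the actual Predict ranking algorithm.

def score(text, responsive_texts):
    """Hypothetical ranking signal: word overlap with known responsive docs."""
    words = set(text.split())
    return max((len(words & set(r.split())) for r in responsive_texts), default=0)

def simulate_cal(docs, labels, seed_id, batch_size=20):
    """Review docs in ranked batches; returns review order and responsive hits."""
    reviewed = [seed_id]                              # step 1: code the seed
    responsive_texts = [docs[seed_id]] if labels[seed_id] else []
    found = [seed_id] if labels[seed_id] else []
    total_responsive = sum(labels.values())
    while len(found) < total_responsive:
        unreviewed = [d for d in docs if d not in reviewed]
        # step 2: rank the unreviewed documents, take the top batch
        batch = sorted(unreviewed,
                       key=lambda d: score(docs[d], responsive_texts),
                       reverse=True)[:batch_size]
        # step 3: apply the true coding judgments, then re-rank next round
        for d in batch:
            reviewed.append(d)
            if labels[d]:
                responsive_texts.append(docs[d])
                found.append(d)
    return reviewed, found

# Tiny illustrative collection (hypothetical data).
docs = {1: "widget sale contract", 2: "lunch menu", 3: "contract breach widget",
        4: "weather report", 5: "widget contract dispute"}
labels = {1: True, 2: False, 3: True, 4: False, 5: True}
reviewed, found = simulate_cal(docs, labels, seed_id=1, batch_size=2)
```

The key property this sketch shares with the real protocol is that every newly coded batch feeds back into the next ranking, so the review continuously steers toward pockets of responsive documents.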

First Simulation

For our first simulation, we started the CAL process with a randomly selected relevant document as the seed. We then let Predict rank the remaining documents based on that initial seed and present the 20 highest-ranked documents for review. We gave Predict the tagged values (relevant or not) for these documents and ran a second ranking (now based on 21 seeds). We continued the process until we ran out of documents.

Figure 1

As is our practice, we used a gain curve to uniformly evaluate the results of the simulated reviews. A gain curve is helpful because it allows you to easily visualize the effectiveness of every review. On the horizontal x-axis, we plot the number of documents reviewed at every point in the simulation. On the vertical y-axis, we plot the number of documents coded as responsive at each of those points. The faster the gain curve rises, the better, because that means you are finding more responsive documents more quickly, and with less review effort.

The black diagonal line shows how a linear review would proceed, with the review team finding 50% of the relevant documents after reviewing 50% of the total document population and 100% after reviewing 100% of the total.
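A gain curve as described is straightforward to compute from a review log. Assuming you have the per-document relevance judgments in review order, the curve is just a running total:

```python
from itertools import accumulate

def gain_curve(labels_in_review_order):
    """Cumulative responsive count after each reviewed document.
    x-axis: documents reviewed (index + 1); y-axis: the returned values."""
    return list(accumulate(int(lbl) for lbl in labels_in_review_order))

# Example: ten documents reviewed, responsive ones marked True.
curve = gain_curve([True, True, False, True, False,
                    False, True, False, False, False])
```

Plotting `curve` against its index gives the gain curve; the steeper its early rise above the diagonal, the less review effort wasted on non-responsive documents.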

The red line in Figure 1 shows the results of the first simulation, using the single initial random seed as a starting point (compared to the black line, representing linear review). Predict quickly prioritized 33 responsive documents, achieving a 60% recall upon review of only 92 documents.

While Predict’s efficiency diminished somewhat as the responsive population was depleted and the relative proportion of OCR’d documents increased, Predict was able to prioritize fully 100% of the responsive documents within the first 1,491 documents reviewed (32% of the entire collection). That represents a savings of 68% of the time and effort that would have been required for a linear review.

Second Test

The results from the first random seed looked so good that we decided to try a second random seed, to make sure it wasn’t pure happenstance. Those results were just as good.

Figure 2

In Figure 2, the gray line reflects the results of the second simulation, starting with the second random seed. The Predict results were virtually indistinguishable through 55% recall, but were slightly less efficient at 60% recall (requiring the review of 168 documents). The overall Predict efficiency recovered almost completely, however, prioritizing 100% of the responsive documents within the first 1,507 documents (32.3%) reviewed in the collection—a savings again of nearly 68% compared with linear review.

Third Simulation

The results from the first and second runs were so good that we decided to continue experimenting. In the next round, we wanted to see what would happen if we used a lower-ranked (more difficult for the algorithm to find) seed to start the process. To accomplish that, we chose the lowest-ranked relevant document found by Predict in the first two rounds as a starting seed. This turned out to be an OCR’d document, likely the most atypical responsive document in the collection. To our surprise, Predict was just about as effective starting with this low-ranked seed as it had been before. Take a look and see for yourself.[3]

Figure 3

The yellow line in Figure 3 shows what happened when we started with the last document located during the first two simulations. The impact of starting with a document that, while responsive, differs significantly from most other responsive documents is obvious: after reviewing the first 72 documents prioritized by Predict, only one responsive document had been found. However, the ability of Predict to quickly recover when pockets of responsive documents are found is equally apparent. Recall reached 60% upon review of just 179 documents — only slightly more than the second simulation required. Predict then surpassed both previous simulations, achieving 100% recall upon review of only 1,333 documents — 28.6% of the collection, and a savings of 71.4% against a linear review.

Fourth Round

We couldn’t stop here. For the next round, we decided to use a random non-responsive document as the starting point. To our surprise, the results were just as good as the earlier rounds. Figure 4 illustrates these results.

Figure 4

Fifth Round

We decided to make one more simulation run just to see what happened. For this final starting point, we created a synthetic responsive Japanese document. We composited five responsive documents selected at random into a single synthetic seed, started there, and achieved much the same results.[4]

Figure 5

Sixth through 57th Rounds

The consistency of these five results seemed so interesting that, for the heck of it, we ran simulations using every single responsive document in the collection as a starting point. So, although it wasn’t our plan at the outset, we ultimately simulated 57 Predict reviews across the collection, each from a different starting point (all 55 relevant documents, one non-relevant document, and one synthetic seed).

You can see for yourself from Figure 6 that the results from every simulated starting point were, for the most part, pretty consistent. Regardless of the starting point, once Predict was able to locate a pocket of responsive documents, the gain curve jumped almost straight up until about 60% of the responsive documents had been located.

Gordon Cormack once analogized this ability of a continuous active learning tool to a bloodhound—all you need to do is give Predict the “scent” of a responsive document, and it tracks them down. And in every case, Predict was able to find every one of the responsive documents without having to review even one-third of the collection.

Here is a graph showing the results for all of our simulations:

Figure 6

And here are the specifics of each simulation at recall levels of 60%, 80% and 100%.

Percentage of Collection Reviewed to Achieve Recall Levels
DocID    60%    80%    100%
27096 4% 15% 29%
34000 2% 11% 32%
35004 4% 12% 32%
83204 3% 11% 32%
86395 4% 14% 32%
93664 2% 13% 32%
98263 3% 11% 29%
98391 2% 13% 32%
98945 3% 11% 32%
99708 4% 12% 32%
99773 2% 10% 32%
99812 2% 11% 32%
99883 2% 12% 32%
99918 5% 14% 32%
100443 4% 12% 32%
100876 3% 13% 32%
101211 4% 12% 32%
101705 3% 14% 31%
101829 3% 11% 31%
102395 3% 13% 32%
102432 4% 14% 32%
102499 2% 9% 32%
102705 3% 14% 32%
103803 4% 12% 32%
105017 2% 14% 32%
105799 3% 13% 32%
106993 2% 12% 30%
107315 2% 14% 32%
109883 4% 12% 32%
110350 3% 15% 30%
112905 4% 14% 32%
117037 4% 12% 32%
118353 4% 14% 32%
119216 4% 15% 32%
119258 2% 12% 32%
119362 2% 10% 32%
121859 3% 11% 32%
122000 4% 15% 29%
122380 5% 11% 30%
123626 3% 10% 32%
123887 3% 11% 32%
124517 3% 14% 32%
125901 3% 14% 32%
130558 2% 14% 32%
131255 4% 10% 32%
132604 2% 10% 32%
136819 3% 14% 29%
140265 4% 13% 32%
140543 4% 12% 32%
147820 3% 14% 32%
154413 4% 13% 32%
238202 4% 12% 32%
242068 4% 12% 32%
245309 4% 16% 32%
248571 4% 12% 32%
NR 3% 14% 32%
SS 2% 13% 31%
Min 2% 9% 29%
Max 5% 16% 32%
Avg 3% 13% 32%

Table 1
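The Min, Max, and Avg rows in Table 1 are simple column aggregates over the per-simulation results. A sketch of how such summary rows could be reproduced (the dictionary below is an illustrative subset of the table, not the full 57 rows):

```python
# Reproducing Table 1's summary rows from per-simulation results.
# Values are an illustrative subset of the table, not the full 57 rows.
results = {
    "27096": (4, 15, 29),   # % reviewed at 60%, 80%, 100% recall
    "34000": (2, 11, 32),
    "98263": (3, 11, 29),
}

columns = list(zip(*results.values()))  # one tuple per recall level
summary = {
    "Min": [min(c) for c in columns],
    "Max": [max(c) for c in columns],
    "Avg": [round(sum(c) / len(c)) for c in columns],
}
```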

As you can see, the overall results mirrored our earlier experiments, which makes a powerful statement about the ease of using a CAL process. Special search techniques and different training starts made very little difference in these experiments. We saw the same in our TREC 2016 experiments, where we tested different, minimalist methods of starting the seeding process (e.g., one quick search, limited searching) and found little difference in the results. See our report and study here.

What did we learn from the simulations?

One of the primary benefits of a simulation, as opposed to running CAL on a live matter, is that you can vary and control virtually every aspect of the review to see how the system and results change when the parameters change. In this case, we varied the starting point but kept every other aspect of the simulated review constant. That way, we could compare the simulations against one another, identify any differences, and determine whether one approach is better than another.

The important takeaway is that the review order in these experiments is exactly the review order the client would have achieved had they reviewed these documents in Predict, at a standard review rate of about one document per minute, and made the same responsiveness decisions on the same documents.

Averaged across all of our experiments, Predict found just over half of all responsive documents (50% recall) after reviewing only 89 documents (1.9% of the collection; 98.1% savings). Predict achieved 75% recall after reviewing only 534 documents (11.5% of the collection; 88.5% savings). And finally, Predict achieved an otherwise unheard-of 100% recall on this collection after reviewing only 1,450 documents (31.1% of the collection; 68.9% savings).
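The percentage and savings figures quoted throughout follow directly from the collection size (4,662 coded documents), and can be verified with a one-line calculation:

```python
COLLECTION_SIZE = 4662  # coded documents in this collection

def review_stats(docs_reviewed, collection=COLLECTION_SIZE):
    """Percent of the collection reviewed and the corresponding savings."""
    pct_reviewed = round(docs_reviewed / collection * 100, 1)
    return pct_reviewed, round(100 - docs_reviewed / collection * 100, 1)

print(review_stats(89))    # 50% recall -> (1.9, 98.1)
print(review_stats(534))   # 75% recall -> (11.5, 88.5)
print(review_stats(1450))  # 100% recall -> (31.1, 68.9)
```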

Furthermore, Predict is robust to differences in initial starting conditions, though some starting conditions are slightly better than others. In one case, we achieved 50% recall after only 65 documents (1.4% of the collection; 98.6% savings), whereas in another it took 163 documents to reach 50% recall (3.5% of the collection; 96.5% savings). However, the latter achieved 100% recall after only 1,352 documents (29% of the collection; 71% savings), whereas the former required 1,507 documents (32.3% of the collection; 67.7% savings).

Overall, the key is not to focus on minute differences, because all these results are within a relatively narrow performance range and follow the same general trend.

Other key takeaways:

  1. Predict’s implementation of CAL works extremely well on low-richness collections. With only 55 relevant documents in a collection of nearly 5,000, finding the next relevant document is typically difficult, but Predict excelled nonetheless.
  2. This case involved OCR’d documents. Some people have suggested that TAR might not work well with OCR’d text but that has not been our experience. Predict worked well with this population.
  3. All documents were in Japanese. We have written about our success in ranking non-English documents but some have expressed doubt. This study again illustrates the effectiveness of Predict’s analytical tools when the documents are properly tokenized.

These experiments show that there are real, significant savings to using Predict, no matter the size, richness or language of the document collection.


Paul Simon, that great legal technologist, knew long ago that it was time to put an end to keywords and linear review:

The problem is all inside your head, she said to me.
The answer is easy if you take it logically.
I’d like to help you as we become keyword free.
There must be fifty-seven ways to leave your (linear) lover.

She said it’s really not my habit to intrude.
But this wasteful spending means your clients are getting screwed.
So I repeat myself, at the risk of being cruel.
There must be fifty-seven ways to leave your linear lover,
Fifty-seven ways to leave your (linear) lover.

Just slip out the back, Peck, make a new plan, Ralph.
Don’t need to be coy, Gord, just listen to me.
Hop on the bus, Craig, don’t need to discuss much.
Just drop the keywords, Mary, and get yourself (linear) free.

She said it grieves me so to see you in such pain.
When you drop those keywords I know you’ll smile again.
I said, linear review is as expensive as can be.
There must be fifty-seven ways to leave your (linear) lover.

Just slip out the back, Shira, make a new plan, Gord.
Don’t need to be coy, Joy, just listen to me.
Hop on the bus, Tom, don’t need to discuss much.
Just drop the keywords, Gayle, and get yourself (linear) free.

She said, why don’t we both just sleep on it tonight.
And I believe, in the morning you’ll begin to see the light.
When the review team sent their bill I realized she probably was right.
There must be fifty-seven ways to leave your (linear) lover.
Fifty-seven ways to leave your (linear) lover.

Just slip out the back, Maura, make a new plan, Fatch.
Don’t need to be coy, Andrew, just listen to me.
Hop on the bus, Michael, don’t need to discuss much.
Just drop off the keywords, Herb, and get yourself (linear) free.



[1] We chose to initiate the ranking using a single document simply to see how well Predict would perform in this investigation from the absolute minimum starting point. In reality, a Predict simulation can use as many responsive and non-responsive documents as desired. In most cases, we use the same starting point (i.e., the exact same documents and judgments) used by the client to initiate the original review that is being simulated.

[2] We chose to review twenty documents at a time because that is what we typically recommend for batch sizes in an investigation, to take maximum advantage of the ability of Predict to re-rank several times an hour.

[3] It is interesting to note that Predict did not find relevant documents as quickly at the outset when starting from this atypical seed, which isn’t surprising. However, it caught up with the earlier simulations by the 70% recall mark and proved just as effective.

[4] Compositing the text of five responsive documents into one is a reasonable experiment to run. But it’s not what most people think of when they think synthetic seed. They imagine some lawyer crafting verbiage him- or herself, writing something up about what they expect to find, in their own words. And then using that document to start the training. Using the literal text of five documents already deemed to be responsive is not the same thing but it made for an interesting experiment.

Judge Facciola Says Carpenter Decision May Signal the End of the Third Party Doctrine

The Carpenter decision has been the focus of many discussions since it came down last week. In a closely watched case, the Supreme Court ruled 5-4 that police access to a person’s historical cell-site location records (covering seven days or more) violates the Fourth Amendment because it infringes the person’s legitimate expectation of privacy. The Court held that for these records, a search warrant is mandatory.

I was hoping for a slightly different perspective than the extensive commentary already written on privacy and the Fourth Amendment. A little voice told me to ask for a reaction from someone with a deep interest in the topic, Judge John Facciola. Actually, it was the not-so-little voice of Tom O’Connor, who said, when I mentioned my topic, that he and Judge Facciola had engaged in numerous conversations about the declining state of privacy. So, I thought, well, what the heck, maybe he’ll talk to me about it too!

For those unfamiliar with his background, the Honorable John Facciola is a retired United States Magistrate Judge for the United States District Court for the District of Columbia. He is currently an Adjunct Professor at Georgetown University Law Center, an eDiscovery expert and a preeminent scholar in this area of law. Oh yes, he is also a native New Yorker and a graduate of a small Catholic college in Woostah, MA, which endears him to Tom, and a fervent fan of The Boss, which is what really endears him to me.

So, I sent the Judge an email asking if he had any comments on the case, and imagine my surprise when, a short time later, my phone rang and it was him. A judge! Calling me!! Wow!!!

We had a great chat and of course touched on both the Fourth Amendment and the Third-Party Doctrine. Again, a little explanation here just to set the framework for the discussion. The Fourth Amendment was part of the Bill of Rights, ratified on December 15, 1791. It protects people from unreasonable searches and seizures, which means that the police can’t search you or your house without a warrant or some so-called “exigent circumstances” (e.g., an imminent threat of bodily harm) that allow them to proceed without a warrant.

The Third-Party Doctrine is a legal theory which holds that people who voluntarily give information to third parties—such as banks, phone companies, internet service providers (ISPs) or SaaS companies—have no “reasonable expectation of privacy.” The expectation of privacy is crucial to distinguishing a legitimate, reasonable police search from an unreasonable one and goes back to Katz v. United States, 389 U.S. 347 (1967).

Disputes over using cell phone data to track an individual go back for years (see M. Wesley Clark, Cell Phones as Tracking Devices, 41 Val. U. L. Rev. 1413 [2007]) and Judge Facciola has long been an opponent of this trend.  As far back as January of 2006, in a case involving a broad interpretation of the PATRIOT Act expansion of the definition of “pen register,” he wrote:

It is inconceivable to me that the Congress that precluded the use of the Pen Register statute to secure in 1994 ‘transaction data’…nevertheless intended to permit the government to use that same statute…to secure the infinitely more intrusive information about the location of a cell phone every minute of every day that the cell phone was on. I cannot predicate such a counter-intuitive conclusion on the single word ‘solely.’ (In the Matter of the Application of the United States of America for an Order Authorizing the Release of Prospective Cell Site Information, 2006 WL 41229 [D.D.C. Jan. 6, 2006])

So, it came as no surprise that when speaking with Judge Facciola, he agreed with the decision in Carpenter since the Court seemed keenly aware that, as CJ Roberts stated in his opinion, we are now faced with “… an entirely different species of business record—something that implicates basic Fourth Amendment concerns about arbitrary government power much more directly than corporate tax or payroll ledgers. When confronting new concerns wrought by digital technology, this Court has been careful not to uncritically extend existing precedents. See Riley, 573 U. S., at ___ (slip op., at 10).”

The old view of the third-party doctrine must yield to new concerns about recent technology or what CJ Roberts called “the critical issue” of “basic Fourth Amendment concerns about arbitrary government power” that are “wrought by digital technology.”

Overall, the Roberts Court seems to understand the importance of electronic privacy, especially when Carpenter is coupled with the previous decisions in United States v. Jones (2012), which required a warrant before police placed a GPS tracker on a vehicle, and Riley v. California (2014), which forbade warrantless searches of a cell phone during an arrest.

First, he felt that Justice Gorsuch had a great deal to say in his dissenting opinion and that it has truly far-reaching implications for the future of privacy. Justice Gorsuch felt that the majority did not confront the third-party doctrine head-on and relied, instead, on the nature of the data in question. His dissent reads much like a concurrence on other grounds, but he argued that rather than focus on the reasonable-expectation-of-privacy analysis, the Court should have followed a property-rights-based theory of the Fourth Amendment, focusing on the exact words of the amendment, which speak of a search of a person’s “papers, and effects.” Doesn’t that include the data a person creates, and doesn’t the ban on unreasonable searches pertain to it?

Thus, with a new appointment to the Court on the horizon, the Gorsuch dissent may send a message to future defendants that a property-based argument will be necessary to carry the day as the Court retreats from the vague notion of an “expectation of privacy” analysis to one premised on the words of the Fourth Amendment.

In addition, the Judge notes that setting a warrant standard isn’t the end of the discussion, but the beginning. Yes, a warrant can be issued only on a showing of “probable cause,” but it must also particularize what is to be searched and seized. How will that requirement be met when the government seeks the entire contents of a digital file, whether it is a Facebook page or the GPS location now buried in my Galaxy?

I concluded by asking Judge Facciola to sit back in his chair and think it over one more time and tell me whether someday we’ll look back on this and it will all seem funny.  His reply? “The fun is just beginning.”