Question: I should have the data early next week to start loading. It’s looking like it will be about X gigs. Can you remind me how Disco handles de-duping? We’ve collected email from a number of custodians and want to be sure to de-dup globally before the review begins. When de-duping happens, does Disco capture all the de-duped custodians in a field so that, when it comes time to produce, we can provide that information to opposing counsel?


Answer: Deduplication works as follows:

If an incoming email or document matches an existing one down to the byte level, including its attachments (i.e. the two have an identical "hash" value), DISCO deduplicates them into a single document/email and notes the multiple instances in the metadata record.

During ingest, custodian information does not affect deduplication, so identical items are deduplicated globally across custodians. We do, however, capture each custodian of a deduplicated item as metadata.
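To make the idea concrete, here is a minimal sketch of hash-based global deduplication that keeps a custodian list per unique item. It is an illustration only, not DISCO's implementation: the choice of SHA-256, the function names `family_hash` and `deduplicate`, and the record layout are all assumptions made for this example.

```python
import hashlib


def family_hash(email_bytes, attachments):
    """Hash an email together with its attachments (hypothetical scheme)."""
    h = hashlib.sha256()
    # Hash each part separately first so part boundaries are unambiguous.
    h.update(hashlib.sha256(email_bytes).digest())
    for attachment in attachments:
        h.update(hashlib.sha256(attachment).digest())
    return h.hexdigest()


def deduplicate(items):
    """items: iterable of (custodian, email_bytes, attachments) tuples.

    The custodian is not part of the hash, so identical items collapse
    into one record regardless of who they were collected from.
    """
    unique = {}
    for custodian, email_bytes, attachments in items:
        key = family_hash(email_bytes, attachments)
        record = unique.setdefault(
            key,
            {"email": email_bytes, "attachments": attachments, "custodians": []},
        )
        if custodian not in record["custodians"]:
            record["custodians"].append(custodian)
    return unique


collected = [
    ("alice", b"Subject: Q3 numbers", [b"report.pdf bytes"]),
    ("bob", b"Subject: Q3 numbers", [b"report.pdf bytes"]),
    ("carol", b"Subject: Q3 numbers", [b"a different attachment"]),
]
unique = deduplicate(collected)
print([record["custodians"] for record in unique.values()])
# → [['alice', 'bob'], ['carol']]
```

Alice's and Bob's copies collapse into one record that lists both custodians, while Carol's copy stays separate because its attachment differs.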

As you review, you can view all custodians associated with a deduplicated document (i.e. all of its instances) in the document viewer or in search results.

When producing, you can produce a single instance of a duplicate document or more than one instance, depending on the deduplication level you choose for the production.

Because the "hash" covers an email together with its attachments, two identical attachments on two different emails are both kept in DISCO (so you can review and tag them differently, if necessary, based on the "parent" emails to which they are attached). They are not treated as duplicates, because their different "parent" emails give them different "hash" values.
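The attachment behavior can be sketched as follows. This is a hedged illustration of the principle rather than DISCO's actual algorithm: the helper `attachment_key` is hypothetical, and combining a SHA-256 digest of the parent email with one of the attachment is just one assumed way to make an attachment's identity depend on its parent.

```python
import hashlib


def attachment_key(parent_email: bytes, attachment: bytes) -> str:
    """Identity of an attachment in the context of its parent (hypothetical)."""
    h = hashlib.sha256()
    h.update(hashlib.sha256(parent_email).digest())
    h.update(hashlib.sha256(attachment).digest())
    return h.hexdigest()


report = b"identical attachment bytes"

key_a = attachment_key(b"Email A: please see attached", report)
key_b = attachment_key(b"Email B: forwarding the report", report)
key_a_copy = attachment_key(b"Email A: please see attached", report)

# Same attachment, different parents: not duplicates.
print(key_a == key_b)       # → False
# Same attachment, same parent: duplicates.
print(key_a == key_a_copy)  # → True
```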

