Production deduplication

To understand DISCO’s production deduplication options, you must first understand how DISCO deduplicates documents upon ingest. As part of our standard processing, DISCO deduplicates data across all custodians and data sources while maintaining complete families. This process results in a single record for each member of a family or stand-alone document. All instances of duplicate data are maintained in DISCO’s data storage, but only one record of each document is displayed for review. Furthermore, upon production you can choose to produce one instance of each duplicate record, one instance of each duplicate per custodian, or all instances of each duplicate.

Once your documents have been ingested into DISCO, you will be able to review each document (or record) and categorize each by applying tags or placing them into folders.  Once you have completed the review and categorization of your documents, you will be ready to create a production.   

On the New Production page, in the Advanced options section, you will be able to choose document and load file formats, volume labels and breaks, and how your documents will be sorted within your production. Furthermore, you can choose to produce as native by file type or tag, create custom slipsheets along with slipsheet rules, and include a native file for each document (unless redacted). While some options, such as producing natively with slipsheets, will impact the overall page count of your production, it is the production deduplication level that will impact the number of documents that are produced. 


DISCO offers three levels of deduplication:

  • Global deduplication by family (Default) – Produces each duplicated family in the production one time.
  • Custodian-level deduplication by family – Produces a separate copy for each custodian associated with a duplicate family.
  • Full reduplication – Produces documents as they were ingested into DISCO, prior to deduplication.

To further understand how production deduplication levels work, we will walk through an example scenario. You are working on a case in which you collect the mailboxes of four custodians (Ann, Bob, Charles, and Danielle) from your client’s email server. In addition to the emails found on the server, Charles has a local copy of his email, which is also collected. All five mailboxes are ingested into DISCO and globally deduplicated so that your team only reviews one record of each document.  

It is important to note that when families are processed by DISCO, each family member (or document) becomes its own individual record in DISCO. For instance, an email with two attachments results in a total of three unique records: one for the email and one for each attachment. This gives you the flexibility to produce, slipsheet, or withhold any individual family member as needed.

Now, during your review you find that each of the four custodians is a recipient of an email that has two attachments. Again, while this email was contained in each of the five mailboxes collected and was therefore ingested five times, your team will only review it once. Let’s say you have decided to produce the email and both attachments. Here are the production results based on deduplication level:

  • Global deduplication by family – DISCO will produce the email and two attachments, for a total of three documents (exactly how it is displayed for review).
  • Custodian-level deduplication by family – DISCO will produce the family four times, once for each custodian, for a total of twelve documents.
  • Full reduplication – DISCO will produce the family five times, replicating what was ingested, for a total of fifteen documents.
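
The arithmetic behind these three results can be sketched in a few lines of Python. This is only an illustration of the counting logic described above, not DISCO's implementation: the `Instance` class, the level names "global", "custodian", and "full", and the `produced_document_count` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One ingested copy of the family (hypothetical model for illustration)."""
    custodian: str
    source: str


# The email plus its two attachments: three records per family.
FAMILY_SIZE = 3

# Five ingested copies: four server mailboxes plus Charles's local copy.
instances = [
    Instance("Ann", "server"),
    Instance("Bob", "server"),
    Instance("Charles", "server"),
    Instance("Danielle", "server"),
    Instance("Charles", "local copy"),
]


def produced_document_count(instances, family_size, level):
    """Return how many documents a production contains at a given level."""
    if level == "global":
        copies = 1  # one copy of the family, as displayed for review
    elif level == "custodian":
        copies = len({i.custodian for i in instances})  # one copy per custodian
    elif level == "full":
        copies = len(instances)  # one copy per ingested instance
    else:
        raise ValueError(f"unknown deduplication level: {level}")
    return copies * family_size


for level in ("global", "custodian", "full"):
    print(level, produced_document_count(instances, FAMILY_SIZE, level))
```

Charles's two mailboxes count once at the custodian level and twice under full reduplication, which is why those two levels differ by one family (three documents) in this scenario.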

It is important to note that regardless of how many instances of a document you decide to produce, you can choose to include all the metadata associated with the various instances of that document. When creating a custom load file, you can choose to include the following fields of instance metadata:

  • Custodian – Name of the custodian of the first instance of a document or selected custodian
  • DupCustodian – Name(s) of the custodians of all of the duplicate instances
  • AllCustodians – Names of all the custodians for all the instances
  • FileName – File name of the first instance of a document or selected custodian
  • DupFileName – File name(s) for all of the duplicate instances
  • AllFileNames – File names for all instances
  • Path – File path for the first instance of a document or selected custodian
  • DupPath – File path(s) for all of the duplicate instances
  • AllFilePaths – File paths for all instances
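
To make the relationship between the three custodian fields concrete, here is a rough Python sketch that derives them from a list of instances. It rests on several assumptions that the article does not confirm: that "duplicate instances" means every instance after the first, that repeated names are listed once, and that multiple values are joined with "; ". The `custodian_fields` function and the input list are hypothetical and are not part of DISCO.

```python
def unique_in_order(values):
    """Drop repeated values while keeping first-seen order."""
    return list(dict.fromkeys(values))


def custodian_fields(custodians, delimiter="; "):
    """Build the three custodian fields from an ordered list of instance custodians.

    Assumption: the first entry is the "first instance" and all later
    entries are its duplicates. The delimiter is also an assumption.
    """
    first, duplicates = custodians[0], custodians[1:]
    return {
        "Custodian": first,
        "DupCustodian": delimiter.join(unique_in_order(duplicates)),
        "AllCustodians": delimiter.join(unique_in_order(custodians)),
    }


# Custodians of the five ingested instances from the example scenario.
fields = custodian_fields(["Ann", "Bob", "Charles", "Danielle", "Charles"])
for name, value in fields.items():
    print(f"{name}: {value}")
```

The same first/duplicates/all split would apply to the file name and file path fields, with the instance's file name or path substituted for the custodian name.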

