Invention Grant
US09275129B2 Methods and systems to efficiently find similar and near-duplicate emails and files
In force
- Patent Title: Methods and systems to efficiently find similar and near-duplicate emails and files
- Application No.: US13028841
- Application Date: 2011-02-16
- Publication No.: US09275129B2
- Publication Date: 2016-03-01
- Inventor: Malay Desai, Medha Shewale, Venkat Rangan
- Applicant: Malay Desai, Medha Shewale, Venkat Rangan
- Applicant Address: US CA Mountain View
- Assignee: Symantec Corporation
- Current Assignee: Symantec Corporation
- Current Assignee Address: US CA Mountain View
- Agency: Wilmer Cutler Pickering Hale and Dorr LLP
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
A set of trigrams can be generated for each document in a plurality of documents processed by an e-discovery system. Each trigram in the set of trigrams for a given document is a sequence of three terms in the given document. A set of trigrams for each similar document is then determined based on the set of trigrams for the original document. To facilitate identification of similar documents, a full-text index is generated for the plurality of documents, and the set of trigrams for each document is indexed into the full-text index as individual terms. Queries against the full-text index can be generated from the trigrams of a document to find other similar or near-duplicate documents. After a set of potentially similar documents is identified, a separate distance criterion can be applied to efficiently evaluate the level of similarity between two documents.
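The flow described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the function names (`trigrams`, `build_index`, `find_similar`), the whitespace tokenization, the in-memory dictionary standing in for a full-text index, and the use of Jaccard distance as the "separate distance criterion" are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of three-term sequences in text.

    Assumption: terms are lowercase, whitespace-separated tokens.
    """
    terms = text.lower().split()
    return {" ".join(terms[i:i + 3]) for i in range(len(terms) - 2)}


def build_index(docs):
    """Map each trigram, treated as one index term, to the ids of documents containing it.

    Assumption: a plain dict stands in for a real full-text index.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def find_similar(doc_id, docs, index, max_distance=0.5):
    """Query the index with one document's trigrams, then filter by distance.

    Assumption: Jaccard distance over trigram sets is the distance criterion,
    and max_distance is a hypothetical threshold.
    """
    source = trigrams(docs[doc_id])

    # Step 1: any document sharing at least one trigram is a candidate.
    candidates = set()
    for gram in source:
        candidates |= index.get(gram, set())
    candidates.discard(doc_id)

    # Step 2: apply the distance criterion to each candidate.
    results = []
    for other in candidates:
        target = trigrams(docs[other])
        distance = 1.0 - len(source & target) / len(source | target)
        if distance <= max_distance:
            results.append((other, distance))
    return sorted(results, key=lambda pair: pair[1])


docs = {
    "a": "please review the attached contract today",
    "b": "please review the attached contract tomorrow",
    "c": "lunch is at noon on friday",
}
index = build_index(docs)
similar_to_a = find_similar("a", docs, index)
```

Here documents `a` and `b` share three of five distinct trigrams, so `b` is returned as a near-duplicate of `a`, while `c` shares no trigram and never becomes a candidate. The index lookup narrows the candidate set cheaply, so the more expensive pairwise distance check runs only on documents that already overlap.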
Public/Granted literature
- US20120209853A1 METHODS AND SYSTEMS TO EFFICIENTLY FIND SIMILAR AND NEAR-DUPLICATE EMAILS AND FILES Public/Granted day: 2012-08-16