Token matching in large document corpora

Invention Grant

US10248646B1 Token matching in large document corpora (Active)

Patent Title: Token matching in large document corpora
Application No.: US16108497

Application Date: 2018-08-22
Publication No.: US10248646B1

Publication Date: 2019-04-02
Inventor: Guy Leibovitz
Applicant: COGNIGO RESEARCH LTD.
Applicant Address: IL Tel Aviv
Assignee: COGNIGO RESEARCH LTD.
Current Assignee: COGNIGO RESEARCH LTD.
Current Assignee Address: IL Tel Aviv
Agency: The Roy Gross Law Firm, LLC
Agent: Roy Gross
Main IPC: G06F17/18
IPC: G06F17/18 ; G06F17/27 ; G06F17/30 ; G06F16/31 ; G06F16/33

Abstract:

A method comprising receiving a dictionary comprising a plurality of entities, wherein each entity has a length of between 1 and n tokens; constructing a probabilistic data representation model comprising n Bloom filter (BF) pairs indexed from 1 to n; populating said probabilistic data representation model with a data representation of said entities, wherein, with respect to each BF pair indexed i: (i) a first BF is populated with the first i tokens of all said entities having at least i+1 tokens, and (ii) a second BF is populated with all said entities having exactly i tokens; receiving a text corpus, wherein said text corpus is segmented into tokens; and automatically matching each token in said text corpus against said populated probabilistic data representation model, wherein said matching comprises sequentially querying each said BF pair in the order of said indexing, to determine a match.

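IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: G06N)
G06F17/00	Digital computing or data processing equipment or methods, specially adapted for specific functions (information retrieval, database structures or file system structures: G06F 16/00)
G06F17/10	. Complex mathematical operations
G06F17/18	.. for evaluating statistical data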