Searching multilingual documents based on document structure extraction

Invention Grant

US11222053B2 Searching multilingual documents based on document structure extraction (Active)

Patent Title: Searching multilingual documents based on document structure extraction
Application No.: US16866646

Application Date: 2020-05-05
Publication No.: US11222053B2

Publication Date: 2022-01-11
Inventor: Xin Tang, Kun Yan Yin, He Li, Xueliang Zhao, Xin Xu
Applicant: INTERNATIONAL BUSINESS MACHINES CORPORATION
Applicant Address: US NY Armonk
Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Current Assignee: INTERNATIONAL BUSINESS MACHINES CORPORATION
Current Assignee Address: US NY Armonk
Agency: Schmeiser, Olsen & Watts
Agent: Nicholas L. Cadmus
Main IPC: G06F16/33
IPC: G06F16/33; G06F16/93; G06F16/338; G06F16/35; G06F16/9535; G06F40/40; G06F40/58

Abstract:

An approach is provided for searching multilingual documents. A first classification is determined that includes a first document and other document(s) by minimizing a first distance between a first numerical fixed length vector for the first document and other numerical fixed length vector(s) for other document(s). Based on a query and a natural language detected in the query, a second document is selected. A second stream modeling the second document is encoded as a second numerical fixed length vector. Based on a distance between the first and second numerical fixed length vectors being less than a threshold, the first classification is identified as including the second document. Documents in the first classification are ranked and presented as having content matching the second document's content. At least one of the ranked documents is expressed in a natural language different from the natural language of the second document.
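
The threshold test and ranking described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The Euclidean distance metric, the use of the first stored document as each classification's reference vector, the ranking by distance to the query document, and all names and sample vectors (`classify_and_rank`, `classes`, `doc_en`, and so on) are assumptions made for illustration; the abstract does not specify them.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def classify_and_rank(query_vec, classes, threshold):
    """Return (label, ranked documents) for the first classification whose
    reference vector lies within `threshold` of `query_vec`.

    `classes` maps a label to a list of (doc_id, language, vector) tuples.
    Assumption: the first tuple in each list acts as the "first document".
    """
    for label, docs in classes.items():
        _, _, reference_vec = docs[0]
        if dist(reference_vec, query_vec) < threshold:
            # Assumption: rank members by closeness to the query document.
            ranked = sorted(docs, key=lambda d: dist(d[2], query_vec))
            return label, [(doc_id, lang) for doc_id, lang, _ in ranked]
    return None, []


# Hypothetical fixed-length vectors for documents in several languages.
classes = {
    "contracts": [("doc_en", "en", [0.1, 0.9]), ("doc_zh", "zh", [0.2, 0.8])],
    "manuals": [("man_de", "de", [0.9, 0.1])],
}

label, ranked = classify_and_rank([0.18, 0.82], classes, threshold=0.2)
print(label, ranked)
```

In this sketch the ranked result mixes English and Chinese documents, mirroring the abstract's statement that at least one ranked document is in a natural language different from that of the second document.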

Public/Granted literature

US20200265074A1 SEARCHING MULTILINGUAL DOCUMENTS BASED ON DOCUMENT STRUCTURE EXTRACTION Public/Granted day: 2020-08-20

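IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models, see G06N)
G06F16/00	Information retrieval; database structures; file system structures
G06F16/30	. Unstructured textual data (document management systems, see G06F 16/93)
G06F16/33	.. Querying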