Invention Grant
- Patent Title: Data extraction method, computer program product and system
- Patent Title (中): 数据提取方法,计算机程序产品和系统
- Application No.: US13258480
- Application Date: 2009-11-25
- Publication No.: US08667015B2
- Publication Date: 2014-03-04
- Inventor: Li-Mei Jiao , Yuhong Xiong
- Applicant: Li-Mei Jiao , Yuhong Xiong
- Applicant Address: US TX Houston
- Assignee: Hewlett-Packard Development Company, L.P.
- Current Assignee: Hewlett-Packard Development Company, L.P.
- Current Assignee Address: US TX Houston
- International Application: PCT/CN2009/075117 (WO, 2009-11-25)
- International Announcement: WO2011/063561 (WO, 2011-06-03)
- Main IPC: G06F17/30
- IPC: G06F17/30

Abstract:
Disclosed is a method of automatically extracting data from a target web page, comprising selecting (302) data in a source web page; determining (304) the respective DOM (document object model) trees of the source and target web page, and identifying the one or more nodes comprising the selected data in the source web page DOM tree; determining (306) matching paths in the respective DOM trees; for selected data in a node of an unmatched branch of the source web page DOM tree, identifying (308) the nearest matched path in the source web page; identifying (310) the unmatched branch nearest to the corresponding matched path in the target web page; determining (312) if said identified unmatched branch in the target web page DOM tree comprises a target node matching the selected data node; and if so: extracting (322) data from the target node if the mismatch between the respective unmatched branches does not exceed a predefined threshold. A computer program product and system implementing this method are also disclosed.
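The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Node` class, the use of root-to-leaf tag paths, the longest-common-prefix notion of "nearest", the `difflib`-based mismatch score, and the default threshold of 0.4 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    """Hypothetical, simplified DOM node: a tag, optional text, and children."""
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def leaf_paths(node, prefix=()):
    """Yield (tag_path, leaf_node) for every leaf below `node`."""
    path = prefix + (node.tag,)
    if not node.children:
        yield path, node
        return
    for child in node.children:
        yield from leaf_paths(child, path)


def common_prefix_len(a, b):
    """Number of leading tags two paths share (assumed measure of nearness)."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def mismatch(a, b):
    """Dissimilarity of two tag paths: 0.0 means identical (assumed metric)."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def extract(source, target, selected_path, threshold=0.4):
    """Return the target text corresponding to `selected_path`, or None."""
    src = dict(leaf_paths(source))
    tgt = dict(leaf_paths(target))
    if selected_path in tgt:  # the selected path matches directly
        return tgt[selected_path].text
    matched = [p for p in src if p in tgt]
    unmatched = [p for p in tgt if p not in src]
    if not matched or not unmatched:
        return None
    # Nearest matched path to the selected node in the source tree.
    anchor = max(matched, key=lambda p: common_prefix_len(p, selected_path))
    # Unmatched target branch nearest to that same matched path.
    candidate = max(unmatched, key=lambda p: common_prefix_len(p, anchor))
    # Extract only if the two unmatched branches are similar enough.
    if mismatch(selected_path, candidate) <= threshold:
        return tgt[candidate].text
    return None


# Two pages with the same layout, except <span> became <p> in the target.
source = Node("html", children=[Node("body", children=[Node("div", children=[
    Node("h1", "Title A"),
    Node("span", children=[Node("b", "Price 10")]),
])])])
target = Node("html", children=[Node("body", children=[Node("div", children=[
    Node("h1", "Title B"),
    Node("p", children=[Node("b", "Price 20")]),
])])])

selected = ("html", "body", "div", "span", "b")
print(extract(source, target, selected))  # prints "Price 20"
```

In this example the `h1` path matches in both trees and serves as the anchor. The selected `span > b` branch has no exact counterpart, so the nearest unmatched target branch `p > b` is compared against it and accepted because its mismatch score of about 0.2 is below the threshold. A stricter threshold such as 0.1 would reject it and return `None`.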
Public/Granted literature
- US20120059859A1 Data Extraction Method, Computer Program Product and System (Public/Granted day: 2012-03-08)