Invention Grant
- Patent Title: Methods and apparatus for removing a duplicated web page
- Application No.: US15582322
- Application Date: 2017-04-28
- Publication No.: US10691769B2
- Publication Date: 2020-06-23
- Inventor: Xiaopeng Tang
- Applicant: ALIBABA GROUP HOLDING LIMITED
- Applicant Address: KY Grand Cayman
- Assignee: ALIBABA GROUP HOLDING LIMITED
- Current Assignee: ALIBABA GROUP HOLDING LIMITED
- Current Assignee Address: KY Grand Cayman
- Agency: Finnegan, Henderson, Farabow, Garrett & Dunner, LLP
- Main IPC: G06F16/958
- IPC: G06F16/958 ; G06F16/951 ; G06F16/22 ; G06F16/23 ; G06F16/00 ; H04L29/08

Abstract:
Methods and apparatuses are disclosed for removing a duplicated web page. An exemplary method may include acquiring a plurality of web pages of a predetermined type and, for each web page, extracting a feature code of the current web page and a number of text characters contained in the current web page. The method may also include looking up a data table to determine whether the feature code is contained in the data table. If the feature code is contained in the data table, the method may further include reading the number of text characters of the web page in the data table corresponding to the feature code, and discarding the current web page when a difference between the read number of text characters and the extracted number of text characters is within a range.
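
The lookup-and-discard flow described in the abstract can be sketched as follows. This is a minimal illustration only, not the claimed implementation. The abstract does not say how the feature code is computed, how wide the range is, or what happens when the code is absent from the table or the difference falls outside the range. The names `Page`, `feature_code`, `deduplicate`, `tolerance`, and `prefix_len` are hypothetical, and the sketch assumes the feature code is a SHA-1 hash of the first characters of the whitespace-stripped text.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str


def feature_code(normalized: str, prefix_len: int = 32) -> str:
    # Assumption: the feature code is a hash of a fixed-length text prefix.
    # The abstract does not define how the feature code is derived.
    return hashlib.sha1(normalized[:prefix_len].encode("utf-8")).hexdigest()


def deduplicate(pages, tolerance: int = 10, prefix_len: int = 32):
    # Data table: feature code -> character count of the page kept for it.
    table = {}
    kept = []
    for page in pages:
        # Strip all whitespace so only text characters are counted.
        normalized = "".join(page.text.split())
        code = feature_code(normalized, prefix_len)
        count = len(normalized)
        if code in table and abs(table[code] - count) <= tolerance:
            # Same feature code and a character-count difference within
            # the range: treat the current page as a duplicate and discard it.
            continue
        # Assumption: pages with an unseen feature code are recorded, and
        # pages whose count differs too much are kept without changing the
        # existing table entry.
        table.setdefault(code, count)
        kept.append(page)
    return kept


base = "Breaking news: the quick brown fox jumps over the lazy dog today."
pages = [
    Page("a", base),
    Page("b", base + " Share"),        # near-identical copy
    Page("c", base + " " + "x" * 50),  # same opening, much longer body
    Page("d", "Completely different article text about something else entirely."),
]
unique_urls = [p.url for p in deduplicate(pages)]
print(unique_urls)
```

In this sketch, page `b` is dropped because it shares a feature code with `a` and differs by only a few characters, while `c` is kept because its character count falls outside the assumed tolerance.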
Public/Granted literature
- US20170235746A1 METHODS AND APPARATUS FOR REMOVING A DUPLICATED WEB PAGE Public/Granted day: 2017-08-17