CN108205544A - Web page content recognition method, device, and server
- Publication number
- CN108205544A (application CN201611170430.3A)
- Authority
- CN
- China
- Prior art keywords
- web page
- page contents
- webpage
- content
- visual feature
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a web page content recognition method. The method includes: determining at least one training website and collecting multiple training web pages from each training website; obtaining the visual features of the block corresponding to the selected content in each training web page; performing data processing on the visual features to obtain feature vectors; and using a training tool to build a recognition model of the selected content from the feature vectors. The invention also provides a web page content recognition device and a server. The method, device, and server of the invention convert the visual features of web page blocks into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can improve the efficiency and accuracy of web page content recognition.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a web page content recognition method, device, and server.
Background
At present, with the rapid development of the Internet, the amount of information on the network has increased sharply, and the demand for recognizing web page content has become increasingly urgent.
Existing web page content recognition methods work on the visual features of web page content and derive recognition rules for that content by sample statistics. These methods require the recognition rules to be adjusted continually through feedback, so training takes a long time. As a result, recognition efficiency is low and accuracy is poor.
Summary of the invention
In view of this, the present invention provides a web page content recognition method, device, and server that can improve the efficiency and accuracy of web page content recognition.
An embodiment of the present invention provides a web page content recognition method, which includes: determining at least one training website, and collecting multiple training web pages from each training website; obtaining the visual features of the block corresponding to the selected content in each training web page; performing data processing on the visual features to obtain feature vectors; and using a training tool to build a recognition model of the selected content from the feature vectors.
The present invention also provides a web page content recognition device, which includes a data acquisition module, a visual feature acquisition module, a data processing module, and a model building module. The data acquisition module is used to determine at least one training website and to collect multiple training web pages from each training website. The visual feature acquisition module is used to obtain the visual features corresponding to the selected content in each training web page. The data processing module is used to perform data processing on the visual features to obtain feature vectors. The model building module is used to build a recognition model of the selected content from the feature vectors using a training tool.
The present invention also provides a server that includes a web page content recognition device. The web page content recognition device includes a data acquisition module, a visual feature acquisition module, a data processing module, and a model building module. The data acquisition module is used to determine at least one training website and to collect multiple training web pages from each training website. The visual feature acquisition module is used to obtain the visual features corresponding to the selected content in each training web page. The data processing module is used to perform data processing on the visual features to obtain feature vectors. The model building module is used to build a recognition model of the selected content from the feature vectors using a training tool.
The web page content recognition method, device, and server of the present invention convert the visual features of web page blocks into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can improve the efficiency and accuracy of web page content recognition.
To make the above and other objects, features, and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of the drawings
Fig. 1 is a structural diagram of a server;
Fig. 2 is a flowchart of the web page content recognition method of the first embodiment of the present invention;
Fig. 3 is a flowchart of the web page content recognition method of the second embodiment of the present invention;
Fig. 4 is a schematic diagram of the interface used in the web page content recognition method shown in Fig. 3;
Fig. 5 is a flowchart of the web page content recognition method of the third embodiment of the present invention;
Fig. 6 is a structural diagram of the web page content recognition device of the fourth embodiment of the present invention;
Fig. 7 is a structural diagram of the web page content recognition device of the fifth embodiment of the present invention;
Fig. 8 is a structural diagram of the server of the sixth embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
The web page content recognition method provided by the embodiments of the present invention can be applied to the server shown in Fig. 1. As shown in Fig. 1, the server includes a memory 101, a processor 102, and a network module 103.
It should be understood that the structure shown in Fig. 1 is only illustrative. The server may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Each component shown in Fig. 1 may be implemented in hardware, software, or a combination of the two. In addition, the server in the embodiments of the present invention may comprise multiple servers with different specific functions.
The memory 101 can be used to store software programs and modules, such as the program instructions/modules corresponding to the web page content recognition method and system in the embodiments of the present invention. The processor 102 runs the software programs and modules stored in the memory 101 to execute various functional applications and data processing, thereby implementing the web page content recognition method and system of the embodiments. The memory 101 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 101 may further include memory located remotely from the processor 102, and this remote memory can be connected to the server through a network. Further, the software programs and modules may include an operating system 121 and a service module 122. The operating system 121, which may be, for example, LINUX, UNIX, or WINDOWS, may include various software components and/or drivers for managing system tasks (such as memory management, storage device control, and power management) and can communicate with various hardware or software components, thereby providing a running environment for other software components. The service module 122 runs on top of the operating system 121, listens for requests from the network through the network services of the operating system 121, performs the corresponding data processing according to each request, and returns the result to the terminal. In other words, the service module 122 provides network services to terminals.
The network module 103 is used to receive and transmit network signals. The network signals may include wireless signals or wired signals. In one example, the network signal is a wired network signal. In that case, the network module 103 may include elements such as a processor, random access memory, a converter, and a crystal oscillator.
First embodiment
Fig. 2 is a flowchart of the web page content recognition method provided by the first embodiment of the present invention. In this embodiment, the method is executed by a server over a network. As shown in Fig. 2, the web page content recognition method of this embodiment may include the following steps:
Step S21: Determine at least one training website, and collect multiple training web pages from each training website;
Specifically, the number of training web pages collected from each training website may be, but is not limited to being, determined by the popularity of the training website: the more popular the website, the more training web pages are collected from it. This lets the training tool learn the visual features of content on heavily visited web pages, which in turn increases the accuracy of web page recognition.
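As a minimal sketch of step S21, the number of training web pages collected per website could be made proportional to a popularity score. The function name `allocate_page_quota`, the example site names, the scores, and the proportional rule itself are illustrative assumptions; the text only requires that more popular websites contribute more training web pages.

```python
def allocate_page_quota(popularity, total_pages, min_pages=1):
    """Split a page budget across training websites in proportion to popularity.

    `popularity` maps a site name to a non-negative score (hypothetical,
    for example daily visits). Every site receives at least `min_pages`.
    """
    total_score = sum(popularity.values())
    if total_score <= 0:
        raise ValueError("popularity scores must sum to a positive value")
    quota = {}
    for site, score in popularity.items():
        share = int(total_pages * score / total_score)  # floor of the proportional share
        quota[site] = max(min_pages, share)
    return quota


# Hypothetical sites and scores, used only to illustrate the allocation.
quota = allocate_page_quota(
    {"shop-a.example": 600, "shop-b.example": 300, "shop-c.example": 100},
    total_pages=1000,
)
```

Because each share is rounded down and then raised to `min_pages` if necessary, the quotas may not sum exactly to `total_pages`; a real collector would decide how to handle the remainder.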
Step S22: Obtain the visual features of the block corresponding to the selected content in each training web page;
Specifically, the visual features of a block are the main features that characterize the web page block at the visual level, and may include, but are not limited to, the length of the block, the font size of the block, and the web page tag.
Step S23: Perform data processing on the visual features to obtain feature vectors;
To obtain feature vectors that the training tool can recognize, the visual features must be processed. Specifically, if the visual features include numeric features, each numeric feature occupies one position in the vector. In particular, the values of each numeric feature are collected and divided into several equal parts, for example 10, which are mapped respectively to the 10 intervals 0~0.1, 0.1~0.2, 0.2~0.3, 0.3~0.4, 0.4~0.5, 0.5~0.6, 0.6~0.7, 0.7~0.8, 0.8~0.9, and 0.9~1.0.
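The binning of a numeric feature can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: "equal parts" is read here as equal-width bins between the observed minimum and maximum, each value is represented by the lower bound of its interval, and the function name `bin_numeric_feature` and the sample font sizes are hypothetical.

```python
def bin_numeric_feature(values, n_bins=10):
    """Map each raw value to the lower bound of its equal-width bin in [0, 1)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    binned = []
    for v in values:
        if span == 0:
            index = 0  # all values identical: everything falls in the first bin
        else:
            # The maximum value would land in bin n_bins, so clamp it to the last bin.
            index = min(int((v - lo) * n_bins / span), n_bins - 1)
        binned.append(index / n_bins)
    return binned


font_sizes = [12, 14, 16, 24, 32]  # hypothetical block font sizes in pixels
binned_sizes = bin_numeric_feature(font_sizes)
```

An equal-frequency reading (each bin holding the same number of samples) would also be consistent with the text; it would only change how the bin index is computed.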
If the visual features include non-numeric features, the non-numeric features are represented in lateral one-hot representation form. One-hot representation is one of the simplest word vector representations: each word is represented by a long vector whose length is the size of the dictionary, in which exactly one component is "1" and all others are "0". The position of the "1" corresponds to the position of the word in the dictionary.
Step S24: Use a training tool to build a recognition model of the selected content from the feature vectors.
Specifically, the training tool may be, but is not limited to, a gradient boosting decision tree (GBDT) training tool. Other machine learning training tools, such as a linear regression training tool, may also be used.
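To make the idea of a GBDT training tool concrete, the following toy sketch fits depth-1 regression trees to the residuals of a squared-error loss. It is not the training tool referred to in the text; a production system would normally use an established GBDT library. The feature layout, the labels, and the function names `fit_stump`, `train_gbdt`, and `predict` are assumptions made for illustration.

```python
def fit_stump(rows, residuals):
    """Return (feature, threshold, left_value, right_value) minimizing squared error."""
    best = None
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left = [r for row, r in zip(rows, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[feature] > threshold]
            if not left or not right:
                continue  # this split does not separate anything
            left_value = sum(left) / len(left)
            right_value = sum(right) / len(right)
            error = (sum((r - left_value) ** 2 for r in left)
                     + sum((r - right_value) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_value, right_value)
    return None if best is None else best[1:]


def train_gbdt(rows, labels, n_rounds=20, learning_rate=0.5):
    """Fit one depth-1 tree per round to the current residuals."""
    base = sum(labels) / len(labels)
    predictions = [base] * len(rows)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is simply label - prediction.
        residuals = [y - p for y, p in zip(labels, predictions)]
        stump = fit_stump(rows, residuals)
        if stump is None:
            break
        stumps.append(stump)
        feature, threshold, left_value, right_value = stump
        predictions = [
            p + learning_rate * (left_value if row[feature] <= threshold else right_value)
            for row, p in zip(rows, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, row):
    """Sum the base value and the scaled outputs of all trees for one vector."""
    base, learning_rate, stumps = model
    score = base
    for feature, threshold, left_value, right_value in stumps:
        score += learning_rate * (left_value if row[feature] <= threshold else right_value)
    return score


# Hypothetical vectors: [binned font size, tag is h1, tag is p]; label 1 = title block.
rows = [[0.9, 1, 0], [0.8, 1, 0], [0.9, 1, 0], [0.2, 0, 1], [0.1, 0, 1], [0.3, 0, 1]]
labels = [1, 1, 1, 0, 0, 0]
title_model = train_gbdt(rows, labels)
title_score = predict(title_model, [0.85, 1, 0])
body_score = predict(title_model, [0.15, 0, 1])
```

On this separable toy data the score for a title-like vector approaches 1 and the score for a body-like vector approaches 0 as rounds accumulate.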
Specifically, building the recognition model of the selected content from the feature vectors means establishing the correspondence between the feature vectors of a web page and web page content such as the title or price.
The web page content recognition method of the present invention converts the visual features of web page blocks into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can improve the efficiency and accuracy of web page content recognition.
Second embodiment
Fig. 3 is a flowchart of the web page content recognition method of the second embodiment of the present invention. Fig. 4 is a schematic diagram of the interface used in the web page content recognition method shown in Fig. 3. Referring to Fig. 3 and Fig. 4 together, the web page content recognition method includes:
Step S221: Select the content to be labeled in the training web page;
As shown in Fig. 4, the content to be labeled in the training web page, such as the title 40, can be selected manually.
Step S222: Parse the XPath of the content to be labeled;
Specifically, when the preview-XPath button 41 receives a trigger signal, the labeling program parses the XPath of the selected content and displays it in the XPath display area 42. Alternatively, the labeling program may be triggered automatically and, after parsing the XPath, send it directly to the backend.
Specifically, when multiple kinds of content need to be labeled, the attribute of the content, for example "title", must be entered in the attribute input area 43, and the attribute of the content is stored together with its XPath.
Step S223: Look up the visual features of the block corresponding to the selected content according to the XPath.
Specifically, because the XPath of each block in a web page is unique, all the visual features stored for the corresponding block after parsing can be found from the XPath of the content to be labeled.
In a specific implementation, webkit, as the kernel of a headless browser, can parse cascading style sheets (CSS) and render pages automatically. These functions of webkit can therefore be used to extract the visual information of each block. Feature engineering is then applied to the visual information to obtain the visual features, which are stored for later lookup.
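The store-then-look-up idea can be sketched with a dictionary keyed by XPath. The sketch below does not render anything: it uses Python's standard `xml.etree.ElementTree` on a small well-formed page, and hypothetical `data-font-size` and `data-width` attributes stand in for values that a headless browser would compute after rendering. The function name `index_blocks_by_xpath` and the positional XPath format are likewise assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the data-* attributes stand in for rendered visual information.
PAGE = (
    "<html><body>"
    '<div><h1 data-font-size="32" data-width="600">Example product</h1></div>'
    '<div><span data-font-size="18" data-width="80">19.90</span>'
    '<p data-font-size="14" data-width="600">Description</p></div>'
    "</body></html>"
)


def index_blocks_by_xpath(root):
    """Return {xpath: visual-feature dict} for every element under `root`."""
    store = {}

    def walk(element, path):
        store[path] = {
            "tag": element.tag,
            "font_size": int(element.get("data-font-size", 0)),
            "width": int(element.get("data-width", 0)),
        }
        counts = {}
        for child in element:
            # Number siblings per tag so that every path is unique.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, "/" + root.tag)
    return store


feature_store = index_blocks_by_xpath(ET.fromstring(PAGE))
title_features = feature_store["/html/body[1]/div[1]/h1[1]"]
```

Because every path is unique within the page, the XPath obtained during labeling is enough to retrieve the stored features of the labeled block.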
The visual features of the block corresponding to the selected content in each training web page may be obtained through steps S221, S222, and S223 above. Alternatively, the content to be labeled in the training web page, such as the title, may be selected and then parsed directly to obtain the visual features of the corresponding block.
The web page content recognition method of the present invention obtains the visual features of the block corresponding to the selected content according to the XPath of the selected content, and converts the visual features of web page blocks into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can further improve the efficiency and accuracy of web page content recognition.
Third embodiment
Fig. 5 is a flowchart of the web page content recognition method of the third embodiment of the present invention. In this embodiment, the method is executed by a server over a network. As shown in Fig. 5, the web page content recognition method of this embodiment may include the following steps:
Step S51: Determine at least one training website, and collect multiple training web pages from each training website;
Specifically, the number of training web pages collected from each training website may be, but is not limited to being, determined by the popularity of the training website: the more popular the website, the more training web pages are collected from it. This lets the training tool learn the visual features of content on heavily visited web pages, which in turn increases the accuracy of web page recognition.
Step S52: Obtain the visual features of the block corresponding to the selected content in each training web page;
Specifically, the visual features of a block are the main features that characterize the web page block at the visual level, and may include, but are not limited to, the length of the block, the font size of the block, and the web page tag.
Step S53: Perform data processing on the visual features to obtain feature vectors;
To obtain feature vectors that the training tool can recognize, the visual features must be processed. Specifically, if the visual features include numeric features, each numeric feature occupies one position in the vector. In particular, the values of each numeric feature are collected and divided into several equal parts, for example 10, which are mapped respectively to the 10 intervals 0~0.1, 0.1~0.2, 0.2~0.3, 0.3~0.4, 0.4~0.5, 0.5~0.6, 0.6~0.7, 0.7~0.8, 0.8~0.9, and 0.9~1.0.
If the visual features include non-numeric features, the non-numeric features are represented in lateral one-hot representation form. One-hot representation is one of the simplest word vector representations: each word is represented by a long vector whose length is the size of the dictionary, in which exactly one component is "1" and all others are "0". The position of the "1" corresponds to the position of the word in the dictionary.
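A short sketch of one-hot encoding for a non-numeric feature such as the web page tag is given below. The tag dictionary `TAG_VOCABULARY`, the function name `one_hot`, and the way the encoded tag is appended to a binned numeric value are assumptions chosen for illustration.

```python
def one_hot(value, vocabulary):
    """Return a vector of len(vocabulary) with a single 1 at the value's position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1  # raises ValueError for an unknown value
    return vector


TAG_VOCABULARY = ["div", "h1", "p", "span"]  # hypothetical dictionary of tags
encoded_tag = one_hot("h1", TAG_VOCABULARY)

# One possible feature vector: a binned numeric feature followed by the one-hot tag.
feature_vector = [0.6] + encoded_tag
```

A value that is missing from the dictionary raises an error here; a real pipeline would need a policy for unseen tags, such as a reserved "other" position.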
Step S54: Use a training tool to build a recognition model of the selected content from the feature vectors;
Specifically, the training tool may be, but is not limited to, a gradient boosting decision tree (GBDT) training tool. Other machine learning training tools, such as a linear regression training tool, may also be used.
Specifically, building the recognition model of the selected content from the feature vectors means establishing the correspondence between the feature vectors of a web page and web page content such as the title or price.
Step S55: Receive the feature identifier of a web page, and find the web page to be recognized according to the feature identifier;
The feature identifier may be a uniform resource locator (URL), a title, or the like, and is used to uniquely identify one web page.
In a specific implementation, a user may submit the feature identifier of the web page to be recognized to the server through a provided interactive interface, or another server, a business platform, or the like may submit it to the server. One feature identifier may be submitted to the server at a time, or the feature identifiers of multiple web pages may be submitted at once for batch processing. The server determines from the feature identifier which web page requires content recognition.
Step S56: Convert the visual features of all blocks of the web page to be recognized into feature vectors;
Step S57: Use the recognition model to identify, from the feature vectors of the web page to be recognized, the XPath of the corresponding content in that web page.
Specifically, if the recognition model covers multiple kinds of content, for example the relationships between the feature vectors of the title and the price and their XPaths, then the attribute of the desired content, such as "title", is input so that the recognition model identifies the XPath of the title.
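Steps S56 and S57 can be sketched as scoring every block of the page and returning the XPath of the best-scoring block for the requested attribute. The selection rule (take the highest score), the vector layout, and the two stand-in scoring functions are assumptions; in practice the scores would come from the trained recognition model.

```python
def recognize_xpath(block_vectors, models, attribute):
    """Return the XPath whose feature vector scores highest for `attribute`."""
    score = models[attribute]
    return max(block_vectors, key=lambda xpath: score(block_vectors[xpath]))


# Hypothetical layout: [binned font size, tag is h1, tag is span].
block_vectors = {
    "/html/body[1]/div[1]/h1[1]": [0.9, 1, 0],
    "/html/body[1]/div[2]/span[1]": [0.4, 0, 1],
    "/html/body[1]/div[2]/p[1]": [0.2, 0, 0],
}

# Stand-ins for trained per-attribute models; real scores would be learned.
models = {
    "title": lambda vector: vector[0] + vector[1],
    "price": lambda vector: vector[2],
}

title_xpath = recognize_xpath(block_vectors, models, "title")
price_xpath = recognize_xpath(block_vectors, models, "price")
```

Keeping one scoring function per attribute mirrors the text's use of an attribute such as "title" to select which kind of content the model should locate.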
Preferably, the web page content recognition method further includes:
Step S58: Extract the corresponding content of the web page to be recognized according to the XPath of that content in the web page to be recognized.
Specifically, the extracted content may be used, without limitation, as data for statistical analysis. For example, extracting the title and price from the web page to be recognized makes it possible to track the price trend of a commodity.
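Step S58 can be sketched with the limited XPath support of Python's standard `xml.etree.ElementTree`. That module does not accept absolute paths on an element, so the helper below (the hypothetical `extract_text`) strips the root step and searches relative to the root. The sample page and the positional XPath format are assumptions, and real web pages are often not well-formed XML, so a tolerant HTML parser would be needed in practice.

```python
import xml.etree.ElementTree as ET


def extract_text(root, xpath):
    """Return the text at an absolute positional XPath such as /html/body[1]/div[1]/h1[1]."""
    steps = xpath.strip("/").split("/")
    if steps[0].split("[")[0] != root.tag:
        raise ValueError("XPath does not start at the document root")
    if len(steps) == 1:
        node = root
    else:
        # ElementTree only supports relative paths, so search from the root element.
        node = root.find("./" + "/".join(steps[1:]))
    return None if node is None else (node.text or "").strip()


# Hypothetical well-formed page used only for illustration.
PAGE = (
    "<html><body>"
    "<div><h1>Example product</h1></div>"
    "<div><span>19.90</span><p>Description</p></div>"
    "</body></html>"
)
root = ET.fromstring(PAGE)
title = extract_text(root, "/html/body[1]/div[1]/h1[1]")
price = extract_text(root, "/html/body[1]/div[2]/span[1]")
```

Collecting such title and price pairs over time would give the kind of data the text mentions for statistical analysis.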
The web page content recognition method of the present invention divides the visual features of web page blocks into numeric and non-numeric features and converts each kind separately to generate feature vectors that a training tool can learn from. A content recognition model can thus be generated with the training tool and then used for content recognition, which can further improve the efficiency and accuracy of web page content recognition.
Fourth embodiment
Fig. 6 is a structural diagram of the web page content recognition device of the fourth embodiment of the present invention. As shown in Fig. 6, the web page content recognition device 60 includes a data acquisition module 601, a visual feature acquisition module 602, a data processing module 603, and a model building module 604.
The data acquisition module 601 is used to determine at least one training website and to collect multiple training web pages from each training website. The visual feature acquisition module 602 is used to obtain the visual features corresponding to the selected content in each training web page. The data processing module 603 is used to perform data processing on the visual features to obtain feature vectors. The model building module 604 is used to build a recognition model of the selected content from the feature vectors using a training tool.
Specifically, the data acquisition module 601 determines the number of training web pages collected from each training website according to the popularity of the training website.
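The four-module structure can be sketched as a small pipeline object. This is only one way to wire the modules together; the class name `RecognitionDevice`, the method `train`, and the placeholder callables are hypothetical, and each callable stands in for the corresponding module 601 to 604.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RecognitionDevice:
    collect_pages: Callable        # stands in for data acquisition module 601
    get_visual_features: Callable  # stands in for visual feature acquisition module 602
    to_feature_vector: Callable    # stands in for data processing module 603
    build_model: Callable          # stands in for model building module 604

    def train(self, sites):
        """Run the four modules in order and return the resulting model."""
        pages = self.collect_pages(sites)
        features = [self.get_visual_features(page) for page in pages]
        vectors = [self.to_feature_vector(item) for item in features]
        return self.build_model(vectors)


# Placeholder implementations, used only to show the data flow between modules.
device = RecognitionDevice(
    collect_pages=lambda sites: [f"{site}/page{i}" for site in sites for i in range(2)],
    get_visual_features=lambda page: {"font_size": len(page)},
    to_feature_vector=lambda features: [features["font_size"]],
    build_model=lambda vectors: {"n_samples": len(vectors)},
)
model = device.train(["a.example", "b.example"])
```

Passing the modules in as callables keeps each one replaceable, which matches the text's statement that the components may be implemented in hardware, software, or a combination.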
The web page content recognition device of the present invention converts the visual features of web page blocks into feature vectors that a training tool can learn from. A content recognition model can thus be generated with the training tool and then used for content recognition, which can improve the efficiency and accuracy of web page content recognition.
Fifth embodiment
Fig. 7 is a structural diagram of the web page content recognition device of the fifth embodiment of the present invention. As shown in Fig. 7, the web page content recognition device 70 includes a data acquisition module 701, a visual feature acquisition module 702, a data processing module 703, and a model building module 704.
Preferably, the visual feature acquisition module 702 includes a selection unit 712, a parsing unit 722, and an acquisition unit 732. The selection unit 712 is used to select the content to be labeled in the training web page. The parsing unit 722 is used to parse the XPath of the content to be labeled. The acquisition unit 732 is used to look up the visual features of the block corresponding to the selected content according to the XPath.
Preferably, the data processing module 703 includes a numeric feature processing unit 713, which is used to give each numeric feature in the visual features one position in the vector.
Preferably, the data processing module 703 includes a non-numeric feature processing unit 723, which is used to represent the non-numeric features in the visual features in lateral one-hot representation form.
Preferably, the web page content recognition device 70 further includes a recognition module (not shown), which receives the feature identifier of a web page, finds the web page to be recognized according to the feature identifier, converts the visual features of all blocks of that web page into feature vectors, and then uses the recognition model to identify, from those feature vectors, the XPath of the corresponding content in the web page to be recognized.
The web page content recognition device of the present invention divides the visual features of web page blocks into numeric and non-numeric features and converts each kind separately to generate feature vectors that a training tool can learn from. A content recognition model can thus be generated with the training tool and then used for content recognition, which can further improve the efficiency and accuracy of web page content recognition.
Sixth embodiment
Fig. 8 is a structural diagram of the server of the sixth embodiment of the present invention. As shown in Fig. 8, the server 80 includes a web page content recognition device. For the specific structure of the web page content recognition device, refer to Fig. 6 or Fig. 7. The structure of the server can also be understood with reference to Fig. 1 and is not described again here.
The web page content recognition method, device, and server of the present invention convert the visual features of web page blocks into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can improve the efficiency and accuracy of web page content recognition.
It should be noted that the embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and the parts that are identical or similar across embodiments can be understood by cross-reference. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments.
It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be carried out by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only preferred embodiments of the present invention and do not limit the invention in any form. Although the invention has been disclosed above through preferred embodiments, they are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that yield equivalent embodiments. Any simple modification, equivalent change, or modification made to the above embodiments in accordance with the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.
Claims (16)
1. a kind of web page contents recognition methods, which is characterized in that the web page contents recognition methods includes:
It determines at least one trained website, and multiple trained webpages is acquired in each trained website;
Obtain the visual signature of the corresponding block of content being selected in each training webpage;
Data processing is carried out to the visual signature and obtains feature vector;And
The identification model of the chosen content is established according to described eigenvector using training tool.
2. web page contents recognition methods as described in claim 1, which is characterized in that determine at least one trained website, and
The step of acquiring multiple trained webpages in each training website includes:
The quantity of the training webpage of each trained website acquisition is determined according to the popularity of the trained website.
3. web page contents recognition methods as described in claim 1, which is characterized in that obtain what is be selected in each training webpage
The step of visual signature of the corresponding block of content, includes:
The content of selected training webpage domestic demand mark;
Parse the XPath of the content that need to be marked;And
The visual signature of the chosen corresponding block of content is searched according to the XPath.
4. web page contents recognition methods as described in claim 1, which is characterized in that it is special that the visual signature includes numeric type
Sign;
The step of data processing obtains feature vector is carried out to the visual signature to include:
One is accounted in vector and represents a kind of numeric type feature.
5. web page contents recognition methods as described in claim 1, which is characterized in that it is special that the visual signature includes nonumeric type
Sign;
The step of data processing obtains feature vector is carried out to the visual signature to include:
The nonumeric type feature is represented with lateral one-hot representation patterns.
6. web page contents recognition methods as described in claim 1, which is characterized in that the training tool trains work for GBDT
Tool.
7. The web page content recognition method according to claim 1, characterized in that the web page content recognition method further
comprises:
Receiving a feature identifier of a web page, and locating the web page to be recognized according to the feature identifier;
Converting the visual features of all blocks of the web page to be recognized into feature vectors; and
Using the recognition model to identify, from the feature vectors of the web page to be recognized, the XPath of the corresponding content
in the web page to be recognized.
8. The web page content recognition method according to claim 7, characterized in that the web page content recognition method further
comprises:
Extracting the corresponding content of the web page to be recognized according to the XPath of the corresponding content in that web page.
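The recognition and extraction steps of claims 7 and 8 might look like the sketch below: every block's feature vector is scored, the XPath of the best-scoring block is returned, and that XPath is then used to pull the text out of the page. The scoring function here is a hand-written stand-in for a trained recognition model, and the page, feature vectors, and function names are hypothetical. Since `xml.etree.ElementTree` does not accept absolute paths on an element, the sketch rewrites the XPath relative to the root.

```python
import xml.etree.ElementTree as ET

PAGE = """<html><body>
<div id="nav">Home</div>
<div id="main"><h1>Title text</h1><p>Body text</p></div>
</body></html>"""


def recognize_xpath(block_vectors, score_fn):
    """Return the XPath of the block whose feature vector scores highest."""
    return max(block_vectors, key=lambda xpath: score_fn(block_vectors[xpath]))


def extract_by_xpath(root, xpath):
    """Return the text of the element at an absolute, index-based XPath."""
    prefix = f"/{root.tag}"
    if not xpath.startswith(prefix):
        raise ValueError(f"XPath must start with {prefix}")
    # ElementTree only supports relative paths on an element.
    node = root.find("." + xpath[len(prefix):])
    return None if node is None else "".join(node.itertext()).strip()


# Hypothetical feature vectors per block: [font_size, is_bold].
block_vectors = {
    "/html/body[1]/div[1]": [14.0, 0.0],
    "/html/body[1]/div[2]/h1[1]": [32.0, 1.0],
    "/html/body[1]/div[2]/p[1]": [14.0, 0.0],
}


def score_fn(vector):
    """Stand-in for the trained recognition model (assumption)."""
    return vector[0] / 40.0 + 0.5 * vector[1]


root = ET.fromstring(PAGE)
xpath = recognize_xpath(block_vectors, score_fn)
content = extract_by_xpath(root, xpath)
print(xpath, content)
```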
9. A web page content recognition device, characterized in that the web page content recognition device comprises:
A data acquisition module, configured to determine at least one training website and collect multiple training web pages from each training website;
A visual feature acquisition module, configured to obtain the visual features corresponding to the selected content in each training web page;
A data processing module, configured to perform data processing on the visual features to obtain feature vectors; and
A model building module, configured to build, using a training tool, a recognition model of the selected content according to the feature
vectors.
10. The web page content recognition device according to claim 9, characterized in that the data acquisition module determines, according to
the popularity of each training website, the number of training web pages to collect from that training website.
11. The web page content recognition device according to claim 9, characterized in that the visual feature acquisition module comprises:
A selection unit, configured to select the content to be labeled in the training web page;
A parsing unit, configured to parse the XPath of the content to be labeled; and
An acquisition unit, configured to look up, according to the XPath, the visual features of the block corresponding to the selected content.
12. The web page content recognition device according to claim 9, characterized in that the data processing module comprises:
A numeric feature processing unit, configured to represent each numeric feature among the visual features by one position in the vector.
13. The web page content recognition device according to claim 9, characterized in that the data processing module comprises:
A non-numeric feature processing unit, configured to represent the non-numeric features among the visual features in a horizontal one-hot
representation.
14. The web page content recognition device according to claim 9, characterized in that the model building module uses a GBDT
training tool to build the recognition model of the selected content according to the feature vectors.
15. The web page content recognition device according to claim 9, characterized in that the web page content recognition device further
comprises:
A recognition module, configured to receive a feature identifier of a web page, locate the web page to be recognized according to the feature identifier,
convert the visual features of all blocks of the web page to be recognized into feature vectors, and then use the recognition model to identify,
from the feature vectors of the web page to be recognized, the XPath of the corresponding content in that web page.
16. A server, characterized by comprising the web page content recognition device according to any one of claims 9 to 15.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611170430.3A CN108205544A (en) | 2016-12-16 | 2016-12-16 | Web page contents recognition methods, device, server |
PCT/CN2017/112866 WO2018103540A1 (en) | 2016-12-09 | 2017-11-24 | Webpage content extraction method, device, and data storage medium |
US16/359,224 US11074306B2 (en) | 2016-12-09 | 2019-03-20 | Web content extraction method, device, storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611170430.3A CN108205544A (en) | 2016-12-16 | 2016-12-16 | Web page contents recognition methods, device, server |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108205544A (en) | 2018-06-26 |
Family
ID=62602719
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611170430.3A Pending CN108205544A (en) | 2016-12-09 | 2016-12-16 | Web page contents recognition methods, device, server |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108205544A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8725751B1 (en) * | 2008-08-28 | 2014-05-13 | Trend Micro Incorporated | Method and apparatus for blocking or blurring unwanted images |
CN103942224A (en) * | 2013-01-23 | 2014-07-23 | 百度在线网络技术(北京)有限公司 | Method and device for acquiring annotation rule of webpage blocks |
CN105550278A (en) * | 2015-12-10 | 2016-05-04 | 天津海量信息技术有限公司 | Webpage region recognition algorithm based on deep learning |
CN106156236A (en) * | 2014-10-28 | 2016-11-23 | 李光耀 | Vision web page analysis System and method for |
- 2016-12-16 CN CN201611170430.3A patent/CN108205544A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105027121B (en) | The five application page of the machine application is indexed | |
US8458207B2 (en) | Using anchor text to provide context | |
US8589366B1 (en) | Data extraction using templates | |
CN111026937B (en) | Method, device and equipment for extracting POI name and computer storage medium | |
US11561988B2 (en) | Systems and methods for harvesting data associated with fraudulent content in a networked environment | |
CN103294781A (en) | Method and equipment used for processing page data | |
CN104615767A (en) | Searching-ranking model training method and device and search processing method | |
CN104462611A (en) | Modeling method, ranking method, modeling device and ranking device for information ranking model | |
CN101534306A (en) | Detecting method and a device for fishing website | |
CN104011716A (en) | Providing knowledge panels with search results | |
CN105045901A (en) | Search keyword push method and device | |
CN109033220B (en) | Automatic selection method, system, equipment and storage medium of labeled data | |
CN102708168A (en) | System and method for sorting search results of teaching resources | |
CN107807957A (en) | entity library generating method and device | |
CN109903076A (en) | A kind of ad data generation method, system, electronic equipment and storage medium | |
CN102609539B (en) | Search method and search system | |
CN110309049A (en) | Web page contents monitor method, device, computer equipment and storage medium | |
CN104657474A (en) | Advertisement display method, advertisement inquiring server and client side | |
CN106446123A (en) | Webpage verification code element identification method | |
CN103885767A (en) | System and method used for geographical area correlated websites | |
US11074306B2 (en) | Web content extraction method, device, storage medium | |
CN110262906B (en) | Interface label recommendation method and device, storage medium and electronic equipment | |
CN108205544A (en) | Web page contents recognition methods, device, server | |
CN105912573A (en) | Data updating method and data updating device | |
CN115186240A (en) | Social network user alignment method, device and medium based on relevance information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||