CN108205544A

CN108205544A - Web page content recognition method, device, and server

Info

Publication number: CN108205544A
Application number: CN201611170430.3A
Authority: CN
Inventors: 赵铭鑫; 卓居超
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2016-12-16
Filing date: 2016-12-16
Publication date: 2018-06-26

Abstract

The invention discloses a web page content recognition method. The method includes: determining at least one training website and collecting multiple training web pages from each training website; obtaining the visual features of the block corresponding to the selected content in each training web page; processing the visual features to obtain feature vectors; and using a training tool to build a recognition model for the selected content from the feature vectors. The invention also provides a web page content recognition device and a server. The method, device, and server of the invention convert the visual features of web page regions into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can improve the efficiency and accuracy of web page content recognition.

Description

Web page content recognition method, device, and server

Technical field

The present invention relates to the field of Internet technology, and in particular to a web page content recognition method, device, and server.

Background art

With the rapid development of the Internet, the amount of information on the network has increased sharply, and the need to recognize the content of web pages has become increasingly urgent.

Existing web page content recognition methods work on the visual features of web page content and derive recognition rules for the content by sample statistics. Such methods require the recognition rules to be adjusted continually through feedback, so training takes a long time; as a result, recognition efficiency is low and accuracy is limited.

Summary of the invention

In view of this, the present invention provides a web page content recognition method, device, and server that can improve the efficiency and accuracy of web page content recognition.

An embodiment of the present invention provides a web page content recognition method, including: determining at least one training website and collecting multiple training web pages from each training website; obtaining the visual features of the block corresponding to the selected content in each training web page; processing the visual features to obtain feature vectors; and using a training tool to build a recognition model for the selected content from the feature vectors.

The present invention also provides a web page content recognition device, including a data acquisition module, a visual feature acquisition module, a data processing module, and a model building module. The data acquisition module is used to determine at least one training website and to collect multiple training web pages from each training website. The visual feature acquisition module is used to obtain the visual features corresponding to the selected content in each training web page. The data processing module is used to process the visual features to obtain feature vectors. The model building module is used to build a recognition model for the selected content from the feature vectors using a training tool.

The present invention also provides a server that includes a web page content recognition device. The web page content recognition device includes a data acquisition module, a visual feature acquisition module, a data processing module, and a model building module. The data acquisition module is used to determine at least one training website and to collect multiple training web pages from each training website. The visual feature acquisition module is used to obtain the visual features corresponding to the selected content in each training web page. The data processing module is used to process the visual features to obtain feature vectors. The model building module is used to build a recognition model for the selected content from the feature vectors using a training tool.

The web page content recognition method, device, and server of the present invention convert the visual features of web page regions into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can improve the efficiency and accuracy of web page content recognition.

To make the above and other objects, features, and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Description of the drawings

Fig. 1 is a structural diagram of a server;

Fig. 2 is a flow diagram of the web page content recognition method of the first embodiment of the invention;

Fig. 3 is a flow diagram of the web page content recognition method of the second embodiment of the invention;

Fig. 4 is a schematic diagram of the interface used by the web page content recognition method shown in Fig. 3;

Fig. 5 is a flow diagram of the web page content recognition method of the third embodiment of the invention;

Fig. 6 is a structural diagram of the web page content recognition device of the fourth embodiment of the invention;

Fig. 7 is a structural diagram of the web page content recognition device of the fifth embodiment of the invention;

Fig. 8 is a structural diagram of the server of the sixth embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

The web page content recognition method provided by the embodiments of the present invention can be applied to the server shown in Fig. 1. As shown in Fig. 1, the server includes memory 101, processor 102, and network module 103.

It will be understood that the structure shown in Fig. 1 is only illustrative. The server may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Each component shown in Fig. 1 may be implemented in hardware, software, or a combination of the two. In addition, the server in the embodiments of the present invention may consist of multiple servers with different specific functions.

Memory 101 can be used to store software programs and modules, such as the program instructions/modules corresponding to the web page content recognition method and system in the embodiments of the present invention. Processor 102 runs the software programs and modules stored in memory 101 to execute various functional applications and data processing, that is, to implement the web page content recognition method and system in the embodiments of the present invention. Memory 101 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, memory 101 may further include memory located remotely from processor 102, and such remote memory may be connected to the server over a network. Further, the software programs and modules may include an operating system 121 and a service module 122. Operating system 121, which may be for example LINUX, UNIX, or WINDOWS, may include various software components and/or drivers for managing system tasks (such as memory management, storage device control, and power management) and can communicate with various hardware or software components, thereby providing a running environment for other software components. Service module 122 runs on top of operating system 121, listens for requests from the network through the network services of operating system 121, performs the corresponding data processing according to each request, and returns the result to the terminal. That is, service module 122 provides network services to terminals.

Network module 103 is used to receive and send network signals. The network signals may include wireless signals or wired signals. In one example, the network signal is a wired network signal; in this case, network module 103 may include elements such as a processor, random access memory, a converter, and a crystal oscillator.

First embodiment

Fig. 2 is a flow chart of the web page content recognition method provided by the first embodiment of the invention. In this embodiment, the method is performed by a server over a network. As shown in Fig. 2, the web page content recognition method of this embodiment may include the following steps:

Step S21: Determine at least one training website, and collect multiple training web pages from each training website;

Specifically, the number of training web pages collected from each training website may be, but is not limited to being, determined according to the popularity of the training website: the more popular the website, the more training web pages are collected from it. This lets the training tool learn the visual features corresponding to the content of heavily visited web pages, which increases the accuracy of web page recognition.
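As an illustration of how the page count could follow site popularity, the sketch below splits a fixed collection budget across training websites in proportion to a popularity score. The function name, the score values, the `example` host names, and the proportional rule itself are assumptions made for this example; the text above only requires that more popular websites contribute more training web pages.

```python
def allocate_page_quota(popularity, total_pages, min_pages=1):
    """Split a collection budget across training sites in proportion to popularity.

    `popularity` maps a site name to a non-negative score (hypothetical values,
    for example daily visits). Every site receives at least `min_pages` pages.
    """
    total_popularity = sum(popularity.values())
    if total_popularity <= 0:
        raise ValueError("popularity scores must sum to a positive value")
    quota = {}
    for site, score in popularity.items():
        share = score / total_popularity
        quota[site] = max(min_pages, round(total_pages * share))
    return quota


quota = allocate_page_quota(
    {"shop-a.example": 700, "shop-b.example": 250, "shop-c.example": 50},
    total_pages=1000,
)
```

Because each share is rounded independently, the quotas may not sum exactly to `total_pages`; a real collector would have to decide how to distribute the remainder.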

Step S22: Obtain the visual features of the block corresponding to the selected content in each training web page;

Specifically, the visual features of a block are the main features that characterize that web page region at the visual level. They may be, but are not limited to, the length of the block, the font size of the block, the web page tag, and so on.

Step S23: Process the visual features to obtain feature vectors;

To obtain feature vectors that the training tool can recognize, the visual features must be processed. Specifically, if the visual features include numeric features, each kind of numeric feature occupies one position in the vector. In particular, the values of each numeric feature are collected and then divided equally into several parts, for example 10 parts, which are mapped to the 10 intervals 0~0.1, 0.1~0.2, 0.2~0.3, 0.3~0.4, 0.4~0.5, 0.5~0.6, 0.6~0.7, 0.7~0.8, 0.8~0.9, and 0.9~1.0.
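The mapping of a numeric feature onto 10 intervals can be sketched as follows. The text does not say whether "divided equally" means equal-width or equal-frequency intervals, so this example assumes equal-width intervals between the observed minimum and maximum, and represents each value by the lower bound of its interval; both choices, and the function name, are assumptions.

```python
def bucketize(values, n_bins=10):
    """Map each value to the lower bound of one of `n_bins` equal-width intervals in [0, 1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so they all fall into the first interval.
        return [0.0 for _ in values]
    width = (hi - lo) / n_bins
    result = []
    for v in values:
        # The maximum value would land in bin `n_bins`, so clamp it to the last bin.
        index = min(int((v - lo) / width), n_bins - 1)
        result.append(index / n_bins)
    return result


font_sizes = [12, 14, 18, 24, 32]   # hypothetical block font sizes in pixels
buckets = bucketize(font_sizes)
```

Each numeric feature then contributes a single component to the feature vector, here the value in `buckets` for the block in question.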

If the visual features include non-numeric features, each non-numeric feature is represented with a horizontal one-hot representation. One-hot representation is one of the simplest word vector representations: a word is represented by one long vector whose length is the size of the dictionary; exactly one component of the vector is "1" and all the others are "0", and the position of the "1" corresponds to the position of the word in the dictionary.
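A minimal sketch of the one-hot representation for a non-numeric feature such as the web page tag is given below. The dictionary of tags and the way the numeric slot and the one-hot slots are concatenated into one vector are assumptions made for illustration.

```python
def one_hot(value, vocabulary):
    """Return a vector of len(vocabulary) zeros with a single 1 at the value's position."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(value)] = 1   # raises ValueError for an unknown value
    return vector


TAGS = ["div", "h1", "p", "span"]          # hypothetical dictionary of web page tags

tag_vector = one_hot("h1", TAGS)

# One numeric slot (a bucketed font size) followed by the one-hot slots for the tag.
feature_vector = [0.9] + tag_vector
```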

Step S24: Use a training tool to build a recognition model for the selected content from the feature vectors.

Specifically, the training tool may be, but is not limited to, a gradient boosting decision tree (GBDT) training tool, or another machine learning training tool such as a linear regression training tool.

Specifically, building the recognition model for the selected content from the feature vectors means establishing the correspondence between the feature vectors of a web page and web page content such as the title or the price.
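To make the idea of boosting concrete, the following toy sketch fits a sequence of depth-1 decision trees (stumps) to the residuals of a squared-error loss and labels each block as title (1) or not title (0). It is a simplified stand-in written from scratch, not the GBDT training tool referred to above; the feature layout, the sample data, and all function names are assumptions.

```python
def fit_stump(X, residuals):
    """Find the single-feature threshold split that minimises squared error."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[feature] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_mean, right_mean)
    return best[1:] if best else None


def train_boosted_stumps(X, y, n_rounds=20, learning_rate=0.5):
    """Gradient boosting with squared loss: each stump is fitted to the current residuals."""
    base = sum(y) / len(y)
    predictions = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [target - p for target, p in zip(y, predictions)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        feature, threshold, left_value, right_value = stump
        stumps.append(stump)
        predictions = [
            p + learning_rate * (left_value if row[feature] <= threshold else right_value)
            for row, p in zip(X, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, row):
    """Sum the base score and the scaled output of every stump."""
    base, learning_rate, stumps = model
    score = base
    for feature, threshold, left_value, right_value in stumps:
        score += learning_rate * (left_value if row[feature] <= threshold else right_value)
    return score


# Assumed layout: [font-size bucket, is_h1, is_p, is_span]; label 1 means "title".
X = [[0.9, 1, 0, 0], [0.1, 0, 1, 0], [0.2, 0, 0, 1],
     [0.8, 1, 0, 0], [0.0, 0, 1, 0], [0.3, 0, 0, 1]]
y = [1, 0, 0, 1, 0, 0]
model = train_boosted_stumps(X, y)
```

A score above 0.5 from `predict` can be read as "this block looks like a title"; a separate model of the same form could be trained for each kind of content, such as the price.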

The web page content recognition method of the present invention converts the visual features of web page regions into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can improve the efficiency and accuracy of web page content recognition.

Second embodiment

Fig. 3 is a flow diagram of the web page content recognition method of the second embodiment of the invention. Fig. 4 is a schematic diagram of the interface used by the method shown in Fig. 3. Referring to Fig. 3 and Fig. 4 together, the web page content recognition method includes:

Step S221: Select the content to be labeled in the training web page;

As shown in Fig. 4, the content to be labeled in the training web page, such as title 40, can be selected manually.

Step S222: Parse the XPath of the content to be labeled;

Specifically, when the Preview XPath button 41 receives a trigger signal, the labeling program parses the XPath of the content and displays it in XPath display area 42. The labeling program may of course also be triggered automatically and send the XPath directly to the back end after parsing it.

Specifically, when several kinds of content need to be labeled, the attribute of the content, for example "title", must be entered in attribute input area 43, and the attribute of the content is stored together with its XPath.

Step S223: Look up the visual features of the block corresponding to the selected content according to the XPath.

Specifically, since the XPath of each block in a web page is unique, all the visual features stored for the corresponding block after parsing can be found from the XPath of the content to be labeled.
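The lookup in steps S222 and S223 can be sketched as a dictionary keyed by XPath. In the sketch below, a small well-formed page is parsed with Python's standard `xml.etree.ElementTree`, every element is given an absolute, position-qualified XPath, and the features stored under the labeled XPath are retrieved. The `data-font-size` attribute stands in for values that a rendering engine would supply; it, the page, and the helper names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def index_blocks(root):
    """Map an absolute, position-qualified XPath to every element under `root`."""
    index = {}

    def walk(element, path):
        index[path] = element
        counts = {}
        for child in element:
            # Number siblings per tag name so that each path is unique.
            counts[child.tag] = counts.get(child.tag, 0) + 1
            walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")

    walk(root, f"/{root.tag}")
    return index


PAGE = """
<html><body>
  <h1 data-font-size="32">Blue kettle</h1>
  <div><span data-font-size="18">19.90</span></div>
</body></html>
"""

blocks = index_blocks(ET.fromstring(PAGE.strip()))

# Visual features stored per block after parsing, keyed by XPath.
feature_store = {
    xpath: {
        "tag": element.tag,
        "font_size": int(element.get("data-font-size", "0")),
        "text_length": len((element.text or "").strip()),
    }
    for xpath, element in blocks.items()
}

# Attribute of the labeled content stored together with its XPath.
labels = {"title": "/html/body[1]/h1[1]"}
title_features = feature_store[labels["title"]]
```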

In a specific implementation, webkit, as a headless browser kernel, can parse Cascading Style Sheets (CSS) and render the page automatically. These functions of webkit can therefore be used to extract the visual information of each block; feature engineering methods are then applied to process the visual information into visual features, which are stored for later lookup.

The visual features of the block corresponding to the selected content in each training web page can be obtained through steps S221, S222, and S223 above. Alternatively, the content to be labeled in the training web page, such as the title, can be selected and then parsed directly to obtain the visual features of the corresponding block.

The web page content recognition method of the present invention obtains the visual features of the block corresponding to the selected content according to the XPath of that content, and converts the visual features of web page regions into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can further improve the efficiency and accuracy of web page content recognition.

Third embodiment

Fig. 5 is a flow diagram of the web page content recognition method of the third embodiment of the invention. In this embodiment, the method is performed by a server over a network. As shown in Fig. 5, the web page content recognition method of this embodiment may include the following steps:

Step S51: Determine at least one training website, and collect multiple training web pages from each training website;

Step S52: Obtain the visual features of the block corresponding to the selected content in each training web page;

Step S53: Process the visual features to obtain feature vectors;

If the visual features include non-numeric features, each non-numeric feature is represented with a horizontal one-hot representation. One-hot representation is one of the simplest word vector representations: a word is represented by one long vector whose length is the size of the dictionary; exactly one component of the vector is "1" and all the others are "0", and the position of the "1" corresponds to the position of the word in the dictionary.

Step S54: Use a training tool to build a recognition model for the selected content from the feature vectors;

Step S55: Receive the feature identifier of a web page, and find the web page to be recognized according to the feature identifier;

The feature identifier may be a uniform resource locator (URL), a title, or the like, and is used to uniquely identify one web page.

In a specific implementation, a user may submit the feature identifier of the web page to be recognized to the server through an interactive interface provided for that purpose, or another server, a business platform, or the like may submit it to the server. The feature identifier of a single web page may be submitted at a time, or the feature identifiers of multiple web pages may be submitted at once for batch processing. The server determines from the feature identifiers which web pages need content recognition.

Step S56: Convert the visual features of all blocks of the web page to be recognized into feature vectors;

Step S57: Use the recognition model to identify, from the feature vectors of the web page to be recognized, the XPath of the corresponding content in that web page.

Specifically, if the recognition model contains the relationships between the feature vectors of several kinds of content, such as title and price, and their XPaths, then the attribute of the desired content, such as "title", is input so that the recognition model identifies the XPath of the title.
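One possible reading of steps S56 and S57 is sketched below: every block of the web page to be recognized is scored by the model for the requested attribute, and the XPath of the highest-scoring block is returned. The hand-written scoring functions stand in for trained models, and the vector layout, XPaths, and names are assumptions for illustration.

```python
def recognise_xpath(block_vectors, score):
    """Return the XPath whose feature vector receives the highest score."""
    return max(block_vectors, key=lambda xpath: score(block_vectors[xpath]))


# Assumed layout: [font-size bucket, is_h1, is_span], one vector per block of the page.
block_vectors = {
    "/html/body[1]/h1[1]": [0.9, 1, 0],
    "/html/body[1]/div[1]/span[1]": [0.5, 0, 1],
    "/html/body[1]/p[1]": [0.1, 0, 0],
}

# Placeholder scorers keyed by content attribute; a trained model would replace each one.
models = {
    "title": lambda v: v[0] + v[1],
    "price": lambda v: v[2] - 0.5 * v[1],
}

title_xpath = recognise_xpath(block_vectors, models["title"])
price_xpath = recognise_xpath(block_vectors, models["price"])
```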

Preferably, the web page content recognition method further includes:

Step S58: Extract the corresponding content of the web page to be recognized according to the XPath of that content in the web page.

Specifically, the extracted content may be used, without limitation, as data for statistical analysis. For example, extracting the title and price of the web page to be recognized makes it possible to detect rising prices of goods.
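Step S58 and the price example can be sketched with Python's standard `xml.etree.ElementTree`, which evaluates a limited XPath subset relative to an element. The two page snapshots, the XPath, and the helper name are assumptions; real web pages are often not well-formed XML and would need an HTML-aware parser instead.

```python
import xml.etree.ElementTree as ET


def extract_text(page_source, absolute_xpath):
    """Return the stripped text of the element addressed by an absolute XPath."""
    root = ET.fromstring(page_source.strip())
    steps = absolute_xpath.strip("/").split("/")
    if steps[0].split("[")[0] != root.tag:
        raise ValueError("XPath does not start at the document root")
    if len(steps) == 1:
        node = root
    else:
        # ElementTree evaluates paths relative to an element, so drop the root step.
        node = root.find("./" + "/".join(steps[1:]))
    return None if node is None else (node.text or "").strip()


PRICE_XPATH = "/html/body[1]/div[1]/span[1]"

# Two hypothetical snapshots of the same product page taken on different days.
SNAPSHOTS = [
    "<html><body><h1>Blue kettle</h1><div><span>19.90</span></div></body></html>",
    "<html><body><h1>Blue kettle</h1><div><span>21.50</span></div></body></html>",
]

prices = [float(extract_text(page, PRICE_XPATH)) for page in SNAPSHOTS]
price_rising = prices[-1] > prices[0]
```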

The web page content recognition method of the present invention divides the visual features of web page regions into numeric and non-numeric features and converts each kind separately to generate feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool and then used for content recognition, which can further improve the efficiency and accuracy of web page content recognition.

Fourth embodiment

Fig. 6 is a structural diagram of the web page content recognition device of the fourth embodiment of the invention. As shown in Fig. 6, web page content recognition device 60 includes data acquisition module 601, visual feature acquisition module 602, data processing module 603, and model building module 604.

Data acquisition module 601 is used to determine at least one training website and to collect multiple training web pages from each training website. Visual feature acquisition module 602 is used to obtain the visual features corresponding to the selected content in each training web page. Data processing module 603 is used to process the visual features to obtain feature vectors. Model building module 604 is used to build a recognition model for the selected content from the feature vectors using a training tool.
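The way the four modules hand data to one another can be sketched as a small pipeline object. The class, its field names, and the stub functions below are hypothetical and serve only to show the order of the calls; they are not the structure of device 60 itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class RecognitionDevice:
    collect: Callable[[], List[str]]             # role of data acquisition module 601
    get_features: Callable[[str], Dict]          # role of visual feature acquisition module 602
    to_vector: Callable[[Dict], List[float]]     # role of data processing module 603
    fit: Callable[[List[List[float]]], object]   # role of model building module 604

    def build_model(self):
        """Run the modules in order: collect pages, get features, vectorise, train."""
        pages = self.collect()
        vectors = [self.to_vector(self.get_features(page)) for page in pages]
        return self.fit(vectors)


# Stub implementations, for illustration only.
device = RecognitionDevice(
    collect=lambda: ["page-1", "page-2"],
    get_features=lambda page: {"font_size": len(page)},
    to_vector=lambda features: [features["font_size"] / 10],
    fit=lambda vectors: {"n_samples": len(vectors)},
)
model_summary = device.build_model()
```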

Specifically, data acquisition module 601 determines the number of training web pages collected from each training website according to the popularity of the training website.

The web page content recognition device of the present invention converts the visual features of web page regions into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool and then used for content recognition, which can improve the efficiency and accuracy of web page content recognition.

Fifth embodiment

Fig. 7 is a structural diagram of the web page content recognition device of the fifth embodiment of the invention. As shown in Fig. 7, web page content recognition device 70 includes data acquisition module 701, visual feature acquisition module 702, data processing module 703, and model building module 704.

Preferably, visual feature acquisition module 702 includes selection unit 712, parsing unit 722, and acquisition unit 732. Selection unit 712 is used to select the content to be labeled in the training web page. Parsing unit 722 is used to parse the XPath of the content to be labeled. Acquisition unit 732 is used to look up the visual features of the block corresponding to the selected content according to the XPath.

Preferably, data processing module 703 includes numeric feature processing unit 713, which is used to represent each numeric feature in the visual features by one position in the vector.

Preferably, data processing module 703 includes non-numeric feature processing unit 723, which is used to represent the non-numeric features in the visual features with a horizontal one-hot representation.

Preferably, web page content recognition device 70 further includes a recognition module (not shown), which is used to receive the feature identifier of a web page, find the web page to be recognized according to the feature identifier, convert the visual features of all blocks of the web page to be recognized into feature vectors, and then use the recognition model to identify, from those feature vectors, the XPath of the corresponding content in the web page to be recognized.

The web page content recognition device of the present invention divides the visual features of web page regions into numeric and non-numeric features and converts each kind separately to generate feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool and then used for content recognition, which can further improve the efficiency and accuracy of web page content recognition.

Sixth embodiment

Fig. 8 is a structural diagram of the server of the sixth embodiment of the invention. As shown in Fig. 8, server 80 includes a web page content recognition device. For the specific structure of the web page content recognition device, refer to Fig. 6 or Fig. 7. The structure of the server may also refer to Fig. 1 and is not described again here.

The web page content recognition method, device, and server of the present invention convert the visual features of web page regions into feature vectors that a training tool can learn from, so that a content recognition model can be generated with the training tool, which can improve the efficiency and accuracy of web page content recognition.

It should be noted that the embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments can be understood by referring to one another. Because the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for the relevant details, refer to the description of the method embodiments.

It should also be noted that relational terms such as "first" and "second" are used here only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to that process, method, article, or device. Unless further limited, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.

A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be carried out by hardware, or by a program that instructs the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above are only preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, these are not intended to limit it. Any person skilled in the art may, without departing from the scope of the technical solution of the present invention, use the technical content disclosed above to make minor changes or modifications that result in equivalent embodiments. Any simple modification, equivalent change, or refinement made to the above embodiments according to the technical essence of the present invention, without departing from the content of its technical solution, still falls within the scope of the technical solution of the present invention.

Claims

1. A web page content recognition method, characterized in that the web page content recognition method includes:

determining at least one training website, and collecting multiple training web pages from each training website;

obtaining the visual features of the block corresponding to the selected content in each training web page;

processing the visual features to obtain feature vectors; and

building a recognition model for the selected content from the feature vectors using a training tool.

2. The web page content recognition method according to claim 1, characterized in that the step of determining at least one training website and collecting multiple training web pages from each training website includes:

determining the number of training web pages collected from each training website according to the popularity of the training website.

3. The web page content recognition method according to claim 1, characterized in that the step of obtaining the visual features of the block corresponding to the selected content in each training web page includes:

selecting the content to be labeled in the training web page;

parsing the XPath of the content to be labeled; and

looking up the visual features of the block corresponding to the selected content according to the XPath.

4. The web page content recognition method according to claim 1, characterized in that the visual features include numeric features;

the step of processing the visual features to obtain feature vectors includes:

representing each kind of numeric feature by one position in the vector.

5. The web page content recognition method according to claim 1, characterized in that the visual features include non-numeric features;

the non-numeric features are represented with a horizontal one-hot representation.

6. The web page content recognition method according to claim 1, characterized in that the training tool is a GBDT training tool.

7. The web page content recognition method according to claim 1, characterized in that the web page content recognition method further includes:

receiving the feature identifier of a web page, and finding the web page to be recognized according to the feature identifier;

converting the visual features of all blocks of the web page to be recognized into feature vectors; and

using the recognition model to identify, from the feature vectors of the web page to be recognized, the XPath of the corresponding content in the web page to be recognized.

8. The web page content recognition method according to claim 7, characterized in that the web page content recognition method further includes:

extracting the corresponding content of the web page to be recognized according to the XPath of that content in the web page to be recognized.

9. A web page content recognition device, characterized in that the web page content recognition device includes:

a data acquisition module, for determining at least one training website and collecting multiple training web pages from each training website;

a visual feature acquisition module, for obtaining the visual features corresponding to the selected content in each training web page;

a data processing module, for processing the visual features to obtain feature vectors; and

a model building module, for building a recognition model for the selected content from the feature vectors using a training tool.

10. The web page content recognition device according to claim 9, characterized in that the data acquisition module determines the number of training web pages collected from each training website according to the popularity of the training website.

11. The web page content recognition device according to claim 9, characterized in that the visual feature acquisition module includes:

a selection unit, for selecting the content to be labeled in the training web page;

a parsing unit, for parsing the XPath of the content to be labeled; and

an acquisition unit, for looking up the visual features of the block corresponding to the selected content according to the XPath.

12. The web page content recognition device according to claim 9, characterized in that the data processing module includes:

a numeric feature processing unit, for representing each numeric feature in the visual features by one position in the vector.

13. The web page content recognition device according to claim 9, characterized in that the data processing module includes:

a non-numeric feature processing unit, for representing the non-numeric features in the visual features with a horizontal one-hot representation.

14. The web page content recognition device according to claim 9, characterized in that the model building module uses a GBDT training tool to build the recognition model for the selected content from the feature vectors.

15. The web page content recognition device according to claim 9, characterized in that the web page content recognition device further includes:

a recognition module, for receiving the feature identifier of a web page, finding the web page to be recognized according to the feature identifier, converting the visual features of all blocks of the web page to be recognized into feature vectors, and then using the recognition model to identify, from the feature vectors of the web page to be recognized, the XPath of the corresponding content in the web page to be recognized.

16. A server, characterized in that it includes the web page content recognition device according to any one of claims 9 to 15.