CN106855859A

CN106855859A - Webpage body text extraction method and device

Info

Publication number: CN106855859A
Application number: CN201510897907.7A
Authority: CN
Inventors: 胡又欢; 卞维杰
Original assignee: Beijing Sogou Technology Development Co Ltd
Current assignee: Beijing Sogou Technology Development Co Ltd
Priority date: 2015-12-08
Filing date: 2015-12-08
Publication date: 2017-06-16
Anticipated expiration: 2035-12-08
Also published as: CN106855859B

Abstract

The invention discloses a webpage body text extraction method and device. The body-text extraction information of at least two target webpages is compared, and node information that the comparison finds to be identical across the body-text extraction information of the at least two target webpages is confirmed as webpage impurity, where the at least two target webpages belong to the same type of webpage. Impurity information is then filtered from webpages of the same type according to the webpage impurity to obtain the body text of each webpage. Because the impurity information in the body-text extraction information of target webpages of the same type can be determined, that body-text extraction information can be filtered accordingly, and more accurate body text is finally obtained.

Description

Webpage body text extraction method and device

Technical field

The present invention relates to the field of Internet technology, and more particularly to a webpage body text extraction method and device.

Background

At present, webpage body text is generally extracted either with a template-based approach or with a text-density-based approach; that is, the body text is extracted by selecting a fixed node or by selecting the node that exhibits body-text features. In a typical node-selection scheme, a web crawler first fetches the source code of the webpage, a Document Object Model (DOM) tree is built from that source code, and the corresponding node is then chosen to extract the body text. For example, the body display area of some webpages is fixed at one node, so it suffices to locate that body node and take out the text beneath it. However, when the impurity information to be rejected is tightly interleaved with the body text and sits under the same body node, the prior art cannot reject the impurity information and therefore cannot obtain more accurate webpage body text.

Summary of the invention

In view of the above problems, embodiments of the present invention are proposed to provide a webpage body text extraction method and a corresponding device that overcome the above problems or at least partially solve them.

To solve the above technical problem, an embodiment of the present invention provides a webpage body text extraction method, which includes:

comparing the body-text extraction information of at least two target webpages, and confirming as webpage impurity the node information that the comparison finds to be identical in the body-text extraction information of the at least two target webpages, where the at least two target webpages belong to the same type of webpage;

filtering impurity information from webpages of the same type according to the webpage impurity to obtain the body text of the webpage.

Wherein, comparing the body-text extraction information of at least two target webpages and confirming the identical node information as webpage impurity specifically includes:

extracting the body-text extraction information of the first target webpage and saving it in the database corresponding to the type to which the first target webpage belongs, thereby initializing the database;

extracting the body-text extraction information of the next target webpage, and comparing each item of child node information in it with each item of child node information in the body-text extraction information of the target webpages saved in the database, where the next target webpage belongs to the same type of webpage as the first target webpage;

confirming child node information that the comparison finds to be identical as webpage impurity, and saving the body-text extraction information of the next target webpage in the database;

returning to the step of extracting the body-text extraction information of the next target webpage, until all target webpages have been traversed.

In addition, the method further includes:

setting a corresponding counter for every item of child node information saved in the database;

according to the comparison result, each time determining child node information found to be identical as webpage impurity, and incrementing by one the counter of child node information found to be different; once the value of a counter reaches a threshold, the child node information corresponding to that counter is no longer kept in the database.

Wherein, the child node information includes text information and/or pictures;

the comparison of each item of child node information with each item of child node information in the body-text extraction information of the target webpages saved in the database is performed on the hash-encoded value of the text information of the child node information and/or on the image link information.

In addition, the method further includes:

setting a corresponding counter for the webpage impurity;

when impurity information is filtered from a webpage of the same type according to the webpage impurity, if the body-text extraction information of that webpage contains impurity information identical to the webpage impurity, the counter of that webpage impurity is reset to zero; if it contains no impurity information identical to the webpage impurity, the counter of that webpage impurity is incremented by one; once the value of a counter reaches a threshold, the webpage impurity corresponding to that counter is no longer kept.

Wherein, webpages of the same type are webpages belonging to the same WeChat official account.

According to one aspect of the present invention, an embodiment of the present invention provides a webpage body text extraction device, which includes:

a webpage impurity confirmation module, configured to compare the body-text extraction information of at least two target webpages and to confirm as webpage impurity the node information that the comparison finds to be identical in the body-text extraction information of the at least two target webpages, where the at least two target webpages belong to the same type of webpage;

a filter processing module, configured to filter impurity information from webpages of the same type according to the webpage impurity to obtain the body text of the webpage.

According to one aspect of the present invention, an embodiment of the present invention provides a device for webpage body text extraction, which includes a memory and one or more programs, where the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs containing instructions for performing the following operations:

In addition, the one or more programs configured to be executed by the one or more processors further contain instructions for performing the following operations:

In addition, the one or more programs configured to be executed by the one or more processors further contain instructions for performing the following operations:

the child node information includes text information and/or pictures;

setting a corresponding counter for the webpage impurity;

With the webpage body text extraction method and device provided by the embodiments of the present invention, the body-text extraction information of at least two target webpages is compared, and the node information that the comparison finds to be identical in it is confirmed as webpage impurity, where the at least two target webpages belong to the same type of webpage; impurity information is then filtered from webpages of the same type according to the webpage impurity to obtain the body text of the webpage. Because the impurity information in the body-text extraction information of target webpages of the same type can be determined, that body-text extraction information can be filtered accordingly, and more accurate body text can finally be obtained.

Brief description of the drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them.

Fig. 1 is a flow chart of a webpage body text extraction method according to an exemplary embodiment of the present invention;

Fig. 2 is a flow chart of confirming webpage impurity in a webpage body text extraction method according to an exemplary embodiment of the present invention;

Fig. 3 is a schematic diagram of a webpage article of the XX official account according to an exemplary embodiment of the present invention;

Fig. 4 is a schematic diagram of another webpage article of the XX official account according to an exemplary embodiment of the present invention;

Fig. 5 is a schematic diagram of the text impurity information that is identical in the two webpages of Fig. 3 and Fig. 4;

Fig. 6 is a schematic diagram of the picture impurity information that is identical in the two webpages of Fig. 3 and Fig. 4;

Fig. 7 is a schematic diagram of the composition of a webpage body text extraction device according to an exemplary embodiment of the present invention;

Fig. 8 is an overall schematic diagram of a webpage body text extraction device according to an exemplary embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some of the embodiments of the present invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.

Refer to Fig. 1, which is a flow chart of a webpage body text extraction method according to an exemplary embodiment of the present invention. The method of this embodiment mainly comprises the following steps:

Step S101: compare the body-text extraction information of at least two target webpages, and confirm as webpage impurity the node information that the comparison finds to be identical in the body-text extraction information of the at least two target webpages, where the at least two target webpages belong to the same type of webpage;

In a specific implementation, webpages of the same type are webpages with the same or a similar page structure, for example webpages on the same platform, such as webpages under the same WeChat official account or webpages under the same website. In addition, the body-text extraction information of a target webpage in this embodiment is the information obtained by extraction according to the body node. In practice, to extract the body text of a target webpage, a web crawler can fetch the source code of the webpage, a Hypertext Markup Language (HTML) parser can then build the Document Object Model (DOM) tree structure, and the corresponding body node can be chosen to obtain the body-text extraction information. Other webpage body text extraction methods can also be used in practice; no specific limitation is imposed here.

In addition, the identical node information described in this embodiment is the node information that is identical across the body-text extraction information obtained after body text extraction is performed on the target webpages, for example the text information or picture information of a child node that is the same under the body nodes of two target webpages. This is not elaborated further here.
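
As a minimal sketch of the comparison in step S101, the following Python fragment treats each page as a list of child node contents and reports the contents that occur in both pages. The function name `find_impurities` and the list-of-strings page representation are illustrative assumptions, not part of the original disclosure.

```python
def find_impurities(page_a, page_b):
    """Return node contents present in both pages, in page_a order.

    Each page is a list of child node contents (text strings or image
    links). This representation is a hypothetical simplification.
    """
    seen_in_b = set(page_b)
    impurities = []
    for node in page_a:
        # A node found in both pages is a candidate webpage impurity.
        if node in seen_in_b and node not in impurities:
            impurities.append(node)
    return impurities


page_1 = ["First article body", "Follow our account!", "http://example.com/ad.png"]
page_2 = ["Second article body", "Follow our account!", "http://example.com/ad.png"]
shared = find_impurities(page_1, page_2)
```

Here `shared` holds the repeated promotional line and the repeated image link, while the two article bodies, which differ, are left out.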

Step S102: filter impurity information from webpages of the same type according to the webpage impurity to obtain the body text of the webpage;

In a specific implementation, because webpages of the same type carry the same impurity information, more accurate body text can be obtained by rejecting, from the body-text extraction information of a webpage of the same type, the information identical to the impurity information confirmed in step S101. For example, the impurity information confirmed in step S101 serves as a sample; if the body-text extraction information of another webpage of the same type contains information identical to this sample webpage impurity, that information can be confirmed as impurity information, and filtering it out yields more accurate body text for that webpage.
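
The filtering in step S102 can be sketched as follows, assuming the same simplified representation in which a page is a list of child node contents and the confirmed impurity samples form a collection of such contents. The name `filter_impurities` is hypothetical.

```python
def filter_impurities(nodes, impurities):
    """Drop every node whose content matches a confirmed impurity sample.

    `nodes` is the ordered list of child node contents extracted from a
    page; `impurities` is any iterable of confirmed impurity samples.
    """
    impurity_set = set(impurities)
    return [node for node in nodes if node not in impurity_set]


samples = ["Follow our account!", "http://example.com/ad.png"]
page_3 = ["Third article body", "Follow our account!", "http://example.com/ad.png"]
clean = filter_impurities(page_3, samples)
```

Only exact matches are removed in this sketch; the order of the remaining nodes is preserved so the body text can be reassembled afterwards.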

In practice, the webpage impurity in webpages of the same type may not be fixed: over time, the original impurity information may disappear or new impurity information may appear. Therefore, as one embodiment and with reference to Fig. 2, the above comparison of the body-text extraction information of at least two target webpages, and the confirmation as webpage impurity of the node information found to be identical in it, can be carried out as follows:

Step S1011: extract the body-text extraction information of the first target webpage and save it in the database corresponding to the type to which the first target webpage belongs, thereby initializing the database;

In a specific implementation, for example, a corresponding database can first be set up for the type to which the first target webpage belongs, and the body-text extraction information of the first target webpage is then saved in that database to initialize it. In practice, if the body-text extraction information includes both text information and pictures, separate data tables can also be set up for the text information and for the pictures so that each can be compared later;

Step S1012: extract the body-text extraction information of the next target webpage, and compare each item of child node information in it with each item of child node information in the body-text extraction information of the target webpages saved in the database, where the next target webpage belongs to the same type of webpage as the first target webpage;

It should be noted that the child node information may include text information, pictures, or both. To improve comparison efficiency and also reduce the storage pressure on the database, the text information in each item of child node information can be hash-encoded, so that the database stores the hash-encoded value corresponding to the text information of each child node; for pictures, the database can store the image link information corresponding to the picture in each item of child node information. The above comparison of each item of child node information with each item of child node information in the body-text extraction information of the target webpages saved in the database is then performed on the hash-encoded value of the text information and/or on the image link information, which is not elaborated further here.
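
A minimal sketch of the comparison keys described above: text is reduced to a hash value and pictures are represented by their link. The source does not name a hash function, so SHA-256 is an assumption here, as are the whitespace stripping, the dictionary node layout with `text` and `img` keys, and the function name `fingerprint`.

```python
import hashlib


def fingerprint(node):
    """Build comparison keys for one child node.

    `node` is a hypothetical dict with optional "text" and "img" entries.
    Text is stored as a hash digest (SHA-256 is an assumed choice), and a
    picture is stored as its link, as the description suggests.
    """
    keys = []
    text = (node.get("text") or "").strip()
    if text:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        keys.append(("text", digest))
    link = node.get("img")
    if link:
        keys.append(("img", link))
    return keys


same = fingerprint({"text": "Follow our account!"}) == fingerprint({"text": " Follow our account! "})
```

Storing fixed-length digests instead of full paragraphs keeps the stored rows small and turns each comparison into an equality check on short strings.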

Preferably, the child nodes can be, for example, leaf nodes of the DOM tree. The region of a leaf node contains no other nodes, so using leaf nodes as the objects of comparison can improve the precision of the comparison.
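
One way to collect leaf-node content is sketched below with Python's standard `html.parser` module. The class name `LeafCollector`, the helper `extract_leaves`, and the simplifications (an element counts as a leaf when it contains no non-void child element; `img` tags are recorded by their `src`; mismatched end tags are ignored) are assumptions made for illustration, not the method prescribed by the source.

```python
from html.parser import HTMLParser

# Elements that never have an end tag or children.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LeafCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as [tag, has_child_element, text_parts]
        self.leaves = []  # ("text", str) or ("img", src) entries

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.leaves.append(("img", src))
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack[-1][1] = True  # parent now contains a child element
        self.stack.append([tag, False, []])

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][2].append(data)

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack or self.stack[-1][0] != tag:
            return  # this sketch ignores void and mismatched end tags
        _, has_child, parts = self.stack.pop()
        text = "".join(parts).strip()
        if not has_child and text:
            self.leaves.append(("text", text))


def extract_leaves(html):
    parser = LeafCollector()
    parser.feed(html)
    parser.close()
    return parser.leaves


sample = '<div><p>Hello</p><p>Follow <b>us</b></p><img src="http://example.com/ad.png"></div>'
leaves = extract_leaves(sample)
```

In the sample, the second paragraph contains a nested `b` element, so only the inner `b` is reported as a text leaf; this is the finer granularity that leaf-level comparison relies on.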

Step S1013: confirm child node information that the comparison finds to be identical as webpage impurity, and save the body-text extraction information of the next target webpage in the database;

In a specific implementation, if the comparison result is identical, the identical child node information can be confirmed as webpage impurity. In addition, to avoid missing webpage impurity and to make it easier to extract new webpage impurity, the body-text extraction information of the next target webpage can also be saved in the database. The method can then return to step S1012, that is, continue by extracting the body-text extraction information of the following target webpage.

It should be noted that, to relieve the storage pressure on the database, as a preferred embodiment a corresponding counter can also be set for each item of child node information saved in the database;

According to the comparison result, child node information found to be identical is each time determined to be webpage impurity, and the counter of child node information found to be different is incremented by one; once the value of a counter reaches a threshold, the child node information corresponding to that counter is no longer kept in the database.
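
The counter-based pruning of stored child node information might look like the following sketch. The in-memory dictionary standing in for the database, the name `compare_and_age`, and the default threshold of 3 are assumptions; the source only states that a threshold exists.

```python
def compare_and_age(store, page_keys, threshold=3):
    """Compare one new page against stored child node keys.

    `store` maps each stored key to the number of comparisons in which it
    did not match (a stand-in for the database). Keys that match the new
    page are returned as webpage impurities; keys that do not match are
    aged and dropped once their counter reaches `threshold`.
    """
    page_keys = set(page_keys)
    impurities = set()
    for key in list(store):
        if key in page_keys:
            impurities.add(key)
        else:
            store[key] += 1
            if store[key] >= threshold:
                del store[key]  # no longer kept in the database
    for key in page_keys:
        store.setdefault(key, 0)  # the new page is saved as well
    return impurities


store = {"ad": 0, "old paragraph": 0}
first = compare_and_age(store, ["ad", "new paragraph"], threshold=2)
second = compare_and_age(store, ["ad"], threshold=2)
```

After the second call, the entry that missed twice has been evicted, while the repeated `"ad"` key has been reported as impurity both times.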

In addition, the locked webpage impurity of a given type of webpage may change after a period of time. Accordingly, as an optional embodiment, a corresponding counter can also be set for each webpage impurity. When impurity information is filtered from a webpage of the same type according to the webpage impurity in step S102, if the body-text extraction information of that webpage contains impurity information identical to the webpage impurity, the counter of that webpage impurity is reset to zero; if it contains no impurity information identical to the webpage impurity, the counter of that webpage impurity is incremented by one. Once the value of a counter reaches the threshold, the webpage impurity corresponding to that counter is no longer kept.
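
The reset-or-increment rule for locked impurities can be sketched as below. The dictionary `impurity_misses`, the function name `filter_and_update`, and the default threshold are illustrative assumptions.

```python
def filter_and_update(page_keys, impurity_misses, threshold=3):
    """Filter one page and maintain the miss counter of each impurity.

    `impurity_misses` maps each locked impurity to its count of
    consecutive pages in which it was absent. A match resets the counter
    to zero; a miss increments it, and the impurity is released once the
    counter reaches `threshold`.
    """
    present = set(page_keys)
    for impurity in list(impurity_misses):
        if impurity in present:
            impurity_misses[impurity] = 0
        else:
            impurity_misses[impurity] += 1
            if impurity_misses[impurity] >= threshold:
                del impurity_misses[impurity]
    # Keep only the nodes that are not locked impurities.
    return [key for key in page_keys if key not in impurity_misses]


misses = {"ad": 1}
body = filter_and_update(["paragraph", "ad"], misses, threshold=2)
```

Because a match resets the counter, an impurity is released only after it has been absent from `threshold` consecutive pages, which tolerates an occasional article that omits the advertisement.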

The following takes WeChat official account webpages as an example of target webpages. The advertising information in the webpage articles of the same WeChat official account is reused; that is, the advertising information in each webpage article of the same official account is identical (or at least identical within a period of time), and the webpages of the official account are webpages of the same type with the same or a similar page structure. Because the advertising information is not the body text that needs to be extracted, it is impurity information. This embodiment exploits the fact that advertising information repeats within the same WeChat official account: by comparing the body-text extraction information of at least two target webpages of an official account (for example, the body-text extraction information of two articles published by the official account at adjacent times), the webpage impurity can be found, and rejecting that webpage impurity information yields more accurate body text.

In a specific implementation, for example, the WeChat official account is analyzed first and a corresponding database is set up for it in advance. The database can store temporary data, the webpage impurity data that serves as samples, and so on. When body text extraction is carried out, the database is queried to check whether a database has already been set up for the official account; if not, a corresponding database is set up for each WeChat official account. Two tables are created in it: one for storing the picture information of the body display area and one for storing the text information of the body display area;

If the database is being initialized for the first time, the text information of each child node in a WeChat webpage article of the official account is hash-encoded, and the resulting text hash-encoded values, together with the image link information of each child node, are stored in the database. In a specific implementation, for example, the body display area of the target webpage is chosen to be the body node of the page; the body node is parsed into a DOM tree, the text information (that is, the Chinese character string) of each leaf node of the DOM tree is extracted and hash-encoded, the image link information of each leaf node is extracted at the same time, and the hash-encoded value of the text information and the image link information of each leaf node are saved in the database.
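
The two-table layout and first-time initialization described above could be sketched with Python's built-in `sqlite3` module as follows. The table names `text_nodes` and `image_nodes`, the `locked` column, the choice of SQLite and SHA-256, and the function names are all assumptions for illustration; the source does not specify a storage engine or schema.

```python
import hashlib
import sqlite3


def open_account_db(path=":memory:"):
    """Open (or create) the per-account database with its two tables."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS text_nodes ("
        "hash TEXT PRIMARY KEY, locked INTEGER NOT NULL DEFAULT 0)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS image_nodes ("
        "link TEXT PRIMARY KEY, locked INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def initialize(conn, leaf_texts, image_links):
    """Store hashed leaf text and image links from the first article."""
    for text in leaf_texts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        conn.execute("INSERT OR IGNORE INTO text_nodes (hash) VALUES (?)", (digest,))
    for link in image_links:
        conn.execute("INSERT OR IGNORE INTO image_nodes (link) VALUES (?)", (link,))
    conn.commit()


conn = open_account_db()
initialize(conn, ["First paragraph", "Follow our account!"], ["http://example.com/ad.png"])
text_rows = conn.execute("SELECT COUNT(*) FROM text_nodes").fetchone()[0]
image_rows = conn.execute("SELECT COUNT(*) FROM image_nodes").fetchone()[0]
```

The hypothetical `locked` column marks rows that later comparisons confirm as webpage impurity, so that unlocked rows can be erased without touching the retained samples.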

Processing then continues with the next WeChat webpage article of the official account. The hash-encoded values of the text information and the image link information of each leaf node in the newly processed article are compared with the data stored in the database. If the comparison result is identical, the identical leaf node information is judged to be webpage impurity and is locked in the database as webpage impurity, where it can be retained long-term as a sample. Data that is not locked can be deleted; that is, when the data table is not being initialized for the first time, information in the database that has not been locked can be erased, which reduces the storage pressure on the database. A corresponding counter can be set up for each locked webpage impurity. As subsequent WeChat webpage articles of the official account are processed, the counter is incremented by 1 whenever the webpage impurity is not matched and is reset to zero whenever it is matched; the webpage impurity is removed from the database only when the counter reaches a certain threshold.

In addition, to avoid missing webpage impurity and to make it easier to extract new webpage impurity, the first WeChat webpage article used to initialize the official account is saved in the database. After the second WeChat webpage article of the official account has been compared with the first and webpage impurity has been locked, all information of the first article other than the webpage impurity information is removed from the database, and the full information of the second article is saved in the database. The third WeChat webpage article of the official account is then processed; when it is compared with the second article in the database, new webpage impurity may be found. Processing continues in this way with each following WeChat webpage article of the official account, cyclically locking webpage impurity and filtering the WeChat webpage articles of the official account according to the locked webpage impurity to obtain more accurate body text.
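
The rolling comparison across consecutive articles can be sketched as follows. The class `AccountState`, the function `process_article`, and the use of in-memory sets in place of the database are assumptions made to keep the example small; counters for releasing impurities are left out here.

```python
class AccountState:
    """Hypothetical per-account state standing in for the database."""

    def __init__(self):
        self.locked = set()    # confirmed webpage impurities (samples)
        self.previous = set()  # full node keys of the most recent article


def process_article(state, keys):
    """Lock keys shared with the previous article, then filter this one."""
    current = set(keys)
    # Anything repeated from the previous article is locked as impurity.
    state.locked |= current & state.previous
    # The previous article's unlocked data is discarded; the full
    # information of the current article replaces it.
    state.previous = current
    return [key for key in keys if key not in state.locked]


state = AccountState()
out_1 = process_article(state, ["p1", "ad"])
out_2 = process_article(state, ["p2", "ad", "qr"])
out_3 = process_article(state, ["p3", "ad", "qr"])
```

The first article cannot be filtered because nothing has been locked yet. The second article locks `"ad"`, and the third locks `"qr"` as well, which shows how keeping the latest article in full lets new impurities surface over time.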

It should be noted that this embodiment sets no upper limit on the number of locked webpage impurities. Once locked, a webpage impurity serves as a sample and is released only when its corresponding counter reaches the threshold (that is, when the condition for removal is met).

It should also be noted that the first, second, and third WeChat webpage articles mentioned above are used merely for convenience of description and do not limit the present invention.

Body text extraction is illustrated below with webpages of a WeChat official account.

Taking the XX official account as an example, the articles in two of its webpages are shown in Fig. 3 and Fig. 4 respectively. The article in each webpage ends with advertising information used for promotion, and the advertising information in the two articles is identical. This advertising information is webpage impurity information; it belongs to the same node as the paragraphs of the body text and is presented at the same level as them. In this embodiment the webpage impurity information includes the text information shown in Fig. 5 and the picture impurity information shown in Fig. 6. The articles of other webpages of the XX official account in fact carry the same advertising information (that is, webpage impurity information). Because a WeChat official account is customized by its administrator, the node structure of its webpages is not fixed, and the webpage impurity information is mixed with the body text, so the body text extracted through a fixed body node includes the above advertising information. However, the advertising information under the same WeChat official account usually does not change within a period of time. Therefore, the body-text extraction information of the two webpages above can be compared, and the identical child node information obtained can be confirmed as webpage impurity information (the information shown in Fig. 5 and Fig. 6) and saved as samples. When body text extraction is later performed on other webpages of the XX official account, rejecting the text impurity information shown in Fig. 5 and the picture impurity information shown in Fig. 6 from the extracted body-text extraction information yields more accurate body text.

Another aspect of the present invention is described below.

Refer to Fig. 7, which is a schematic diagram of the composition of a webpage body text extraction device according to an exemplary embodiment. In this embodiment the device mainly includes:

a webpage impurity confirmation module 1, which in this embodiment is mainly used to compare the body-text extraction information of at least two target webpages and to confirm as webpage impurity the node information that the comparison finds to be identical in the body-text extraction information of the at least two target webpages, where the at least two target webpages belong to the same type of webpage;

a filter processing module 2, which in this embodiment is mainly used to filter impurity information from webpages of the same type according to the webpage impurity to obtain the body text of the webpage.

It should be noted that, for the device in the above embodiment, the specific manner in which each module performs its operations has been described in detail in the embodiments of the method and is not explained in detail here.

Fig. 8 is a block diagram of a device 800 for webpage body text extraction according to an exemplary embodiment. For example, the device 800 can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a tablet device, a personal digital assistant, and so on.

Referring to Fig. 8, the device 800 can include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 can include one or more processors 820 to execute instructions so as to complete all or part of the steps of the above method. In addition, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application or method operated on the device 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.

The power component 806 provides power to the various components of the device 800. The power component 806 can include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.

Multimedia groupware 808 is included in one output interface of offer between described device 800 and user Screen.In certain embodiments, screen can include liquid crystal display (LCD) and touch panel (TP). If screen includes touch panel, screen may be implemented as touch-screen, to receive the input from user Signal.Touch panel includes one or more touch sensors with sensing touch, slip and touch panel Gesture.The touch sensor can not only sensing touch or sliding action border, but also detect The duration related to the touch or slide and pressure.In certain embodiments, multimedia group Part 808 includes a front camera and/or rear camera.When equipment 800 is in operator scheme, such as When screening-mode or video mode, front camera and/or rear camera can receive outside multimedia Data.Each front camera and rear camera can be a fixed optical lens system or have Focusing and optical zoom capabilities.

Audio-frequency assembly 810 is configured as output and/or input audio signal.For example, audio-frequency assembly 810 is wrapped A microphone (MIC) is included, when device 800 is in operator scheme, such as call model, logging mode During with speech recognition mode, microphone is configured as receiving external audio signal.The audio signal for being received Can be further stored in memory 804 or be sent via communication component 816.In certain embodiments, Audio-frequency assembly 810 also includes a loudspeaker, for exports audio signal.

I/O interfaces 812 are that interface, above-mentioned periphery are provided between processing assembly 802 and peripheral interface module Interface module can be keyboard, click wheel, button etc..These buttons may include but be not limited to：Homepage is pressed Button, volume button, start button and locking press button.

Sensor cluster 814 includes one or more sensors, for providing various aspects for device 800 State estimation.For example, sensor cluster 814 can detect the opening/closed mode of equipment 800, The relative positioning of component, such as described component is the display and keypad of device 800, sensor cluster 814 can be with the change of the position of 800 1 components of detection means 800 or device, user and device 800 Presence or absence of, the temperature change of the orientation of device 800 or acceleration/deceleration and device 800 of contact.Pass Sensor component 814 can include proximity transducer, be configured to be examined when without any physical contact Survey the presence of object nearby.Sensor cluster 814 can also include optical sensor, such as CMOS or CCD Imageing sensor, for being used in imaging applications.In certain embodiments, the sensor cluster 814 Acceleration transducer can also be included, gyro sensor, Magnetic Sensor, pressure sensor or temperature are passed Sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other equipment. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as the memory 804 including instructions, where the instructions can be executed by the processor 820 of the device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

A non-transitory computer-readable storage medium, wherein, when the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform a web page text extraction method, the method comprising:

Other embodiments of the invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present invention is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and that include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.

It should be understood that the invention is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent substitution, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A web page text extraction method, characterized by comprising:

comparing the text extraction information of at least two target web pages, and confirming as web page impurities the node information that the comparison result shows to be identical in the text extraction information of the at least two target web pages, the at least two target web pages belonging to the same type of web page;

2. The method according to claim 1, characterized in that comparing the text extraction information of the at least two target web pages, and confirming as web page impurities the node information that the comparison result shows to be identical in the text extraction information of the at least two target web pages, specifically comprises:

extracting the text extraction information of the next target web page, and comparing each piece of child node information therein with each piece of child node information in the text extraction information of the target web pages stored in the database, the next target web page belonging to the same type of web page as the first target web page;

confirming as web page impurities the child node information that the comparison result shows to be identical, and saving the text extraction information of the next target web page to the database;
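For illustration only, the incremental comparison described in claim 2 might be sketched as follows. The function name `find_impurities`, the representation of each page as a list of child node strings, and the in-memory list standing in for the database are assumptions made for this sketch and are not taken from the claims.

```python
from typing import List, Set


def find_impurities(stored_pages: List[List[str]], next_page: List[str]) -> Set[str]:
    """Compare the child nodes of next_page with those of the stored pages.

    stored_pages is a hypothetical in-memory stand-in for the database of
    previously extracted pages of the same type. Child nodes that also occur
    in a stored page are returned as web page impurities, and the next page
    is then saved alongside the earlier ones.
    """
    seen: Set[str] = set()
    for page in stored_pages:
        seen.update(page)
    impurities = {node for node in next_page if node in seen}
    stored_pages.append(list(next_page))  # keep the new page for later comparisons
    return impurities


# Hypothetical child nodes from two pages of the same type.
database = [["Follow our account", "Article one body", "Scan the QR code"]]
second_page = ["Follow our account", "Article two body", "Scan the QR code"]
impurities = find_impurities(database, second_page)
body = [node for node in second_page if node not in impurities]
```

In this sketch the nodes shared by both pages are treated as impurities, and filtering them out of the second page leaves only the node that is unique to it.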

3. The method according to claim 2, characterized by further comprising:

according to the comparison result, each time determining as a web page impurity the child node information for which the comparison result is identical; incrementing by one the counter of the child node information for which the comparison result is different; and, after the value of a counter reaches a threshold, no longer storing in the database the child node information corresponding to that counter.
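One possible reading of the counter mechanism in claim 3 is sketched below, under stated assumptions: the database is modeled as a dictionary from child node information to a counter, `THRESHOLD` is a hypothetical value (the claim does not fix a number), and the name `compare_and_age` is invented for this example.

```python
from typing import Dict, List, Set

THRESHOLD = 3  # hypothetical value; the claim does not specify one


def compare_and_age(stored: Dict[str, int], page_nodes: List[str]) -> Set[str]:
    """Compare one page against stored child nodes and age the unmatched ones.

    stored maps each stored child node to a counter of comparisons in which
    it did not match. Matching nodes are returned as impurities. Unmatched
    stored nodes have their counter incremented and are dropped once the
    counter reaches THRESHOLD. Nodes of the new page are then stored.
    """
    page_set = set(page_nodes)
    impurities = {node for node in stored if node in page_set}
    for node in list(stored):
        if node not in impurities:
            stored[node] += 1
            if stored[node] >= THRESHOLD:
                del stored[node]  # no longer kept in the database
    for node in page_set:
        stored.setdefault(node, 0)  # save the new page's child nodes
    return impurities


# Hypothetical state: "old body" has already failed to match twice.
stored_nodes = {"footer": 0, "old body": 2}
found = compare_and_age(stored_nodes, ["footer", "new body"])
```

Dropping child nodes that repeatedly fail to match keeps the stored data small, since such nodes are likely body text of a single page rather than recurring impurities.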

4. The method according to claim 2, characterized in that the child node information includes text information and/or pictures;

the comparing of each piece of child node information therein with each piece of child node information in the text extraction information of the target web pages stored in the database is performed using the hash code value of the text information and/or the image link information of the child node information.
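A minimal sketch of such a comparison key is given below. The choice of SHA-256, the stripping of surrounding whitespace, the use of the image link as-is, and the name `node_key` are all assumptions for illustration; the claim only states that a hash code value of the text and/or the image link information is compared.

```python
import hashlib
from typing import Optional, Tuple


def node_key(text: Optional[str] = None, image_url: Optional[str] = None) -> Tuple[str, str]:
    """Build a comparison key for one child node.

    The text part is reduced to a SHA-256 hex digest (an assumed hash
    function), and the image part is the link string itself.
    """
    text_hash = hashlib.sha256(text.strip().encode("utf-8")).hexdigest() if text else ""
    return (text_hash, image_url or "")


# Hypothetical child nodes: two text nodes and one picture node.
key_a = node_key(text="Follow our account  ")
key_b = node_key(text="Follow our account")
key_c = node_key(image_url="http://example.com/qr.png")
same_text = key_a == key_b
```

Comparing fixed-length digests instead of full text keeps the stored data compact and makes equality checks inexpensive, which is one plausible motivation for hashing here.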

5. The method according to claim 1, characterized by further comprising:

setting a corresponding counter for the web page impurity;

when impurity information filtering is performed on the same type of web page according to the web page impurity, if the text extraction information of the same type of web page contains impurity information identical to the web page impurity, resetting the counter corresponding to the web page impurity to zero; if the text extraction information of the same type of web page does not contain impurity information identical to the web page impurity, incrementing the counter corresponding to the web page impurity by one; and, after the value of a counter reaches a threshold, no longer storing the web page impurity corresponding to that counter.
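The counter handling of claim 5 could be sketched as follows. The dictionary of impurity counters, the hypothetical `THRESHOLD` value, and the function name `filter_page` are assumptions made for this example and are not part of the claim.

```python
from typing import Dict, List

THRESHOLD = 3  # hypothetical value; the claim does not specify one


def filter_page(impurity_counters: Dict[str, int], page_nodes: List[str]) -> List[str]:
    """Filter one page and update the counter of every known impurity.

    An impurity found in the page has its counter reset to zero. An impurity
    absent from the page has its counter incremented and is discarded once
    the counter reaches THRESHOLD.
    """
    page_set = set(page_nodes)
    for impurity in list(impurity_counters):
        if impurity in page_set:
            impurity_counters[impurity] = 0
        else:
            impurity_counters[impurity] += 1
            if impurity_counters[impurity] >= THRESHOLD:
                del impurity_counters[impurity]  # impurity is no longer stored
    # Impurities present in the page are still in the dictionary at this point.
    return [node for node in page_nodes if node not in impurity_counters]


# Hypothetical state: "Old promotion" has been absent from two pages already.
counters = {"Follow our account": 0, "Old promotion": 2}
text = filter_page(counters, ["Follow our account", "Article body"])
```

Resetting on a match and expiring after repeated misses lets the stored impurity set follow changes in a site's recurring content, for example when a promotional banner is retired.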

6. The method according to claim 1, characterized in that the same type of web page is a web page belonging to the same WeChat official account.

7. A web page text extraction device, characterized by comprising:

a web page impurity confirmation processing module, configured to compare the text extraction information of at least two target web pages, and to confirm as web page impurities the node information that the comparison result shows to be identical in the text extraction information of the at least two target web pages, the at least two target web pages belonging to the same type of web page;

8. A device for web page text extraction, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs containing instructions for performing the following operations:

9. The device according to claim 8, characterized in that the one or more programs, configured to be executed by the one or more processors, further contain instructions for performing the following operations:

10. The device according to claim 9, characterized in that the one or more programs, configured to be executed by the one or more processors, further contain instructions for performing the following operations:

11. The device according to claim 9, characterized in that the one or more programs, configured to be executed by the one or more processors, further contain instructions for performing the following operations:

the child node information includes text information and/or pictures;

12. The device according to claim 8, characterized in that the one or more programs, configured to be executed by the one or more processors, further contain instructions for performing the following operations:

setting a corresponding counter for the web page impurity;

13. The device according to claim 8, characterized in that the same type of web page is a web page belonging to the same WeChat official account.