CN106855859A - Web page body text extraction method and device
- Publication number
- CN106855859A, CN201510897907.7A
- Authority
- CN
- China
- Prior art keywords
- webpage
- information
- impurity
- text
- child node
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3325—Reformulation based on results of preceding query
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a web page body text extraction method and device. The body text extraction information of at least two target web pages is compared, and node information that the comparison finds identical in the body text extraction information of the at least two target web pages is confirmed as web page impurity, where the at least two target web pages belong to the same type of web page. Impurity information is then filtered from web pages of the same type according to the web page impurity to obtain the text information of those web pages. Because the impurity information in the body text extraction information of same-type target web pages can be determined, the body text extraction information of target web pages of that type can be filtered against it, which yields more accurate text information.
Description
Technical field
The present invention relates to the field of Internet technology and, more particularly, to a web page body text extraction method and device.
Background
At present, web page body text is generally extracted either with a template-based approach or with a text-density-based approach, that is, by selecting a fixed node or by selecting a node that exhibits body text characteristics. In general, a node-selection scheme first captures the source code of the web page with a web crawler, then builds a Document Object Model (DOM) tree from that source code, and then selects the corresponding node and extracts the text information from it. For example, the body text display area of some web pages is fixed on one node, so it is only necessary to find that body text node and take out the text under it. However, when the impurity information to be removed is tightly interleaved with the text information and sits under the same body text node, the prior art cannot remove the impurity information and therefore cannot obtain more accurate web page text information.
Summary of the invention
In view of the above problems, embodiments of the present invention are proposed to provide a web page body text extraction method and a corresponding device that overcome the above problems or at least partially solve them.
To solve the above technical problem, an embodiment of the present invention provides a web page body text extraction method, which includes:
Comparing the body text extraction information of at least two target web pages, and confirming as web page impurity the node information that the comparison finds identical in the body text extraction information of the at least two target web pages, where the at least two target web pages belong to the same type of web page;
Filtering impurity information from web pages of the same type according to the web page impurity to obtain the text information of those web pages.
Comparing the body text extraction information of at least two target web pages, and confirming as web page impurity the node information that the comparison finds identical, specifically includes:
Extracting the body text extraction information of the first target web page and saving it, as initialization, in the database corresponding to the type to which the first target web page belongs;
Extracting the body text extraction information of the next target web page, and comparing each piece of child node information in it with each piece of child node information in the body text extraction information of the target web page saved in the database, where the next target web page and the first target web page belong to the same type of web page;
Confirming as web page impurity the child node information that the comparison finds identical, and saving the body text extraction information of the next target web page in the database;
Returning to the step of extracting the body text extraction information of the next target web page until all target web pages have been traversed.
In addition, the method also includes:
Setting a corresponding counter for every piece of child node information saved in the database;
According to the comparison result, determining as web page impurity, each time, the child node information that the comparison finds identical;
Incrementing by one the counter of each piece of child node information that the comparison finds different and, once the value of a counter reaches a threshold, no longer keeping the child node information corresponding to that counter in the database.
The child node information includes text information and/or pictures;
Comparing each piece of child node information with each piece of child node information in the body text extraction information of the target web page saved in the database means comparing the hash code of the text information of the child node information and/or the picture link information.
In addition, the method also includes:
Setting a corresponding counter for the web page impurity;
When impurity information is filtered from web pages of the same type according to the web page impurity: if the body text extraction information of a same-type web page contains impurity information identical to the web page impurity, resetting the counter of that web page impurity to zero; if it contains no impurity information identical to the web page impurity, incrementing the counter of that web page impurity by one; and, once the value of the counter reaches a threshold, no longer keeping the web page impurity corresponding to that counter.
The same-type web pages are web pages belonging to the same WeChat official account.
According to one aspect of the present invention, an embodiment of the present invention provides a web page body text extraction device, which includes:
A web page impurity confirmation processing module, configured to compare the body text extraction information of at least two target web pages and to confirm as web page impurity the node information that the comparison finds identical in the body text extraction information of the at least two target web pages, where the at least two target web pages belong to the same type of web page;
A filter processing module, configured to filter impurity information from web pages of the same type according to the web page impurity to obtain the text information of those web pages.
According to one aspect of the present invention, an embodiment of the present invention provides a device for web page body text extraction, which includes a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for the following operations:
Comparing the body text extraction information of at least two target web pages, and confirming as web page impurity the node information that the comparison finds identical in the body text extraction information of the at least two target web pages, where the at least two target web pages belong to the same type of web page;
Filtering impurity information from web pages of the same type according to the web page impurity to obtain the text information of those web pages.
In addition, the one or more programs configured to be executed by the one or more processors also contain instructions for the following operations:
Extracting the body text extraction information of the first target web page and saving it, as initialization, in the database corresponding to the type to which the first target web page belongs;
Extracting the body text extraction information of the next target web page, and comparing each piece of child node information in it with each piece of child node information in the body text extraction information of the target web page saved in the database, where the next target web page and the first target web page belong to the same type of web page;
Confirming as web page impurity the child node information that the comparison finds identical, and saving the body text extraction information of the next target web page in the database;
Returning to the step of extracting the body text extraction information of the next target web page until all target web pages have been traversed.
In addition, the one or more programs configured to be executed by the one or more processors also contain instructions for the following operations:
Setting a corresponding counter for every piece of child node information saved in the database;
According to the comparison result, determining as web page impurity, each time, the child node information that the comparison finds identical;
Incrementing by one the counter of each piece of child node information that the comparison finds different and, once the value of a counter reaches a threshold, no longer keeping the child node information corresponding to that counter in the database.
In addition, in the one or more programs configured to be executed by the one or more processors:
The child node information includes text information and/or pictures;
Comparing each piece of child node information with each piece of child node information in the body text extraction information of the target web page saved in the database means comparing the hash code of the text information of the child node information and/or the picture link information.
In addition, the one or more programs configured to be executed by the one or more processors also contain instructions for the following operations:
Setting a corresponding counter for the web page impurity;
When impurity information is filtered from web pages of the same type according to the web page impurity: if the body text extraction information of a same-type web page contains impurity information identical to the web page impurity, resetting the counter of that web page impurity to zero; if it contains no impurity information identical to the web page impurity, incrementing the counter of that web page impurity by one; and, once the value of the counter reaches a threshold, no longer keeping the web page impurity corresponding to that counter.
The same-type web pages are web pages belonging to the same WeChat official account.
The web page body text extraction method and device provided by the embodiments of the present invention compare the body text extraction information of at least two target web pages and confirm as web page impurity the node information that the comparison finds identical in the body text extraction information of the at least two target web pages, where the at least two target web pages belong to the same type of web page. Impurity information is then filtered from web pages of the same type according to the web page impurity to obtain the text information of those web pages. Because the impurity information in the body text extraction information of same-type target web pages can be determined, the body text extraction information of target web pages of that type can be filtered against it, which yields more accurate text information.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below are evidently only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a flow chart of a web page body text extraction method according to an exemplary embodiment of the present invention;
Fig. 2 is a flow chart of confirming web page impurity in a web page body text extraction method according to an exemplary embodiment of the present invention;
Fig. 3 is a schematic diagram of a web page article of the XX official account according to an exemplary embodiment of the present invention;
Fig. 4 is a schematic diagram of another web page article of the XX official account according to an exemplary embodiment of the present invention;
Fig. 5 is a schematic diagram of the text impurity information that is identical in the two web pages of Fig. 3 and Fig. 4;
Fig. 6 is a schematic diagram of the picture impurity information that is identical in the two web pages of Fig. 3 and Fig. 4;
Fig. 7 is a schematic diagram of the composition of a web page body text extraction device according to an exemplary embodiment of the present invention;
Fig. 8 is an overall schematic diagram of a web page body text extraction device according to an exemplary embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention fall within the scope of protection of the present invention.
Refer to Fig. 1, which is a flow chart of a web page body text extraction method according to an exemplary embodiment of the present invention. The web page body text extraction method of this embodiment mainly includes the following steps:
Step S101: compare the body text extraction information of at least two target web pages, and confirm as web page impurity the node information that the comparison finds identical in the body text extraction information of the at least two target web pages, where the at least two target web pages belong to the same type of web page;
In a specific implementation, same-type web pages are web pages with the same or a similar page structure, for example web pages on the same platform, such as web pages under the same WeChat official account or under the same website. In this embodiment, the body text extraction information of a target web page is the information extracted according to the body text node. In practice, body text extraction on a target web page may capture the page source code with a web crawler, build a Document Object Model (DOM) tree with a Hypertext Markup Language (HTML) parser, and select the corresponding body text node to obtain the body text extraction information. Other web page body text extraction methods may also be used in practice; no specific limitation is imposed here.
In addition, identical node information in this embodiment means node information that is identical across the body text extraction information obtained by performing body text extraction on the target web pages, for example the text information or picture information of a child node that is the same in the body text nodes of two target web pages. Details are not repeated here.
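The comparison in step S101 can be sketched as follows. This is a minimal illustration, assuming each page has already been reduced to a list of node strings; the function name and the sample strings are hypothetical and do not come from the patent.

```python
def find_impurities(page_a_nodes, page_b_nodes):
    """Return the node entries of page A that also occur in page B.

    Each argument is a list of hashable node entries (assumed here to be
    plain strings) taken from the body text node of one page.
    """
    seen_in_b = set(page_b_nodes)
    # Keep page A's order so the result is deterministic.
    return [node for node in page_a_nodes if node in seen_in_b]


article_1 = ["First article, paragraph 1", "Follow our account for more"]
article_2 = ["Second article, paragraph 1", "Follow our account for more"]
impurities = find_impurities(article_1, article_2)
```

Here the promotional line shared by both articles is the only entry reported as impurity.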
Step S102: filter impurity information from web pages of the same type according to the web page impurity to obtain the text information of those web pages;
In a specific implementation, because the impurity information of same-type web pages is the same, more accurate text information can be obtained by removing, from the body text extraction information of a same-type web page, the information identical to the impurity information confirmed in step S101. For example, the impurity information confirmed in step S101 serves as a sample. If the body text extraction information of another web page of the same type contains information identical to that sample, the identical information can be confirmed as impurity information, and filtering it out gives more accurate text information for that web page.
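The filtering in step S102 can be sketched in the same spirit. The sketch assumes the confirmed impurity samples and the page nodes are plain strings; the function name is hypothetical.

```python
def filter_impurities(page_nodes, impurity_samples):
    """Drop every node of the page that equals a confirmed impurity sample."""
    samples = set(impurity_samples)
    return [node for node in page_nodes if node not in samples]


samples = ["Follow our account for more"]
article_3 = ["Third article, paragraph 1",
             "Third article, paragraph 2",
             "Follow our account for more"]
clean_text = filter_impurities(article_3, samples)
```

Only exact matches are removed, which mirrors the identical-information test in the description; fuzzy matching would be a different technique.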
In practice, the web page impurity in web pages of the same type may not be fixed: over time, existing impurity information may disappear and new impurity information may appear. Therefore, as one embodiment, with reference to Fig. 2, the above comparison of the body text extraction information of at least two target web pages, and the confirmation as web page impurity of the node information that the comparison finds identical, may be carried out as follows:
Step S1011: extract the body text extraction information of the first target web page and save it, as initialization, in the database corresponding to the type to which the first target web page belongs;
In a specific implementation, for example, a database may first be set up for the type to which the first target web page belongs, and the body text extraction information of the first target web page is then saved in that database as initialization. In practice, if the body text extraction information includes both text information and pictures, separate data tables may also be set up for the text information and for the pictures so that each can be compared later;
Step S1012: extract the body text extraction information of the next target web page, and compare each piece of child node information in it with each piece of child node information in the body text extraction information of the target web page saved in the database, where the next target web page and the first target web page belong to the same type of web page;
It should be noted that the child node information may include text information, a picture, or both. To make the comparison more efficient and to reduce the storage pressure on the database, the text information in each piece of child node information may be hash-encoded into a corresponding hash code, so that the database stores the hash code of the text information in each piece of child node information. For a picture, the database may store the picture link information of the picture in each piece of child node information. The comparison described above is then a comparison of the hash codes of the text information in the child node information and/or of the picture link information. Details are not repeated here.
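One way to derive the comparison keys described above is sketched below. The patent does not name a hash function, so SHA-256 is an assumption, as are the whitespace normalization, the dictionary layout of a node, and all function names.

```python
import hashlib


def text_fingerprint(text):
    """Hash code for a node's text (SHA-256 is an assumed choice)."""
    # Collapsing whitespace is an added assumption so that layout-only
    # differences do not change the hash code.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def node_keys(node):
    """Comparison keys for one child node.

    `node` is a hypothetical dict with optional "text" and "image_url" keys.
    Text is compared by hash code, pictures by their link.
    """
    keys = set()
    if node.get("text"):
        keys.add(("text", text_fingerprint(node["text"])))
    if node.get("image_url"):
        keys.add(("image", node["image_url"]))
    return keys


keys = node_keys({"text": "Follow  our account", "image_url": "https://example.com/qr.png"})
```

Storing a fixed-length hash code instead of the full text keeps each database row small, which is the storage benefit the description mentions.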
Preferably, the child nodes may be, for example, leaf nodes of the DOM tree. The region of a leaf node contains no other nodes, so using leaf nodes as the objects of comparison can improve the precision of the comparison.
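A rough sketch of collecting leaf-node text and picture links with Python's standard html.parser module follows. It assumes reasonably well-formed markup with explicit closing tags; the class name, the helper name, and the list of void tags are illustrative choices, not part of the patent.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (a partial, assumed list).
VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect text of elements without child elements, plus <img> links."""

    def __init__(self):
        super().__init__()
        self.stack = []   # one [has_child_element, text_parts] per open element
        self.leaves = []  # ("text", str) or ("image", url), in document order

    def handle_starttag(self, tag, attrs):
        if self.stack:
            self.stack[-1][0] = True  # the enclosing element is not a leaf
        if tag in VOID_TAGS:
            if tag == "img":
                src = dict(attrs).get("src")
                if src:
                    self.leaves.append(("image", src))
            return
        self.stack.append([False, []])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        has_child, parts = self.stack.pop()
        text = "".join(parts).strip()
        if not has_child and text:
            self.leaves.append(("text", text))

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1].append(data)


def extract_leaves(html):
    collector = LeafCollector()
    collector.feed(html)
    collector.close()
    return collector.leaves


leaves = extract_leaves(
    '<div><p>Hello</p><p><img src="a.png"></p><p>Follow us</p></div>'
)
```

Text that sits directly inside an element which also has child elements is ignored in this sketch, because such an element is not a leaf.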
Step S1013: confirm as web page impurity the child node information that the comparison finds identical, and save the body text extraction information of the next target web page in the database;
In a specific implementation, if the comparison result is identical, the identical child node information can be confirmed as web page impurity. In addition, to avoid missing web page impurity and to make it easier to extract new web page impurity, the body text extraction information of the next target web page may also be saved in the database. The process then returns to step S1012, that is, it continues by extracting the body text extraction information of the following target web page.
It should be noted that, to relieve the storage pressure on the database, as a preferred embodiment a corresponding counter may also be set for each piece of child node information saved in the database;
According to the comparison result, the child node information that the comparison finds identical is determined as web page impurity each time;
The counter of each piece of child node information that the comparison finds different is incremented by one, and once the value of a counter reaches a threshold, the child node information corresponding to that counter is no longer kept in the database.
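The counter scheme for stored child node information can be sketched as below. The threshold value, the in-memory dictionary standing in for the database, and the function name are assumptions; the patent does not say what happens to the counter of a matched entry, so this sketch leaves it unchanged.

```python
def update_candidates(candidates, new_page_keys, threshold=3):
    """Compare stored node keys with a new page and age out stale ones.

    candidates: dict mapping a stored node key to its miss counter.
    new_page_keys: set of node keys extracted from the new page.
    Returns the set of keys confirmed as web page impurity this round.
    """
    impurities = set()
    for key in list(candidates):
        if key in new_page_keys:
            impurities.add(key)          # identical -> web page impurity
        else:
            candidates[key] += 1         # different -> counter plus one
            if candidates[key] >= threshold:
                del candidates[key]      # no longer kept in the store
    # The new page's own nodes are saved for later comparisons.
    for key in new_page_keys:
        candidates.setdefault(key, 0)
    return impurities


store = {"ad": 0, "para-1": 0}
found = update_candidates(store, {"ad", "para-2"}, threshold=2)
```

With a threshold of 2, "para-1" would be dropped after one more page that does not contain it.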
In addition, the locked web page impurity of a given type of web page may change after a period of time. Accordingly, as an optional embodiment, a corresponding counter may also be set for the web page impurity. When impurity information is filtered from a same-type web page according to the web page impurity in step S102: if the body text extraction information of the same-type web page contains impurity information identical to the web page impurity, the counter of that web page impurity is reset to zero; if it contains no impurity information identical to the web page impurity, the counter of that web page impurity is incremented by one; and once the value of a counter reaches a threshold, the web page impurity corresponding to that counter is no longer kept.
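The reset-or-increment rule for locked web page impurity can be sketched as follows. The threshold value, the dictionary standing in for the database, and the function name are assumptions made for illustration.

```python
def filter_and_age(page_keys, impurity_misses, threshold=5):
    """Filter one page and update the counter of every locked impurity.

    page_keys: list of node keys of the page, in document order.
    impurity_misses: dict mapping a locked impurity key to its counter.
    Returns the page keys that remain after filtering.
    """
    present = set(page_keys)
    for key in list(impurity_misses):
        if key in present:
            impurity_misses[key] = 0         # matched: counter reset to zero
        else:
            impurity_misses[key] += 1        # not matched: counter plus one
            if impurity_misses[key] >= threshold:
                del impurity_misses[key]     # released as a sample
    return [key for key in page_keys if key not in impurity_misses]


locked = {"current-ad": 1, "retired-ad": 1}
remaining = filter_and_age(["para-1", "current-ad"], locked, threshold=2)
```

Resetting on every match means an impurity is released only after it has been absent from a run of consecutive pages, not after the same number of scattered misses.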
The following illustration takes WeChat official account web pages as the target web pages. The advertising information in the web page articles of one WeChat official account is reused, that is, the advertising information of each web page article of the same WeChat official account is identical (or, in other words, identical within a period of time). The web pages of a WeChat official account are same-type web pages with the same or a similar page structure. Because the advertising information is not the text information that needs to be extracted, it is treated as impurity information. This embodiment exploits the fact that advertising information repeats within one WeChat official account: comparing the body text extraction information of at least two target web pages of a given WeChat official account (for example, the body text extraction information of two articles published by that account at adjacent times) is enough to find the web page impurity, and removing that web page impurity information gives more accurate text information.
In a specific implementation, for example, the WeChat official account is first analyzed and a corresponding database is set up for it in advance. The database can store temporary data, the web page impurity data that serves as samples, and so on. When body text extraction is performed, the database is queried to check whether a database has already been set up for the WeChat official account. If not, a corresponding database is set up for each WeChat official account, with two tables: one for storing the picture information of the body text display area and one for storing the text information of the body text display area;
If the database is being initialized for the first time, the text information of each child node of the WeChat web page article of the official account is hash-encoded, and the resulting text hash codes and the picture link information of each child node are stored in the database. In a specific implementation, for example, the body text display area of the target web page is selected as the body node of the page; the body node is parsed into a DOM tree; the text information of each leaf node of the DOM tree, that is, its Chinese character string, is extracted and hash-encoded, and the picture link information of each leaf node is extracted at the same time; the hash code of the text information and the picture link information of each leaf node are then saved in the database.
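The per-account storage described above could be laid out roughly as follows with SQLite. The table names, the column names, and the locked and misses columns are hypothetical; the patent only states that one table holds picture information and the other holds text information.

```python
import sqlite3


def open_account_db(path=":memory:"):
    """Open (or create) the database for one official account.

    One table stores text hash codes and one stores picture links.
    `locked` marks confirmed web page impurity and `misses` is the counter.
    """
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS text_nodes (
            text_hash TEXT PRIMARY KEY,
            locked    INTEGER NOT NULL DEFAULT 0,
            misses    INTEGER NOT NULL DEFAULT 0
        );
        CREATE TABLE IF NOT EXISTS image_nodes (
            image_url TEXT PRIMARY KEY,
            locked    INTEGER NOT NULL DEFAULT 0,
            misses    INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


conn = open_account_db()
conn.execute("INSERT INTO text_nodes (text_hash) VALUES (?)", ("abc123",))
rows = conn.execute("SELECT text_hash, locked, misses FROM text_nodes").fetchall()
```

Using the hash code or link as the primary key makes the lookup for an identical entry a single indexed query.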
Processing then continues with the next WeChat web page article of the official account. The hash codes of the text information of each leaf node and the picture link information obtained from that article are compared with the data stored in the database. If the comparison result is identical, the identical leaf node information is judged to be web page impurity and is locked in the database as web page impurity, where it can be kept for a long time as a sample. The other data that is not locked can be deleted; that is, when the data table is not being initialized for the first time, the information in the database that has not been locked can be erased, which reduces the storage pressure on the database. A corresponding counter can be set up for each locked web page impurity. When subsequent WeChat web page articles of the official account are processed, the counter is incremented by 1 if the web page impurity is not matched and reset to zero if it is matched, and the web page impurity is removed from the database only when the counter reaches a certain threshold.
In addition, to avoid missing web page impurity and to make it easier to extract new web page impurity, the first WeChat web page article used to initialize the official account is saved in the database. After the second WeChat web page article of the official account has been compared with the first and a web page impurity has been locked, the information of the first article other than the web page impurity information is removed from the database, and the full information of the second article is saved in the database at the same time. Processing then continues with the third WeChat web page article of the official account; comparing the third article with the second article in the database may find further new web page impurity. Processing continues in this way with each following WeChat web page article of the official account, cyclically locking web page impurity and filtering the WeChat web page articles of the official account according to the locked web page impurity to obtain more accurate text information.
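The rolling comparison described above can be sketched as a small state object. The class name, the use of in-memory sets instead of a database, and the sample keys are assumptions; counters and release of locked impurity are left out to keep the sketch short.

```python
class AccountState:
    """Rolling impurity detection for the articles of one official account."""

    def __init__(self):
        self.locked = set()    # confirmed web page impurity samples
        self.previous = set()  # unlocked node keys of the previous article

    def process(self, article_keys):
        """Lock newly repeated keys, then return the filtered article."""
        current = set(article_keys)
        # Anything shared with the previous article becomes locked impurity.
        self.locked |= self.previous & current
        # Only the current article's unlocked keys are kept for the next round.
        self.previous = current - self.locked
        return [key for key in article_keys if key not in self.locked]


state = AccountState()
first = state.process(["a1", "ad"])         # initialization, nothing to compare
second = state.process(["a2", "ad", "qr"])  # "ad" repeats and is locked
third = state.process(["a3", "ad", "qr"])   # "qr" repeats and is locked too
```

The first article cannot be filtered on its own pass because no comparison exists yet; it could be filtered again once impurity has been locked.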
It should be noted that this embodiment sets no upper limit on the number of locked web page impurities. Once locked, a web page impurity remains a web page impurity sample and is released only when its counter reaches the threshold (that is, when the removal condition is met).
It should be noted that the first, second, and third WeChat web page articles above are named only for convenience of description and do not limit the present invention.
Body text extraction is illustrated below with web pages of one WeChat official account type as an example.
Take the XX official account as an example. The articles in two of its web pages are shown in Fig. 3 and Fig. 4 respectively. The end of the article in each of the two web pages carries advertising information for promotion, and the advertising information in the two articles is identical. This advertising information is the web page impurity information, and it belongs to the same node as the paragraphs of the body text, presented alongside them as siblings. In this embodiment the web page impurity information includes the text information shown in Fig. 5 and the picture shown in Fig. 6. The articles in other web pages of the XX official account in fact carry the same advertising information (that is, web page impurity information) as well. However, because a WeChat official account page is customized by the account's administrator, the node structure of the page is not fixed and the web page impurity information is mixed in with the text information, so text information extracted with a fixed body text node will include the advertising information. The advertising information under the same WeChat official account, however, usually does not change within a period of time. Therefore, the body text extraction information of the two web pages above can be compared, and the identical child node information obtained can be confirmed as web page impurity information (that is, the information shown in Fig. 5 and Fig. 6) and saved as a sample. When body text extraction is later performed on other web pages of the XX official account, removing the text impurity information shown in Fig. 5 and the picture impurity information shown in Fig. 6 from the extracted body text extraction information gives more accurate text information.
Another aspect of the present invention is described below.
Refer to Fig. 7, which is a schematic diagram of the composition of a web page body text extraction device according to an exemplary embodiment. The device of this embodiment mainly includes:
A web page impurity confirmation processing module 1, which in this embodiment is mainly used to compare the body text extraction information of at least two target web pages and to confirm as web page impurity the node information that the comparison finds identical in the body text extraction information of the at least two target web pages, where the at least two target web pages belong to the same type of web page;
A filter processing module 2, which in this embodiment is mainly used to filter impurity information from web pages of the same type according to the web page impurity to obtain the text information of those web pages.
It should be noted that, for the device in the above embodiment, the specific way in which each module performs its operations has been described in detail in the embodiments of the method and is not elaborated here.
Fig. 8 is a kind of device 800 extracted for Web page text according to an exemplary embodiment
Block diagram.For example, device 800 can be mobile phone, computer, digital broadcast terminal, information receiving and transmitting
Equipment, tablet device, personal digital assistant etc..
Referring to Fig. 8, the device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls the overall operation of the device 800, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 that execute instructions to complete all or part of the steps of the method described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the device 800. Examples of such data include instructions for any application or method operated on the device 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 806 supplies power to the various components of the device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the device 800 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signal may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 also includes a loudspeaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 814 includes one or more sensors for providing the device 800 with status assessments of various aspects. For example, the sensor component 814 can detect the open/closed state of the device 800 and the relative positioning of components, for example the display and keypad of the device 800. The sensor component 814 can also detect a change in position of the device 800 or of one component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in temperature of the device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the device 800 and other devices. The device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, where the instructions can be executed by the processor 820 of the device 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
A non-transitory computer-readable storage medium, wherein, when the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform a webpage body text extraction method, the method including:
Comparing the body text extraction information of at least two target web pages, and confirming, as webpage impurities, the node information for which the comparison result is identical in the body text extraction information of the at least two target web pages, the at least two target web pages belonging to the same type of web page;
Performing impurity information filtering on web pages of the same type according to the webpage impurities, so as to obtain the body text information of those web pages.
Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include common knowledge or customary technical means in the art that are not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention being indicated by the following claims.
It should be understood that the invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (13)
1. A webpage body text extraction method, characterized by comprising:
comparing the body text extraction information of at least two target web pages, and confirming, as webpage impurities, the node information for which the comparison result is identical in the body text extraction information of the at least two target web pages, the at least two target web pages belonging to the same type of web page;
performing impurity information filtering on web pages of the same type according to the webpage impurities, so as to obtain the body text information of the web pages.
2. The method according to claim 1, characterized in that comparing the body text extraction information of at least two target web pages, and confirming, as webpage impurities, the node information for which the comparison result is identical in the body text extraction information of the at least two target web pages, specifically comprises:
extracting the body text extraction information of a first target web page and saving it into a database corresponding to the type to which the first target web page belongs, for initialization;
extracting the body text extraction information of a next target web page, and comparing each item of child node information therein with each item of child node information in the body text extraction information of the target web pages saved in the database, the next target web page and the first target web page belonging to the same type of web page;
confirming the child node information for which the comparison result is identical as webpage impurities, and saving the body text extraction information of the next target web page into the database;
returning to the step of extracting the body text extraction information of a next target web page, until all target web pages have been traversed.
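For illustration only, and not as part of the claims, the traversal recited in claim 2 could be sketched as follows. The sketch assumes an in-memory set in place of the database and represents each child node by a comparable string key; the name `find_impurities` is hypothetical.

```python
def find_impurities(pages):
    """Traverse pages of one type and collect repeated child nodes.

    `pages` is an iterable of pages, each a list of child node keys.
    Returns the confirmed impurities and the final database contents.
    """
    pages = iter(pages)
    first = next(pages, None)
    if first is None:
        return set(), set()

    database = set(first)      # initialise with the first target page
    impurities = set()
    for nodes in pages:        # extract the next target page, and repeat
        current = set(nodes)
        impurities |= current & database   # identical nodes are impurities
        database |= current                # save this page's nodes as well
    return impurities, database


found, stored = find_impurities([
    ["article 1", "signature"],
    ["article 2", "signature"],
    ["article 3", "signature", "advert"],
])
```

Saving every page into the database means that a node first seen on a later page can still be confirmed as an impurity once it recurs.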
3. The method according to claim 2, characterized by further comprising:
setting a corresponding counter for each item of child node information saved in the database;
according to the comparison result, each time determining the child node information for which the comparison result is identical as webpage impurities; incrementing by one the counter of the child node information for which the comparison result is different; and, once the value of a counter reaches a threshold, no longer keeping the child node information corresponding to that counter in the database.
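Purely as an illustrative sketch outside the claims, the counter mechanism of claim 3 might look like the code below. It assumes the database is a dictionary mapping each stored child node key to its counter; the function name `compare_and_age` and the default threshold of 3 are assumptions. Following the wording of claim 3, a match does not reset the counter.

```python
def compare_and_age(database, page_nodes, threshold=3):
    """Compare one page against the database and age unmatched entries.

    `database` maps each stored child node key to its mismatch counter
    and is updated in place. Returns the nodes confirmed as impurities.
    """
    current = set(page_nodes)
    impurities = {key for key in database if key in current}

    for key in list(database):
        if key not in current:
            database[key] += 1                 # comparison result differs
            if database[key] >= threshold:
                del database[key]              # no longer kept in database

    for key in current:
        database.setdefault(key, 0)            # save the new page's nodes
    return impurities


db = {"signature": 0, "old banner": 2}
confirmed = compare_and_age(db, ["signature", "new article"], threshold=3)
```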
4. The method according to claim 2, characterized in that the child node information includes text information and/or pictures;
the comparing of each item of child node information therein with each item of child node information in the body text extraction information of the target web pages saved in the database is performed by comparing hash code values of the text information of the child node information and/or image link information.
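As a non-limiting sketch, a comparison key of the kind recited in claim 4 could be derived as shown below. The choice of SHA-256, the whitespace normalisation, the `text:`/`image:` prefixes, and the name `node_key` are all assumptions made for this example; the claim itself only calls for a hash code of the text information and for image link information.

```python
import hashlib


def node_key(kind, value):
    """Build a comparison key for one child node.

    Text nodes are compared by a hash of their (whitespace-normalised)
    text; picture nodes are compared by their image link.
    """
    if kind == "text":
        normalised = " ".join(value.split())
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        return "text:" + digest
    if kind == "image":
        return "image:" + value
    raise ValueError(f"unsupported child node kind: {kind!r}")


same = node_key("text", "Follow  us\n") == node_key("text", "Follow us")
```

Comparing fixed-length digests rather than full paragraphs keeps the stored keys small and makes each comparison independent of paragraph length.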
5. The method according to claim 1, characterized by further comprising:
setting a corresponding counter for the webpage impurity;
when performing impurity information filtering on the web pages of the same type according to the webpage impurity, if the body text extraction information of a web page of the same type contains impurity information identical to the webpage impurity, resetting the counter corresponding to the webpage impurity to zero; if the body text extraction information of the web page of the same type does not contain impurity information identical to the webpage impurity, incrementing the counter corresponding to the webpage impurity by one; and, once the value of a counter reaches a threshold, no longer keeping the webpage impurity corresponding to that counter.
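By way of illustration only, the filter-time counter of claim 5 might be sketched as follows, assuming the saved impurities are held in a dictionary that maps each impurity key to its counter. The name `filter_and_age` and the default threshold are hypothetical.

```python
def filter_and_age(page_nodes, impurity_counters, threshold=3):
    """Filter one page and update the counter of every saved impurity.

    `impurity_counters` maps each saved impurity key to its counter and
    is updated in place. Returns the page with impurities removed.
    """
    present = set(page_nodes)
    for impurity in list(impurity_counters):
        if impurity in present:
            impurity_counters[impurity] = 0        # seen again: reset to zero
        else:
            impurity_counters[impurity] += 1       # absent from this page
            if impurity_counters[impurity] >= threshold:
                del impurity_counters[impurity]    # no longer kept
    return [node for node in page_nodes if node not in impurity_counters]


counters = {"signature": 1, "old advert": 2}
body = filter_and_age(["article body", "signature"], counters, threshold=3)
```

Resetting on a match means that only impurities absent from several consecutive pages are dropped, which lets the saved samples follow changes in an account's template.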
6. The method according to claim 1, characterized in that the web pages of the same type are web pages belonging to the same WeChat public account.
7. A webpage body text extraction device, characterized by comprising:
a webpage impurity confirmation processing module, configured to compare the body text extraction information of at least two target web pages, and to confirm, as webpage impurities, the node information for which the comparison result is identical in the body text extraction information of the at least two target web pages, the at least two target web pages belonging to the same type of web page;
a filter processing module, configured to perform impurity information filtering on web pages of the same type according to the webpage impurities, so as to obtain the body text information of the web pages.
8. A device for webpage body text extraction, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs containing instructions for performing the following operations:
comparing the body text extraction information of at least two target web pages, and confirming, as webpage impurities, the node information for which the comparison result is identical in the body text extraction information of the at least two target web pages, the at least two target web pages belonging to the same type of web page;
performing impurity information filtering on web pages of the same type according to the webpage impurities, so as to obtain the body text information of the web pages.
9. The device according to claim 8, characterized in that the one or more programs, configured to be executed by the one or more processors, further contain instructions for performing the following operations:
extracting the body text extraction information of a first target web page and saving it into a database corresponding to the type to which the first target web page belongs, for initialization;
extracting the body text extraction information of a next target web page, and comparing each item of child node information therein with each item of child node information in the body text extraction information of the target web pages saved in the database, the next target web page and the first target web page belonging to the same type of web page;
confirming the child node information for which the comparison result is identical as webpage impurities, and saving the body text extraction information of the next target web page into the database;
returning to the step of extracting the body text extraction information of a next target web page, until all target web pages have been traversed.
10. The device according to claim 9, characterized in that the one or more programs, configured to be executed by the one or more processors, further contain instructions for performing the following operations:
setting a corresponding counter for each item of child node information saved in the database;
according to the comparison result, each time determining the child node information for which the comparison result is identical as webpage impurities; incrementing by one the counter of the child node information for which the comparison result is different; and, once the value of a counter reaches a threshold, no longer keeping the child node information corresponding to that counter in the database.
11. The device according to claim 9, characterized in that the one or more programs, configured to be executed by the one or more processors, further contain instructions for performing the following operations, wherein:
the child node information includes text information and/or pictures;
the comparing of each item of child node information therein with each item of child node information in the body text extraction information of the target web pages saved in the database is performed by comparing hash code values of the text information of the child node information and/or image link information.
12. The device according to claim 8, characterized in that the one or more programs, configured to be executed by the one or more processors, further contain instructions for performing the following operations:
setting a corresponding counter for the webpage impurity;
when performing impurity information filtering on the web pages of the same type according to the webpage impurity, if the body text extraction information of a web page of the same type contains impurity information identical to the webpage impurity, resetting the counter corresponding to the webpage impurity to zero; if the body text extraction information of the web page of the same type does not contain impurity information identical to the webpage impurity, incrementing the counter corresponding to the webpage impurity by one; and, once the value of a counter reaches a threshold, no longer keeping the webpage impurity corresponding to that counter.
13. The device according to claim 8, characterized in that the web pages of the same type are web pages belonging to the same WeChat public account.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510897907.7A CN106855859B (en) | 2015-12-08 | 2015-12-08 | Webpage text extraction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106855859A (en) | 2017-06-16 |
CN106855859B (en) | 2020-11-10 |
Family
ID=59132795
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510897907.7A (Active; CN106855859B (en)) | Webpage text extraction method and device | 2015-12-08 | 2015-12-08 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106855859B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109740101A (en) * | 2019-01-18 | 2019-05-10 | 杭州凡闻科技有限公司 | Data configuration method, public platform article cleaning method, apparatus and system |
Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1786947A (en) * | 2004-12-07 | 2006-06-14 | 国际商业机器公司 | System, method and program for extracting web page core content based on web page layout |
US20090030891A1 (en) * | 2007-07-26 | 2009-01-29 | Siemens Aktiengesellschaft | Method and apparatus for extraction of textual content from hypertext web documents |
US20090063500A1 (en) * | 2007-08-31 | 2009-03-05 | Microsoft Corporation | Extracting data content items using template matching |
CN101826099A (en) * | 2010-02-04 | 2010-09-08 | 蓝盾信息安全技术股份有限公司 | Method and system for identifying similar documents and determining document diffusance |
CN101872350A (en) * | 2009-04-24 | 2010-10-27 | 富士通株式会社 | Web page text extracting method and device thereof |
CN102314513A (en) * | 2011-09-16 | 2012-01-11 | 华中科技大学 | Image text semantic extraction method based on GPU (Graphics Processing Unit) |
CN102479181A (en) * | 2010-11-22 | 2012-05-30 | 中国电信股份有限公司 | Method and device for extracting webpage text based on DIV (Division) position |
CN102541874A (en) * | 2010-12-16 | 2012-07-04 | 中国移动通信集团公司 | Webpage text content extracting method and device |
CN102663041A (en) * | 2012-03-28 | 2012-09-12 | 重庆大学 | Automatic extraction method oriented to data of deep web pages |
CN102810097A (en) * | 2011-06-02 | 2012-12-05 | 高德软件有限公司 | Method and device for extracting webpage text content |
CN103020266A (en) * | 2012-12-25 | 2013-04-03 | 北京奇虎科技有限公司 | Method and device for extracting webpage text content |
CN103955529A (en) * | 2014-05-12 | 2014-07-30 | 中国科学院计算机网络信息中心 | Internet information searching and aggregating presentation method |
CN104376061A (en) * | 2014-11-10 | 2015-02-25 | 武汉传神信息技术有限公司 | Webpage text extracting method |
CN105022803A (en) * | 2015-07-01 | 2015-11-04 | 广州市万隆证券咨询顾问有限公司 | Method and system for extracting text content of webpage |
Non-Patent Citations (2)
Title |
---|
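Yang Liuqing (杨柳青) et al.: "Research on webpage body content extraction based on layout similarity" (基于布局相似性的网页正文内容提取研究), 《计算机应用研究》 (Application Research of Computers) * |
Wang Liang (王亮) et al.: "Research on a webpage body text extraction method based on tree pre-pruning" (基于树先剪枝的网页正文抽取方法研究), 《科技创新与应用》 (Technology Innovation and Application) * |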
Also Published As
Publication number | Publication date |
---|---|
CN106855859B (en) | 2020-11-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||