CN106855859B - Webpage text extraction method and device - Google Patents

Webpage text extraction method and device

Info

Publication number
CN106855859B
Authority
CN
China
Prior art keywords
webpage
information
text
target
text extraction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510897907.7A
Other languages
Chinese (zh)
Other versions
CN106855859A (en)
Inventor
胡又欢
卞维杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sogou Technology Development Co Ltd
Original Assignee
Beijing Sogou Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sogou Technology Development Co Ltd
Priority to CN201510897907.7A
Publication of CN106855859A
Application granted
Publication of CN106855859B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/332: Query formulation
    • G06F16/3325: Reformulation based on results of preceding query

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a webpage text extraction method and a device, which compare text extraction information of at least two target webpages, and confirm node information with the same comparison result in the text extraction information of the at least two target webpages as webpage impurities, wherein the at least two target webpages belong to the same type of webpage; and filtering impurity information of the same type of webpage according to the webpage impurities to obtain the text information of the webpage. The impurity information in the text extraction information of the target webpage of the same type can be determined, and the text extraction information of the target webpage of the same type is filtered according to the impurity information to finally obtain more accurate text information.

Description

Webpage text extraction method and device
Technical Field
The invention relates to the technical field of internet, in particular to a webpage text extraction method and device.
Background
Currently, web page text is generally extracted either with a template-based approach or with a text-density-based approach; that is, text is extracted by selecting fixed nodes or by selecting nodes that have text characteristics. In a typical node-selection scheme, a web crawler first captures the source code of the webpage, a Document Object Model (DOM) tree is then constructed from the source code, and the corresponding nodes are selected to extract the text information. For example, if the text display area of certain webpages is fixed at one node, it is only necessary to find that text node and take out the text under it. However, when the impurity information to be removed is closely interleaved with the text information and lies under the same text node, the prior art cannot remove the impurity information and therefore cannot obtain more accurate text information of the webpage.
Disclosure of Invention
In view of the above problems, embodiments of the present invention are proposed to provide a web page text extraction method and a corresponding apparatus that overcome or at least partially solve the above problems.
In order to solve the technical problem, an embodiment of the present invention provides a method for extracting a text of a web page, including:
comparing the text extraction information of at least two target webpages, and confirming the node information with the same comparison result in the text extraction information of the at least two target webpages as webpage impurities, wherein the at least two target webpages belong to the same type of webpage;
and filtering impurity information of the same type of webpage according to the webpage impurities to obtain the text information of the webpage.
Wherein comparing the text extraction information of the at least two target webpages, and confirming the node information with the same comparison result in the text extraction information of the at least two target webpages as webpage impurities, comprises:
extracting text extraction information of a first target webpage, storing the text extraction information of the first target webpage into a database corresponding to the type of the first target webpage for initialization;
extracting text extraction information of a next target webpage, and comparing each piece of sub-node information with each piece of sub-node information in the text extraction information of the target webpage stored in the database, wherein the next target webpage and the first target webpage belong to the same type of webpage;
confirming that the child node information with the same comparison result is the webpage impurity, and storing the text extraction information of the next target webpage into a database;
and returning to the step of extracting the text extraction information of the next target webpage until all the target webpages are traversed.
In addition, the method further includes:
setting a corresponding counter for each piece of child node information stored in the database;
each time, according to the comparison result, confirming the child node information with the same comparison result as a webpage impurity, and adding one to the counter of each piece of child node information whose comparison result is different; when the value of a counter reaches a threshold value, the child node information corresponding to that counter is no longer stored in the database.
The child node information comprises text information and/or pictures;
and the comparison between the child node information and the child node information in the text extraction information of the target webpage stored in the database is carried out by comparing the hash code value of the text information of the child node information and/or the picture link information.
In addition, the method further includes:
setting a corresponding counter for the webpage impurities;
when the impurity information of the same type of webpage is filtered according to the webpage impurities: if the text extraction information of the same type of webpage contains impurity information that is the same as a webpage impurity, the counter corresponding to that webpage impurity is reset; if it does not, the counter corresponding to that webpage impurity is increased by one; and when the value of the counter reaches a threshold value, the webpage impurity corresponding to the counter is no longer stored.
Wherein, the same type of web pages are web pages belonging to the same WeChat public number.
According to an aspect of the present invention, an apparatus for extracting a text from a web page provided by an embodiment of the present invention includes:
the webpage impurity confirmation processing module is used for comparing the text extraction information of at least two target webpages and confirming the node information with the same comparison result in the text extraction information of the at least two target webpages as the webpage impurity, wherein the at least two target webpages belong to the same type of webpage;
and the filtering processing module is used for filtering impurity information of the same type of webpage according to the webpage impurities to obtain the text information of the webpage.
According to one aspect of the present invention, there is provided an apparatus for webpage text extraction, comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and are configured to be executed by one or more processors, the one or more programs including instructions for:
comparing the text extraction information of at least two target webpages, and confirming the node information with the same comparison result in the text extraction information of the at least two target webpages as webpage impurities, wherein the at least two target webpages belong to the same type of webpage;
and filtering impurity information of the same type of webpage according to the webpage impurities to obtain the text information of the webpage.
In addition, the one or more programs configured to be executed by the one or more processors further include instructions for:
extracting text extraction information of a first target webpage, storing the text extraction information of the first target webpage into a database corresponding to the type of the first target webpage for initialization;
extracting text extraction information of a next target webpage, and comparing each piece of sub-node information with each piece of sub-node information in the text extraction information of the target webpage stored in the database, wherein the next target webpage and the first target webpage belong to the same type of webpage;
confirming that the child node information with the same comparison result is the webpage impurity, and storing the text extraction information of the next target webpage into a database;
and returning to the step of extracting the text extraction information of the next target webpage until all the target webpages are traversed.
In addition, the one or more programs configured to be executed by the one or more processors further include instructions for:
setting a corresponding counter for each piece of child node information stored in the database;
each time, according to the comparison result, confirming the child node information with the same comparison result as a webpage impurity, and adding one to the counter of each piece of child node information whose comparison result is different; when the value of a counter reaches a threshold value, the child node information corresponding to that counter is no longer stored in the database.
In addition, for the one or more programs configured to be executed by the one or more processors:
the child node information comprises text information and/or pictures;
and the comparison between the child node information and the child node information in the text extraction information of the target webpage stored in the database is carried out by comparing the hash code value of the text information of the child node information and/or the picture link information.
In addition, the one or more programs configured to be executed by the one or more processors further include instructions for:
setting a corresponding counter for the webpage impurities;
when the impurity information of the same type of webpage is filtered according to the webpage impurities: if the text extraction information of the same type of webpage contains impurity information that is the same as a webpage impurity, the counter corresponding to that webpage impurity is reset; if it does not, the counter corresponding to that webpage impurity is increased by one; and when the value of the counter reaches a threshold value, the webpage impurity corresponding to the counter is no longer stored.
Wherein, the same type of web pages are web pages belonging to the same WeChat public number.
According to the webpage text extraction method and device provided by the embodiment of the invention, text extraction information of at least two target webpages is compared, and node information with the same comparison result in the text extraction information of the at least two target webpages is confirmed as webpage impurities, wherein the at least two target webpages belong to the same type of webpage; and filtering impurity information of the same type of webpage according to the webpage impurities to obtain the text information of the webpage. The impurity information in the text extraction information of the target webpage of the same type can be determined, and the text extraction information of the target webpage of the same type is filtered according to the impurity information to finally obtain more accurate text information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings can be obtained by those skilled in the art according to the drawings.
FIG. 1 is a flowchart of a web page text extraction method according to an exemplary embodiment of the present invention;
FIG. 2 is a flowchart of a method for confirming webpage impurities according to an exemplary embodiment of the present invention;
FIG. 3 is a schematic diagram of a web page article of the XX public number according to an exemplary embodiment of the invention;
FIG. 4 is a schematic diagram of another web page article of the XX public number according to an exemplary embodiment of the invention;
FIG. 5 is a schematic diagram of textual impurity information that is the same for both of the two web pages of FIGS. 3 and 4;
FIG. 6 is a schematic diagram of the same picture impurity information in both of the two web pages of FIGS. 3 and 4;
FIG. 7 is a block diagram of an apparatus for extracting text from a web page according to an exemplary embodiment of the present invention;
FIG. 8 is an overall schematic diagram of a web page text extraction apparatus according to an exemplary embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments that can be derived by one of ordinary skill in the art from the embodiments given herein are intended to be within the scope of the present invention.
Please refer to fig. 1, which is a flowchart illustrating a web page text extracting method according to an exemplary embodiment of the present invention. The method for extracting the webpage text mainly comprises the following steps:
step S101, comparing the text extraction information of at least two target webpages, and confirming the same node information in the text extraction information of the at least two target webpages as webpage impurities, wherein the at least two target webpages belong to the same type of webpage;
Specifically, in this embodiment the text extraction information of the target webpage is extracted according to text nodes. In practice, a web crawler may capture the source code of the webpage, a hypertext markup language (HTML) parser may then construct a Document Object Model (DOM) tree, and the text extraction information may be obtained by selecting the corresponding text nodes. Other webpage text extraction methods may also be used; no specific limitation is imposed here.
In addition, the same node information in this embodiment refers to the same node information in the text extraction information obtained after the text extraction is performed on the target web page, for example, text information or picture information of a same child node in the text nodes of two target web pages, which is not described herein again.
Step S102, impurity information filtering is carried out on the same type of webpage according to the webpage impurities to obtain text information of the webpage;
In a specific implementation, since the impurity information of webpages of the same type is the same, the impurity information confirmed in step S101 is removed from the text extraction information of webpages of that type, so as to obtain more accurate text information. For example, the impurity information confirmed in step S101 is used as a sample; if the text extraction information of another webpage of the same type contains information identical to the sample, that information is confirmed as impurity information and filtered out, yielding more accurate text information for that webpage.
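The filtering in step S102 can be illustrated with a minimal Python sketch. It assumes that the text extraction information of a webpage is already available as a list of node strings and that the confirmed impurities are held in a set; the function name `filter_impurities` and the sample data are hypothetical and are not taken from the patent.

```python
def filter_impurities(nodes: list[str], impurities: set[str]) -> list[str]:
    """Keep only the nodes that do not match a confirmed impurity sample."""
    return [node for node in nodes if node not in impurities]


# Hypothetical data: one impurity sample confirmed in step S101.
impurity_samples = {"Follow our account for more articles"}
extracted = [
    "First paragraph.",
    "Second paragraph.",
    "Follow our account for more articles",
]
body_text = filter_impurities(extracted, impurity_samples)
```

Exact string equality is used here only for brevity; any equality test on node information would fit the same structure.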
In practice, the webpage impurities in webpages of the same type may not be fixed; that is, over time the original impurity information may disappear or new impurity information may appear. Therefore, as an embodiment and with reference to fig. 2, the comparison of the text extraction information of at least two target webpages, and the confirmation of node information with the same comparison result as webpage impurities, may be carried out as follows:
step S1011, extracting the text extraction information of the first target webpage, storing the text extraction information of the first target webpage into a database corresponding to the type of the first target webpage for initialization;
In a specific implementation, for example, a corresponding database may first be established for the type of the first target webpage, and the text extraction information of the first target webpage is then stored in the database for initialization. In practice, if the text extraction information includes both text information and pictures, separate data tables may be established for the text information and the pictures for subsequent comparison;
step S1012, extracting text extraction information of a next target web page, and comparing each child node information with each child node information in the text extraction information of the target web page stored in the database, where the next target web page and the first target web page belong to the same type of web page;
It should be noted that the child node information may include text information, a picture, or both. To improve comparison efficiency and reduce the storage pressure on the database, the text information in each piece of child node information can be hash-encoded, so that the hash code value corresponding to the text information in each piece of child node information is stored in the database; for a picture, the picture link information corresponding to the picture in each piece of child node information can be stored in the database. The child node information is then compared with the child node information in the text extraction information of the target webpage stored in the database by comparing the hash code values of the text information and/or the picture link information, which is not described in detail here.
Preferably, the child node may be a leaf node in the DOM tree, that is, a node whose area contains no other node. Using leaf nodes as the comparison objects can improve the comparison accuracy.
Step S1013, the child node information with the same comparison result is confirmed as the webpage impurity, and the text extraction information of the next target webpage is stored in a database;
In a specific implementation, if the comparison results are the same, the identical child node information can be confirmed as a webpage impurity. In addition, to avoid missing webpage impurities and to make it easier to extract new ones, the text extraction information of the next target webpage can also be stored in the database. The process then returns to step S1012, that is, the text extraction information of the following target webpage is extracted.
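Steps S1011 to S1013 can be sketched as a simple loop. The sketch below is illustrative only: an in-memory set stands in for the database, each webpage is assumed to be a list of comparable child node keys (for example hash code values or picture links), and the name `find_impurities` is hypothetical.

```python
def find_impurities(pages):
    """Return child node keys that occur in more than one page of the same type.

    pages: iterable of lists of child node keys, all from one webpage type.
    """
    pages = iter(pages)
    # S1011: initialise the store with the first target webpage.
    stored = set(next(pages, []))
    impurities = set()
    for nodes in pages:
        current = set(nodes)
        # S1012: compare each child node with the stored child nodes.
        # S1013: identical child nodes are confirmed as webpage impurities.
        impurities |= current & stored
        # The next webpage's information is stored as well, so that
        # impurities appearing later can still be found.
        stored |= current
    return impurities


sample_pages = [["a1", "ad"], ["a2", "ad"], ["a3", "ad", "a2"]]
found = find_impurities(sample_pages)
```

Note that in this simplified form any repeated node is treated as an impurity, including a body paragraph that happens to recur, as "a2" does in the sample data.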
It should be noted that, in order to reduce the storage pressure of the database, as a preferred embodiment, a corresponding counter may be further set for each piece of child node information stored in the database;
Each time, according to the comparison result, the child node information with the same comparison result is confirmed as a webpage impurity, and the counter of each piece of child node information whose comparison result is different is incremented by one; when the value of a counter reaches a threshold value, the child node information corresponding to that counter is no longer stored in the database.
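A minimal sketch of this per-node counter follows, assuming a dictionary that maps each stored child node key to its counter. The names `compare_and_age` and `threshold`, and the choice to leave the counter of a matched node unchanged, are assumptions made for illustration; the text does not specify them.

```python
def compare_and_age(stored: dict[str, int], nodes: list[str],
                    impurities: set[str], threshold: int) -> None:
    """Compare one webpage against the store and age unmatched entries."""
    current = set(nodes)
    for key in list(stored):
        if key in current:
            # Same comparison result: confirm as a webpage impurity.
            impurities.add(key)
        else:
            # Different comparison result: add one to the counter.
            stored[key] += 1
            if stored[key] >= threshold:
                # Threshold reached: stop storing this child node.
                del stored[key]
    # Store the new webpage's child nodes with fresh counters.
    for key in current:
        stored.setdefault(key, 0)


store: dict[str, int] = {}
confirmed: set[str] = set()
for page in (["a1", "ad"], ["a2", "ad"], ["a3", "ad"]):
    compare_and_age(store, page, confirmed, threshold=2)
```

With a threshold of 2, the node "a1" is dropped after two consecutive pages fail to match it, which keeps the store from growing with one-off body paragraphs.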
In addition, the webpage impurities of a given type of webpage may change after a period of time, even for webpage impurities that have already been locked. Therefore, as an optional embodiment, a corresponding counter may be set for each webpage impurity. In step S102, when the impurity information of the same type of webpage is filtered according to the webpage impurities: if the text extraction information of the webpage contains impurity information identical to a webpage impurity, the counter corresponding to that webpage impurity is cleared; if it does not, the counter corresponding to that webpage impurity is incremented; and when the value of a counter reaches a threshold value, the webpage impurity corresponding to that counter is no longer stored.
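The reset-or-increment rule for locked impurities can be sketched as follows. The dictionary `misses`, the function name `filter_and_refresh`, and the threshold value are hypothetical; only the rule itself (clear on match, add one on miss, drop at the threshold) comes from the text above.

```python
def filter_and_refresh(nodes: list[str], misses: dict[str, int],
                       threshold: int) -> list[str]:
    """Filter one webpage and update the counter of every locked impurity."""
    present = set(nodes)
    for impurity in list(misses):
        if impurity in present:
            misses[impurity] = 0           # matched: clear the counter
        else:
            misses[impurity] += 1          # not matched: add one
            if misses[impurity] >= threshold:
                del misses[impurity]       # stale impurity is no longer stored
    return [node for node in nodes if node not in misses]


locked = {"ad": 0, "old banner": 1}
body = filter_and_refresh(["p1", "p2", "ad"], locked, threshold=2)
```

In this run the impurity "old banner" reaches the threshold and is released, while "ad" is matched, has its counter cleared, and is filtered out of the body.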
In the following, the target webpage is taken to be a WeChat public number webpage as an example. The advertisement information in the webpage articles of the same WeChat public number can be reused; that is, the advertisement information in each webpage article of the same WeChat public number is the same (or is the same within a period of time), and the webpages of a WeChat public number are webpages of the same type with the same or similar webpage structure. The advertisement information is not the text information that needs to be extracted; it is impurity information. Because the advertisement information repeats within the same WeChat public number, the webpage impurities can be found by comparing the text extraction information of at least two target webpages of a given WeChat public number (for example, two articles of the WeChat public number published in adjacent time intervals), and the webpage impurity information can then be removed to obtain more accurate text information.
In a specific implementation, for example, a database is established for each WeChat public number to be analyzed; the database can store temporary data as well as the webpage impurity data that serve as samples. When text extraction is performed, it is first checked whether a database has already been established for the WeChat public number; if not, a corresponding database is established for it. Two tables are created in the database: one stores the picture information of the text display area, and the other stores the text information of the text display area.
If the database is being initialized for the first time, the text hash code values obtained by hash-encoding the text information of each child node in a WeChat webpage article of the WeChat public number, together with the picture link information of each child node, are stored in the database. In a specific implementation, for example, the body node of the page is selected as the text display area of the target webpage; the body node is parsed into a DOM tree; the text information (that is, the Chinese character string) of each leaf node of the DOM tree is extracted and hash-encoded; the picture link information of each leaf node is extracted; and the hash code value of the text information and the picture link information of each leaf node are stored in the database.
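The leaf-node extraction described above can be sketched with Python's standard `html.parser` module. This is an illustration under stated assumptions, not the patented implementation: the class name `LeafCollector` is hypothetical, MD5 is chosen arbitrarily because the text does not name a hash function, the input is assumed to be well-formed HTML with matching end tags, and void elements such as `img` and `br` are not counted as child elements.

```python
import hashlib
from html.parser import HTMLParser

VOID_TAGS = {"img", "br", "hr", "input", "meta", "link"}


class LeafCollector(HTMLParser):
    """Collect text hashes of leaf elements and the links of pictures."""

    def __init__(self):
        super().__init__()
        self.stack = []        # one [has_child_element, text_parts] per open element
        self.text_hashes = []
        self.image_links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            if tag == "img":
                src = dict(attrs).get("src")
                if src:
                    self.image_links.append(src)
            return
        if self.stack:
            self.stack[-1][0] = True   # the parent is no longer a leaf
        self.stack.append([False, []])

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.stack:
            return
        has_child, parts = self.stack.pop()
        text = "".join(parts).strip()
        if not has_child and text:
            digest = hashlib.md5(text.encode("utf-8")).hexdigest()
            self.text_hashes.append(digest)

    def handle_data(self, data):
        if self.stack:
            self.stack[-1][1].append(data)


collector = LeafCollector()
collector.feed('<div><p>Hello</p><p>Ad text</p>'
               '<img src="http://example.com/a.png"></div>')
```

After `feed`, `collector.text_hashes` holds one digest per leaf paragraph and `collector.image_links` holds the picture links; these are the values that would be written to the two tables.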
Processing then continues with the next WeChat webpage article of the WeChat public number. The hash code value of the text information and the picture link information of each leaf node in that article are compared with the data stored in the database. If the comparison results are the same, the identical leaf node information is judged to be a webpage impurity; it is locked in the database as a webpage impurity and kept as a sample for a long time. Other data that are not locked are deleted; that is, whenever the data table is not being initialized for the first time, the unlocked information in the database is cleared to reduce the storage pressure on the database. A corresponding counter is established for each locked webpage impurity. When subsequent WeChat webpage articles of the WeChat public number are processed, the counter is incremented by 1 each time the webpage impurity is not matched and is cleared each time it is matched successfully; the webpage impurity is removed from the database only when the counter reaches a certain threshold value.
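The two tables and the lock-then-purge step can be sketched with Python's built-in `sqlite3` module. The table names, column names, and helper functions below are all assumptions made for illustration; the text only states that one table holds text information, another holds picture information, and that matched entries are locked while unlocked entries are deleted.

```python
import sqlite3

TABLES = ("text_nodes", "image_nodes")  # hypothetical table names


def open_account_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create (if needed) the two tables kept for one public number."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS text_nodes (
            key TEXT PRIMARY KEY, locked INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE IF NOT EXISTS image_nodes (
            key TEXT PRIMARY KEY, locked INTEGER NOT NULL DEFAULT 0);
    """)
    return conn


def store(conn: sqlite3.Connection, table: str, keys: list[str]) -> None:
    """Store hash code values or picture links as unlocked rows."""
    if table not in TABLES:
        raise ValueError(table)
    with conn:
        conn.executemany(
            f"INSERT OR IGNORE INTO {table} (key) VALUES (?)",
            [(key,) for key in keys])


def lock_matches(conn: sqlite3.Connection, table: str, keys: list[str]) -> None:
    """Lock stored rows matched by the new article, then delete unlocked rows."""
    if table not in TABLES:
        raise ValueError(table)
    with conn:
        for key in keys:
            conn.execute(f"UPDATE {table} SET locked = 1 WHERE key = ?", (key,))
        conn.execute(f"DELETE FROM {table} WHERE locked = 0")


conn = open_account_db()
store(conn, "text_nodes", ["hash_p1", "hash_ad"])         # first article
lock_matches(conn, "text_nodes", ["hash_p2", "hash_ad"])  # second article
rows = conn.execute("SELECT key, locked FROM text_nodes").fetchall()
```

The per-impurity counter is omitted here to keep the sketch short; it could be added as a further integer column on the same rows.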
In addition, to avoid missing webpage impurities and to extract new ones, the first WeChat webpage article used to initialize the WeChat public number is stored in the database. After the second WeChat webpage article of the WeChat public number has been compared with the first and a webpage impurity has been locked, the information of the first article other than the webpage impurity information is cleared from the database, and at the same time all information of the second article is stored in the database. The third WeChat webpage article of the WeChat public number is then processed; when it is compared with the second article in the database, new webpage impurities may be found. Each subsequent WeChat webpage article of the WeChat public number is processed in the same way, so that webpage impurities are locked in a loop, and the WeChat webpage articles of the WeChat public number are filtered according to the locked webpage impurities to obtain more accurate text information.
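The rolling comparison described above, in which only the locked impurities and the most recent article are retained, can be sketched as follows. The function name `process_stream` and the sample data are hypothetical, and in-memory sets stand in for the database.

```python
def process_stream(articles):
    """Lock impurities article by article and return the filtered bodies.

    articles: iterable of lists of child node keys from one public number.
    """
    locked = set()      # webpage impurities kept as samples
    previous = None     # unlocked information of the most recent article
    bodies = []
    for nodes in articles:
        current = set(nodes)
        if previous is not None:
            # Nodes shared with the previous article become locked impurities.
            locked |= current & previous
        # Discard the older article; keep only the latest unlocked information.
        previous = current - locked
        bodies.append([node for node in nodes if node not in locked])
    return locked, bodies


stream = [
    ["a1", "ad"],
    ["a2", "ad"],
    ["a3", "ad", "footer"],
    ["a4", "ad", "footer"],
]
locked_samples, filtered = process_stream(stream)
```

In this run "ad" is locked when the second article is processed and "footer" when the fourth is processed. The first article is returned unfiltered because no sample exists yet at that point.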
It should be noted that, in this embodiment, no upper limit is set on the number of locked webpage impurities. Once a webpage impurity is locked, it can be used as a webpage impurity sample, and it is not released until its corresponding counter reaches the threshold (that is, until the clearing condition is reached).
The first wechat web page, the second wechat web page and the third wechat web page are only for convenience of description and are not intended to limit the present invention.
The following description will be given taking an example of an extracted text of a wechat public number type web page.
Take the XX public number as an example, where the articles in two of its webpages are shown in fig. 3 and fig. 4 respectively. It can be seen that the article in each of the two webpages ends with promotional advertisement information, and that the advertisement information in the two articles is the same. This advertisement information is webpage impurity information; it belongs to the same node as the paragraphs of the text and stands in a parallel relationship with them. The webpage impurity information in this embodiment includes the text information shown in fig. 5 and the picture impurity information shown in fig. 6. In fact, the articles of the other webpages of the XX public number carry the same advertisement information (that is, webpage impurity information). However, because a WeChat public number is customized by its manager, the structure of the webpage nodes is not fixed and the webpage impurity information is mixed with the text information, so the text information extracted from a fixed text node can include the advertisement information. Since the advertisement information under the same WeChat public number generally does not change within a period of time, the text extraction information of the two webpages can be compared, and the identical child node information can be determined to be webpage impurity information (namely the information shown in figs. 5 and 6) and stored as a sample. When text extraction is performed on other webpages of the XX public number, the text impurity information shown in fig. 5 and the picture impurity information shown in fig. 6 are removed from the extracted text extraction information to obtain more accurate text information.
Another aspect of the invention is described below.
Referring to fig. 7, this figure is a schematic diagram of a composition of a web page text extracting apparatus according to an exemplary implementation, and this embodiment mainly includes:
a web page impurity confirmation processing module 1, in this embodiment, the web page impurity confirmation processing module 1 is mainly configured to compare text extracted information of at least two target web pages, and confirm that node information with the same comparison result in the text extracted information of the at least two target web pages is a web page impurity, where the at least two target web pages belong to the same type of web page;
and a filtering processing module 2, in this embodiment, the filtering processing module 2 is mainly used for filtering the impurity information of the same type of web page according to the web page impurities to obtain the text information of the web page.
It should be noted that, regarding the apparatus in the above embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated herein.
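As a rough illustration of how the two modules could cooperate, the sketch below models each as a small class. It is an assumption-laden outline rather than the apparatus itself: the class and method names are hypothetical, child nodes are simplified to plain strings, and an in-memory dictionary stands in for the per-type database.

```python
from collections import defaultdict


class ImpurityConfirmationModule:
    """Sketch of module 1: compares pages of one type and records shared child nodes."""

    def __init__(self) -> None:
        # Per page type: every child node seen so far (stands in for the database).
        self.seen: dict[str, set[str]] = defaultdict(set)
        # Per page type: child nodes confirmed as impurities.
        self.impurities: dict[str, set[str]] = defaultdict(set)

    def add_page(self, page_type: str, child_nodes: list[str]) -> None:
        """Compare a page's child nodes with the stored ones, then store them."""
        nodes = set(child_nodes)
        # Nodes already stored from an earlier page of the same type are impurities.
        self.impurities[page_type] |= nodes & self.seen[page_type]
        self.seen[page_type] |= nodes


class FilteringModule:
    """Sketch of module 2: removes confirmed impurities from extracted text."""

    def __init__(self, confirmation: ImpurityConfirmationModule) -> None:
        self.confirmation = confirmation

    def filter(self, page_type: str, child_nodes: list[str]) -> list[str]:
        impurities = self.confirmation.impurities[page_type]
        return [n for n in child_nodes if n not in impurities]


# Usage with made-up data: the first page only initializes the store.
module_1 = ImpurityConfirmationModule()
module_1.add_page("XX", ["para A", "ad footer"])
module_1.add_page("XX", ["para B", "ad footer"])
module_2 = FilteringModule(module_1)
result = module_2.filter("XX", ["para C", "ad footer"])
```

Keeping the stored nodes separate per page type means a footer learned from one type of page is never removed from pages of another type.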
FIG. 8 is a block diagram illustrating an apparatus 800 for web page body extraction, according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a tablet device, a personal digital assistant, and the like.
Referring to fig. 8, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation at the device 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front-facing camera and/or the rear-facing camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the apparatus 800. The sensor assembly 814 may also detect a change in position of the apparatus 800 or a component of the apparatus 800, the presence or absence of user contact with the apparatus 800, orientation or acceleration/deceleration of the apparatus 800, and a change in temperature of the apparatus 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast associated information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communications component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer readable storage medium having instructions therein, which when executed by a processor of a mobile terminal, enable the mobile terminal to perform a web page text extraction method, the method comprising:
comparing the text extraction information of at least two target webpages, and confirming that the node information with the same comparison result in the text extraction information of the at least two target webpages is a webpage impurity, wherein the at least two target webpages belong to the same type of webpage;
and filtering impurity information of the same type of webpage according to the webpage impurities to obtain the text information of the webpage.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is only limited by the appended claims.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (11)

1. A webpage text extraction method is characterized by comprising the following steps:
comparing the text extraction information of at least two target webpages, and confirming the node information with the same comparison result in the text extraction information of the at least two target webpages as webpage impurities, which specifically comprises the following steps:
extracting text extraction information of a first target webpage, storing the text extraction information of the first target webpage into a database corresponding to the type of the first target webpage for initialization;
extracting text extraction information of a next target webpage, and comparing each piece of sub-node information with each piece of sub-node information in the text extraction information of the target webpage stored in the database, wherein the next target webpage and the first target webpage belong to the same type of webpage;
confirming that the child node information with the same comparison result is the webpage impurity, and storing the text extraction information of the next target webpage into a database;
returning to the step of extracting the text extraction information of the next target webpage until all the target webpages are traversed;
the at least two target webpages belong to the same type of webpage, and the text extraction information of the at least two target webpages is extracted according to text nodes;
and filtering the extracted text information of the same type of webpage according to the webpage impurities to obtain the text information of the webpage.
2. The method of claim 1, further comprising:
setting corresponding counters for all the child node information stored in the database;
each time, according to the comparison result, confirming that the child node information with the same comparison result is the webpage impurity, and adding one to the counter of the child node information whose comparison result is different; when the value of a certain counter reaches a threshold value, the child node information corresponding to the counter is no longer stored in the database.
3. The method according to claim 1, wherein the child node information comprises text information and/or a picture;
and the comparison between the child node information and the child node information in the text extraction information of the target webpage stored in the database is carried out by comparing the hash code value of the text information of the child node information and/or the picture link information.
4. The method of claim 1, further comprising:
setting a corresponding counter for the webpage impurities;
when the impurity information of the same type of webpage is filtered according to the webpage impurities, if the text extraction information of the same type of webpage contains impurity information which is the same as the webpage impurities, the counter corresponding to the webpage impurities is reset, if the text extraction information of the same type of webpage does not contain impurity information which is the same as the webpage impurities, the counter corresponding to the webpage impurities is increased by one, and when the value of a certain counter reaches a threshold value, the webpage impurities corresponding to the counter are not stored.
5. The method of claim 1, wherein the same type of web page is a web page belonging to the same WeChat public number.
6. A web page text extraction apparatus, comprising:
the web page impurity confirmation processing module is configured to compare the text extraction information of at least two target web pages, and confirm that the node information with the same comparison result in the text extraction information of the at least two target web pages is a web page impurity, and specifically includes:
extracting text extraction information of a first target webpage, storing the text extraction information of the first target webpage into a database corresponding to the type of the first target webpage for initialization;
extracting text extraction information of a next target webpage, and comparing each piece of sub-node information with each piece of sub-node information in the text extraction information of the target webpage stored in the database, wherein the next target webpage and the first target webpage belong to the same type of webpage;
confirming that the child node information with the same comparison result is the webpage impurity, and storing the text extraction information of the next target webpage into a database;
returning to the step of extracting the text extraction information of the next target webpage until all the target webpages are traversed;
the at least two target webpages belong to the same type of webpage, and the text extraction information of the at least two target webpages is extracted according to text nodes;
and the filtering processing module is used for filtering the impurity information of the extracted text information of the same type of webpage according to the webpage impurities to obtain the text information of the webpage.
7. An apparatus for web page text extraction comprising a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured for execution by one or more processors, the one or more programs including instructions for:
extracting text extraction information of a first target webpage, storing the text extraction information of the first target webpage into a database corresponding to the type of the first target webpage for initialization;
extracting text extraction information of a next target webpage, and comparing each piece of sub-node information with each piece of sub-node information in the text extraction information of the target webpage stored in the database, wherein the next target webpage and the first target webpage belong to the same type of webpage;
confirming that the child node information with the same comparison result is the webpage impurity, and storing the text extraction information of the next target webpage into a database;
returning to the step of extracting the text extraction information of the next target webpage until all the target webpages are traversed;
comparing the text extraction information of at least two target webpages, and confirming that the node information with the same comparison result in the text extraction information of the at least two target webpages is a webpage impurity, wherein the at least two target webpages belong to the same type of webpage, and the text extraction information of the at least two target webpages is extracted according to text nodes;
and filtering the extracted text information of the same type of webpage according to the webpage impurities to obtain the text information of the webpage.
8. The apparatus of claim 7, further comprising: the one or more programs configured to be executed by the one or more processors include instructions for:
setting corresponding counters for all the child node information stored in the database;
each time, according to the comparison result, confirming that the child node information with the same comparison result is the webpage impurity, and adding one to the counter of the child node information whose comparison result is different; when the value of a certain counter reaches a threshold value, the child node information corresponding to the counter is no longer stored in the database.
9. The apparatus of claim 7, further comprising: the one or more programs configured to be executed by the one or more processors include instructions for:
the child node information comprises text information and/or pictures;
and the comparison between the child node information and the child node information in the text extraction information of the target webpage stored in the database is carried out by comparing the hash code value of the text information of the child node information and/or the picture link information.
10. The apparatus of claim 7, further comprising: the one or more programs configured to be executed by the one or more processors include instructions for:
setting a corresponding counter for the webpage impurities;
when the impurity information of the same type of webpage is filtered according to the webpage impurities, if the text extraction information of the same type of webpage contains impurity information which is the same as the webpage impurities, the counter corresponding to the webpage impurities is reset, if the text extraction information of the same type of webpage does not contain impurity information which is the same as the webpage impurities, the counter corresponding to the webpage impurities is increased by one, and when the value of a certain counter reaches a threshold value, the webpage impurities corresponding to the counter are not stored.
11. The apparatus of claim 7, wherein the same type of web page is a web page belonging to a same WeChat public number.
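For illustration only, the counter scheme recited in claims 4 and 10 can be sketched as follows. The sketch makes several assumptions: the `ImpurityStore` class and its method names are hypothetical, impurities are simplified to plain strings, and the threshold of 2 is chosen arbitrarily because the claims do not fix a value.

```python
class ImpurityStore:
    """Stored impurity samples, each with a miss counter (hypothetical helper)."""

    def __init__(self, threshold: int) -> None:
        self.threshold = threshold
        self.miss_count: dict[str, int] = {}

    def add(self, impurity: str) -> None:
        """Register a confirmed impurity with its counter at zero."""
        self.miss_count.setdefault(impurity, 0)

    def filter(self, child_nodes: list[str]) -> list[str]:
        """Remove stored impurities from a page and update their counters."""
        present = set(child_nodes)
        for impurity in list(self.miss_count):
            if impurity in present:
                self.miss_count[impurity] = 0      # seen again: reset the counter
            else:
                self.miss_count[impurity] += 1     # not on this page: count the miss
                if self.miss_count[impurity] >= self.threshold:
                    del self.miss_count[impurity]  # stale: stop storing it
        return [n for n in child_nodes if n not in self.miss_count]


# Usage with made-up data and an arbitrary threshold.
store = ImpurityStore(threshold=2)
store.add("old ad")
first = store.filter(["para", "old ad"])  # impurity found: removed, counter reset
store.filter(["para"])                    # first miss
store.filter(["para"])                    # second miss: sample is dropped
later = store.filter(["para", "old ad"])  # no longer treated as an impurity
```

Expiring samples in this way lets the stored impurities follow an account whose advertisement changes over time, instead of accumulating outdated entries.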
CN201510897907.7A 2015-12-08 2015-12-08 Webpage text extraction method and device Active CN106855859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510897907.7A CN106855859B (en) 2015-12-08 2015-12-08 Webpage text extraction method and device

Publications (2)

Publication Number Publication Date
CN106855859A CN106855859A (en) 2017-06-16
CN106855859B true CN106855859B (en) 2020-11-10

Family

ID=59132795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510897907.7A Active CN106855859B (en) 2015-12-08 2015-12-08 Webpage text extraction method and device

Country Status (1)

Country Link
CN (1) CN106855859B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109740101A (en) * 2019-01-18 2019-05-10 杭州凡闻科技有限公司 Data configuration method, public platform article cleaning method, apparatus and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1786947A (en) * 2004-12-07 2006-06-14 国际商业机器公司 System, method and program for extracting web page core content based on web page layout
CN101872350A (en) * 2009-04-24 2010-10-27 富士通株式会社 Web page text extracting method and device thereof
CN102810097A (en) * 2011-06-02 2012-12-05 高德软件有限公司 Method and device for extracting webpage text content

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2019361A1 (en) * 2007-07-26 2009-01-28 Siemens Aktiengesellschaft A method and apparatus for extraction of textual content from hypertext web documents
US7765236B2 (en) * 2007-08-31 2010-07-27 Microsoft Corporation Extracting data content items using template matching
CN101826099B (en) * 2010-02-04 2012-09-05 蓝盾信息安全技术股份有限公司 Method and system for identifying similar documents and determining document diffusance
CN102479181B (en) * 2010-11-22 2015-10-07 中国电信股份有限公司 Based on Web page text extracting method and the device of DIV position
CN102541874B (en) * 2010-12-16 2013-11-06 中国移动通信集团公司 Webpage text content extracting method and device
CN102314513B (en) * 2011-09-16 2013-01-02 华中科技大学 Image text semantic extraction method based on GPU (Graphics Processing Unit)
CN102663041B (en) * 2012-03-28 2014-01-01 重庆大学 Automatic extraction method oriented to data of deep web pages
CN103020266B (en) * 2012-12-25 2016-06-29 北京奇虎科技有限公司 The method and apparatus that webpage text content is extracted
CN103955529B (en) * 2014-05-12 2018-05-01 中国科学院计算机网络信息中心 A kind of internet information search polymerize rendering method
CN104376061B (en) * 2014-11-10 2018-01-19 武汉传神信息技术有限公司 A kind of method for extracting Web page text
CN105022803B (en) * 2015-07-01 2018-05-15 广州市万隆证券咨询顾问有限公司 A kind of method and system for extracting Web page text content

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
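Research on Web Page Body Content Extraction Based on Layout Similarity; Yang Liuqing et al.; Application Research of Computers (《计算机应用研究》); 20150930; Vol. 32, No. 9; pp. 2581-2586 *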

Also Published As

Publication number Publication date
CN106855859A (en) 2017-06-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant