CN107590219A - Webpage personage subject correlation message extracting method - Google Patents

Publication number
CN107590219A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710783655.4A
Other languages
Chinese (zh)
Inventor
费高雷
周成阳
胡光岷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201710783655.4A
Publication of CN107590219A

Abstract

The invention discloses a method for extracting person-related topic information from web pages. The method comprises obtaining an HTML web page document, building the DOM tree corresponding to the HTML document, preprocessing the HTML document, calculating the text node rate of each DOM node and dividing the HTML document into text blocks, screening for the main-text block of the page, extracting person information, and judging whether person information is present. The invention effectively solves the problem of extracting person-related information from web pages of all kinds and can obtain complete, structured person information.

Description

Webpage personage subject correlation message extracting method
Technical Field
The invention belongs to the technical field of web information extraction, and in particular relates to a method for extracting person-related topic information from web pages.
Background
With the rapid development of Internet technology, the number of web pages is growing explosively. By content, these pages fall broadly into display pages, content pages, e-commerce pages, portal pages, and similar types. Extracting the key useful information from such a wide variety of websites efficiently and accurately has become very important, and it is also a considerable challenge. Web page content is extremely rich: alongside the topic information the user wants to browse, a page also carries advertisement banners, navigation bars, product recommendations, links, website copyright notices, and the like. Users usually do not want to see this noise, and it also interferes heavily with big-data processing tasks on the target pages, such as web page classification and clustering, topic detection, and knowledge mining. How to remove this noise, and how to extract the topic information of a page as completely as possible, has therefore become an important data preprocessing task for web information retrieval and even web information mining.
Extracting topic information from web pages belongs to web information extraction, which means extracting data from semi-structured web documents and converting it into a more structured, semantically clearer representation. Work on web page topic information extraction falls mainly into three categories. First, template-matching methods. These rely on the template shared by the pages of a website: the template of the site is identified and then matched against a page to identify the page's topic information. Second, heuristic-rule methods. This category covers many different approaches, which can be roughly subdivided into those that build heuristic rules from HTML structure features, from HTML content features, and from HTML visual features. Third, machine-learning methods. These are mainly suited to processing large-scale web data sets: a topic information classification model is first trained on manually labeled web data, and the classifier is then used to identify topic and non-topic information in a page.
HTML (HyperText Markup Language) is the basic language for authoring web pages. "Hypertext" means that a page can contain non-text elements such as pictures, links, and even music and programs. A web page is also called an HTML document; combined with other web technologies (such as scripts, CGI, and components), HTML can be used to create powerful web pages. HTML documents use .htm or .html as the file extension and contain HTML tags and plain text. Every standard HTML document has a basic overall structure. An HTML tag is a keyword surrounded by angle brackets, such as <html>. Tags generally occur in pairs, such as <b> and </b>; the first tag of the pair is the start tag and the second is the end tag. The web page we normally see in a browser is what the browser displays after parsing the HTML document: the browser does not display the HTML tags themselves but uses them to interpret the page content.
When extracting topic information from a web page, the raw data obtained is the HTML document itself, so a clear understanding of HTML syntax is needed. The important information that can be extracted from a page usually comes from inside the <head> and <body> tags.
The Document Object Model (DOM) is the standard programming interface recommended by the W3C for processing extensible markup languages. On a web page, the objects that make up the page (or document) are organized in a tree structure, and the standard model used to represent the objects in a document is called the DOM. The DOM provides an interface for accessing the properties and methods of every element in the page. Each web page corresponds to one DOM tree, and every element in the page can be processed by traversing that tree. Each node of the tree is an object. The DOM model not only describes the structure of the document but also defines the behavior of node objects; using the methods and properties of these objects, the nodes and content of the DOM tree can be conveniently accessed, modified, added, and deleted.
The core idea of template-based web page topic extraction is to regard content that recurs across pages as a template and to treat that content as noise. The page to be processed is then matched against the template generated from a training set of pages, and the content in the matching result that is not part of the template is taken as the page topic.
Bar-Yossef et al. define a pagelet as a single region of a page with its own distinct layout and style. Based on the DOM tree, they use pagelets to segment pages and detect templates: the navigation bar, the core content block, advertisements, and so on are each a pagelet, and page denoising is achieved by deleting the noise data that exists in the form of page templates. This was one of the earlier attempts at web page denoising. Shian-Hua Lin et al. proposed a method using the <TABLE> tag and information entropy: the page is divided using <TABLE>, the information entropy of each page block is calculated to separate content blocks from noise blocks, and the noise is finally removed by comparing entropy values. One weakness of this method is that it relies too heavily on the <TABLE> tag, and as HTML has developed, web design has gradually abandoned <TABLE>-based layout. Lochovsky et al. proposed the DSE (Data-rich Section Extraction) algorithm: for pages from the same website, the DOM trees of pages sharing a template are matched top-down, overlapping or identical structures in the matching result are regarded as non-topic information, and the content of the leaf nodes is regarded as topic content and extracted. Liu et al. proposed the Site Style Tree (SST), whose basic idea is similar to DSE. A page-level style tree is first built for a website; a composite importance is then calculated for each node in the style tree from its content features and visual features; finally, noise nodes and topic information nodes are identified by their importance. This method was ultimately applied to web page classification and clustering tasks. Experimental results show that it achieves good extraction quality, but its drawback is that a separate style tree must be built for each website. Gupta et al. proposed using machine learning to identify web page templates automatically and developed a web proxy tool, Content Extraction, which filters the noise in web pages and allows the granularity of the filtered content to be controlled by adjusting a rule set. Ou Jianwen et al. proposed generating templates with a machine-learning-based regression algorithm: by detecting the relationships between links and identifying the features of anchor text, a page template and extraction rules are established, and the template is then applied to extract the body content. Chen et al. proposed integrating template detection into the index-building process of a search engine: pages are first divided into blocks, the blocks are then clustered using visual information such as style, and blocks that show similar layout behavior across different pages are judged to be the page template.
Because template-based topic information extraction methods require either that the processed pages have a particular structure or that the structural features of the target pages be learned in advance, some scholars have also proposed topic information extraction methods that require no such preparation or learning. For example, Weninger et al. proposed a content extraction method based on the text-to-tag ratio (TR). The ratio of the number of non-tag characters to the number of tags is computed for each line of the HTML document to build a TR distribution histogram for the whole document; a threshold segmentation technique then determines the optimal threshold separating the topic part of the page from the non-topic part, and the topic information is extracted accordingly.
However, when the text in the HTML is sparsely distributed, or when the HTML header or footer contains a large amount of text, the main text determined by this method is often wrong.
At present, template-based topic information extraction mainly targets the large number of pages on the Internet that are generated automatically by reading data from a database and filling it into a uniform template. Such pages generally come from the same website and have similar HTML structure. These methods generally treat the text inside <div> tags as the main part of the page, generate a page template by learning the HTML tag structure of these pages, and manually mark some tags that carry topic information; when a page with a similar template is input, the topic information is extracted according to these marked tags. Clearly, when pages come from many different websites, their templates differ, so the extracted template is not generally applicable. In addition, because the structural information of modern pages is increasingly obtained from CSS (Cascading Style Sheets) rather than from <div> tags, these methods fail.
Therefore, how to extract web page topic information accurately and efficiently from pages of many different kinds has become a focus and a difficulty of current research.
Summary of the Invention
The object of the invention is to solve the above problems in the prior art. In view of the complexity and variety of web page structures, the invention proposes a template-free method for extracting person-related topic information from web pages.
The technical solution of the invention is a method for extracting person-related topic information from web pages, comprising the following steps:
A. obtaining an HTML web page document that contains person-related topic information;
B. building the DOM tree corresponding to the HTML web page document of step A;
C. preprocessing the HTML web page document of step A;
D. calculating the text node rate of each DOM node according to the DOM tree of step B, and dividing the HTML web page document into text blocks;
E. screening for the main-text block of the page according to the text node rates and text blocks obtained in step D;
F. extracting person information from the main-text block obtained in step E;
G. judging whether the information extracted in step F contains person information; if so, structuring the person information extracted in step F; if not, returning to step A.
Further, preprocessing the HTML web page document of step A in step C specifically means deleting a set of ignorable tags from the HTML web page document. The ignorable tag set includes the <script>, <style>, <br>, <select>, <input>, <label>, <comment>, and <nav> tags.
Further, calculating the text node rate of each DOM node according to the DOM tree of step B in step D specifically means taking the <body> tag as the starting root node and recursively calculating the text node rate of each DOM node in the DOM tree.
Further, the text node rate of each DOM node is calculated as

$$\mathrm{CNR}(n) = \frac{\mathrm{CountText}(n)}{\mathrm{CountNode}(n)}$$

where CNR(n) is the text node rate of node n, CountText(n) is the number of all text characters under node n, and CountNode(n) is the number of all DOM nodes under node n.
Further, dividing the HTML web page document into text blocks in step D specifically means taking the first-level child nodes under the <body> tag as the parent nodes for aggregation, deleting the nodes under each parent node whose text node rate is 0, and aggregating the nodes whose text node rate is not 0 into the node with the larger text node rate.
Further, screening for the main-text block in step E according to the text node rates and text blocks obtained in step D specifically means selecting, according to the text node rate of the node to which each text block belongs, the text block with the largest text node rate and the most text characters as the main-text block of the page.
Further, extracting person information from the main-text block obtained in step E in step F specifically means tokenizing the main-text block, normalizing the resulting words, then extracting the keywords that reflect person-related topic information, labeling them by class, and calculating the weight of each class of label.
Further, structuring the person information extracted in step F in step G specifically means structuring unstructured person information and structuring semi-structured person information.
Further, structuring unstructured person information specifically means first splitting the unstructured person information into sentences, then performing part-of-speech tagging and syntactic analysis on each sentence, determining the subject, predicate, and object relations in the sentence, and extracting the noun phrase in the object to form structured person information together with the subject.
Further, structuring semi-structured person information specifically means using a method of matching rules triggered by a person-attribute dictionary to extract the person information from the semi-structured person information and form structured person information.
The beneficial effects of the invention are as follows. The invention parses the HTML document into a DOM tree using DOM technology and preprocesses the HTML document with the API provided by the DOM tree. It then extracts the text information from the page and applies natural language processing techniques such as tokenization, part-of-speech tagging, and named entity recognition to the extracted main text in order to extract person-related information and judge whether it is present. Finally, by formulating rules related to person attributes, it converts semi-structured or unstructured person information into structured person information. The invention thus effectively solves the problem of extracting person-related information from web pages of all kinds and can obtain complete, structured person information.
Brief Description of the Drawings
Fig. 1 is a flow chart of the method for extracting person-related topic information from web pages according to the invention.
Fig. 2 is a schematic diagram of DOM tree construction for an HTML web page document in an embodiment of the invention.
Fig. 3 is a state transition diagram of tokenization in an embodiment of the invention.
Fig. 4 is a state transition diagram of tree construction in an embodiment of the invention.
Fig. 5 is a schematic diagram of the DOM tree structure of node n in an embodiment of the invention.
Fig. 6 is a schematic diagram of the distribution of the DOM tree nodes in an embodiment of the invention.
Fig. 7 is a schematic diagram of the DOM tree after text blocking in an embodiment of the invention.
Fig. 8 is a schematic diagram of unstructured person information extraction in an embodiment of the invention.
Detailed Description
To make the object, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.
Fig. 1 shows the flow chart of the method for extracting person-related topic information from web pages according to the invention. The method comprises the following steps:
A. obtaining an HTML web page document that contains person-related topic information;
B. building the DOM tree corresponding to the HTML web page document of step A;
C. preprocessing the HTML web page document of step A;
D. calculating the text node rate of each DOM node according to the DOM tree of step B, and dividing the HTML web page document into text blocks;
E. screening for the main-text block of the page according to the text node rates and text blocks obtained in step D;
F. extracting person information from the main-text block obtained in step E;
G. judging whether the information extracted in step F contains person information; if so, structuring the person information extracted in step F; if not, returning to step A.
The invention is mainly directed at extracting person information from English web pages, and the dictionaries it uses are for English text.
In step B, the invention builds the DOM tree corresponding to the HTML web page document of step A, that is, it converts the HTML document into a DOM tree structure. The input HTML document is first parsed to generate the DOM tree. The DOM (Document Object Model) is a W3C standard and a set of API interfaces for browser-based programming. The invention uses the HTML DOM, which defines the objects and properties of all HTML elements and the methods for accessing them. Parsing an HTML document into an HTML DOM tree is carried out by two algorithms: tokenization (Tokeniser) and tree construction (Tree Construction). The tokenization algorithm is a lexical analysis process that parses the input into a sequence of tokens. HTML tokens include start tags, end tags, attribute names, and attribute values. The tokenizer recognizes a token, passes it to the tree constructor, and then consumes the next character to recognize the next token, continuing until the end of the input. Fig. 2 is a schematic diagram of DOM tree construction for an HTML web page document in an embodiment of the invention.
The output of the tokenization algorithm used by the tokenizer is a sequence of HTML tokens, and the algorithm is expressed as a state machine. The state machine has four states in total: the data state (Data), the tag open state (Tag open), the tag name state (Tag name), and the close tag open state (Close tag open state). Fig. 3 is the state transition diagram of tokenization in an embodiment of the invention.
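The four-state tokenizer described above can be illustrated with a minimal sketch. It is an assumption-laden simplification, not the tokenizer of the invention or of any browser: the function name `tokenize` and the token tuples are hypothetical, and attributes, comments, and malformed input are not handled.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()            # reading character data between tags
    TAG_OPEN = auto()        # just consumed "<"
    TAG_NAME = auto()        # reading the tag name up to ">"
    CLOSE_TAG_OPEN = auto()  # just consumed "</"


def tokenize(html):
    """Return a list of ('start', name), ('end', name) and ('text', data) tokens."""
    tokens = []
    state = State.DATA
    buf = ""
    is_end = False
    for ch in html:
        if state is State.DATA:
            if ch == "<":
                if buf:
                    tokens.append(("text", buf))
                buf = ""
                state = State.TAG_OPEN
            else:
                buf += ch
        elif state is State.TAG_OPEN:
            if ch == "/":
                is_end = True
                state = State.CLOSE_TAG_OPEN
            else:
                is_end = False
                buf = ch
                state = State.TAG_NAME
        elif state is State.CLOSE_TAG_OPEN:
            buf = ch
            state = State.TAG_NAME
        elif state is State.TAG_NAME:
            if ch == ">":
                tokens.append(("end" if is_end else "start", buf.lower()))
                buf = ""
                state = State.DATA
            else:
                buf += ch
    # Flush trailing character data, if any.
    if state is State.DATA and buf:
        tokens.append(("text", buf))
    return tokens


print(tokenize("<p>Hi</p>"))
```

Each emitted token would then be handed to the tree constructor; a production parser needs many more states than the four named in the text.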
The DOM tree is built while the tokenization algorithm is running. During the tree construction stage, the DOM tree rooted at the Document node is continually modified and elements are added to it. Every token emitted by the tokenizer is processed by the tree constructor. Each token has a corresponding DOM element, which is created when the token is received. Elements in the DOM tree are also pushed onto a stack of open elements, which is used to correct nesting errors and to handle unclosed tags. The tree construction algorithm can likewise be described as a state machine, which performs the corresponding state transitions and creates DOM nodes as it receives HTML tokens. Fig. 4 is the state transition diagram of tree construction in an embodiment of the invention.
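As a rough illustration of tree construction with a stack of open elements, the sketch below uses Python's standard `html.parser.HTMLParser` as the tokenizer. The `Node` and `TreeBuilder` classes, the `VOID` set, and the recovery rule (pop back to the nearest matching open element, ignore stray end tags) are assumptions made for this example and are far simpler than a real HTML tree builder.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, text=""):
        self.tag = tag
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    # Elements that never have an end tag, so they are not pushed on the stack.
    VOID = {"br", "img", "input", "meta", "link", "hr"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.open_elements = [self.root]  # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.open_elements[-1].children.append(node)
        if tag not in self.VOID:
            self.open_elements.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element; this implicitly
        # closes unclosed children. A stray end tag is ignored.
        for i in range(len(self.open_elements) - 1, 0, -1):
            if self.open_elements[i].tag == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.open_elements[-1].children.append(Node("#text", data.strip()))


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# The <p> below is never closed; </div> closes it implicitly.
root = build_tree("<html><body><div><p>Hello</div>World!</body></html>")
body = root.children[0].children[0]
print([child.tag for child in body.children])
```

In this example the unclosed `<p>` stays inside the `<div>`, and the text after `</div>` is attached directly to `<body>`.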
In step C, the invention takes into account that some tags in the HTML document of a web page can be ignored for the purpose of text extraction, since they generally contain no text information. The API provided by the DOM is therefore used to preprocess the HTML web page document of step A and remove these ignorable tags in advance.
Specifically, preprocessing the HTML web page document of step A means deleting a set of ignorable tags from the document. The ignorable tag set includes the <script>, <style>, <br>, <select>, <input>, <label>, <comment>, and <nav> tags. Some of these ignorable tags are explained below:
(1) the <script> tag defines a client-side script, such as JavaScript;
(2) the <style> tag defines style information for the HTML document;
(3) the <select> tag creates a single-choice or multiple-choice menu;
(4) the <input> tag provides a field into which data can be entered;
(5) the <label> tag defines a label for an input element;
(6) the <nav> tag defines a section of navigation links.
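The preprocessing step can be sketched as a recursive prune over a DOM-like tree. The `Node` class and the `prune` function are hypothetical stand-ins for whatever DOM API is actually used; only the tag set comes from the text.

```python
# Tag names taken from the ignorable tag set listed in the text.
IGNORABLE = {"script", "style", "br", "select", "input", "label", "comment", "nav"}


class Node:
    def __init__(self, tag, children=(), text=""):
        self.tag = tag
        self.children = list(children)
        self.text = text


def prune(node):
    """Remove every ignorable element (and its subtree) below node, in place."""
    node.children = [prune(child) for child in node.children
                     if child.tag not in IGNORABLE]
    return node


body = Node("body", [
    Node("script", [Node("#text", text="var x = 1;")]),
    Node("nav", [Node("a", [Node("#text", text="Home")])]),
    Node("div", [
        Node("p", [Node("#text", text="Alice Smith is a physicist.")]),
        Node("br"),
    ]),
])
prune(body)
print([child.tag for child in body.children])
```

Removing a tag also removes its whole subtree, so the script source and the navigation link text never reach the later text-blocking stage.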
In step D, the invention calculates the text node rate of each DOM node according to the DOM tree of step B and divides the HTML web page document into text blocks; that is, it uses the structural features of the DOM tree to mine the text in the page and divide it into blocks. After the HTML document has been preprocessed, what remains in the HTML is essentially container-type tags that may contain text, such as <div>, <table>, and <li>. To extract the text that may be present in these tags, this patent proposes a new method based on the DOM text node rate (CNR), which at the same time achieves text blocking (whose purpose is to place the main text, advertising information, navigation bar information, footer copyright information, and other text of the page into separate blocks).
The text node rate of a DOM node is abbreviated CNR (chars nodes ratio). It is calculated as the ratio of the total number of text characters contained under the node to the total number of nodes under the node:

$$\mathrm{CNR}(n) = \frac{\mathrm{CountText}(n)}{\mathrm{CountNode}(n)}$$

where CNR(n) is the text node rate of node n, CountText(n) is the number of all text characters under node n, and CountNode(n) is the number of all DOM nodes under node n.
Fig. 5 is a schematic diagram of the DOM tree structure of node n in an embodiment of the invention. The CNR values of node n and its subtrees are calculated as follows.
Node n contains "Hello" and "World!", 11 text characters in total, and 4 nodes (n, n1, t1, t2), so

$$\mathrm{CNR}(n) = \frac{11}{4} = 2.75$$

Node n1 contains "Hello", 5 characters in total, and 2 nodes (n1, t1), so

$$\mathrm{CNR}(n_1) = \frac{5}{2} = 2.5$$

The CNR values of the remaining nodes are obtained in the same way.
For a node n that contains text, the CNR value has two important properties:
(1) if n contains only one child node nc, then CNR(n) < CNR(nc);
(2) if n contains multiple child nodes n1, n2, n3, ..., nk and these child nodes contain text information, then CNR(n) > CNR(ni), i = 1, ..., k.
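The CNR definition and the Fig. 5 example can be checked with a few lines of code. This is a minimal sketch: the `Node` class is hypothetical, and the tree shape (n has children n1 and t2, n1 has child t1) is an assumption inferred from the node counts quoted in the text, since the figure itself is not reproduced here.

```python
class Node:
    def __init__(self, tag, children=(), text=""):
        self.tag = tag
        self.children = list(children)
        self.text = text


def count_text(node):
    """CountText(n): number of text characters in the subtree rooted at node."""
    return len(node.text) + sum(count_text(c) for c in node.children)


def count_node(node):
    """CountNode(n): number of nodes in the subtree, including node itself."""
    return 1 + sum(count_node(c) for c in node.children)


def cnr(node):
    """Text node rate CNR(n) = CountText(n) / CountNode(n)."""
    return count_text(node) / count_node(node)


# Assumed shape of the Fig. 5 example.
t1 = Node("#text", text="Hello")
t2 = Node("#text", text="World!")
n1 = Node("span", [t1])
n = Node("div", [n1, t2])

print(cnr(n))   # 2.75  (11 characters / 4 nodes)
print(cnr(n1))  # 2.5   (5 characters / 2 nodes)
```

Because each node counts itself, the denominator is never zero, and a subtree with no text at all has a CNR of exactly 0.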
Because every HTML document has a DOM tree structure, and the text of a web page is located inside the <body> tag (the <head> tag defines document header information and provides basic page information to search engines, but is not displayed in the browser), the <body> tag is taken as the starting point of the algorithm when the CNR values are calculated. Specifically, calculating the text node rate of each DOM node according to the DOM tree means taking the <body> tag as the starting root node and recursively calculating the text node rate of each DOM node in the DOM tree.
Once the CNR value of every node has been obtained, the text of the text nodes that belong to the same parent node can be aggregated according to the CNR properties above. The specific rule is to take the first-level child nodes under the <body> tag as the parent nodes for aggregation, delete the nodes under these parents whose CNR equals 0, and aggregate the nodes whose CNR is not 0 into the node with the larger CNR. This completes the text blocking of the different regions of the page. Fig. 6 is a schematic diagram of the distribution of the DOM tree nodes in an embodiment of the invention; circular nodes represent HTML container-type tag nodes, and square nodes represent text segments in the HTML. Fig. 7 is a schematic diagram of the DOM tree after text blocking in an embodiment of the invention.
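A simplified sketch of the blocking rule follows. It assumes a hypothetical `Node` dataclass and collapses the aggregation step: each first-level child of `<body>` with a non-zero ratio becomes one block holding all text beneath it, and children with a ratio of 0 are dropped. The finer rule of merging into the node with the larger ratio is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    text: str = ""
    children: list = field(default_factory=list)


def count_text(node):
    return len(node.text) + sum(count_text(c) for c in node.children)


def count_node(node):
    return 1 + sum(count_node(c) for c in node.children)


def cnr(node):
    return count_text(node) / count_node(node)


def collect_text(node):
    """All text fragments under node, in document order."""
    parts = [node.text] if node.text else []
    for child in node.children:
        parts.extend(collect_text(child))
    return parts


def text_blocks(body):
    """One block per first-level child of <body> whose text node rate is non-zero."""
    blocks = []
    for child in body.children:
        ratio = cnr(child)
        if ratio == 0:
            continue  # no text anywhere below this child: drop it
        blocks.append({"tag": child.tag, "cnr": ratio,
                       "text": " ".join(collect_text(child))})
    return blocks


body = Node("body", children=[
    Node("div", children=[Node("a", children=[Node("#text", "Home")])]),
    Node("div", children=[Node("img")]),
    Node("div", children=[Node("p", children=[Node("#text", "Alice Smith is a physicist.")])]),
])
for block in text_blocks(body):
    print(block["tag"], round(block["cnr"], 2), block["text"])
```

The image-only `<div>` disappears, while the link text and the paragraph end up in separate blocks.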
In step E, the invention screens for the main-text block of the page according to the text node rates and text blocks obtained in step D; that is, it further filters the main-text block out of the text blocks into which the page has been divided. After text blocking, the text in the page is divided into four main blocks: block 1, block 2, block 3, and block 4. These blocks belong to the <div>1, <div>2, <div>3, and <div>4 nodes respectively. The main-text block is identified by the CNR values of these nodes: the block with the largest CNR value and the most text characters is selected as the main-text block. The text in block 1 and block 2 is mostly content from the page's navigation bar and search bar, and the text in block 4 is mostly footer text such as copyright information and the website's contact address. According to this rule, the main-text block of the page is determined to be block 3. With this method, the main text of a web page can be extracted well.
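The selection rule can be sketched as a single `max` over the blocks. The text does not say how to resolve a conflict between "largest CNR" and "most text characters", so this sketch assumes the ratio is the primary criterion and the character count is only a tie-breaker; the block dictionaries and their CNR values are made up for illustration.

```python
def select_main_block(blocks):
    """Pick the block with the largest CNR; ties are broken by text length.

    Treating CNR as the primary key is an assumption of this sketch.
    """
    if not blocks:
        return None
    return max(blocks, key=lambda b: (b["cnr"], len(b["text"])))


# Hypothetical blocks standing in for block 1 to block 4 of the example.
blocks = [
    {"name": "block 1", "cnr": 1.3, "text": "Home News Contact"},
    {"name": "block 2", "cnr": 0.8, "text": "Search"},
    {"name": "block 3", "cnr": 9.5, "text": "Alice Smith was born in 1970 and works as a physicist."},
    {"name": "block 4", "cnr": 2.1, "text": "Copyright 2017. Contact us."},
]
print(select_main_block(blocks)["name"])  # block 3
```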
In step F, the invention extracts person information from the main-text block obtained in step E; that is, it extracts person-related information from the acquired main text and judges whether person information is present in the page. If it is, the unstructured or semi-structured person information is converted into structured person information. The main-text block obtained in step E is tokenized, the resulting words are normalized, and the keywords that reflect person-related topic information are then extracted and labeled by class, and the weight of each class of label is calculated.
(1) Tokenization and normalization
Tokenization and normalization means splitting the main text of the page into words and normalizing those words. Tokenization uses a rule-based method: Python regular expressions split the text into words and individual punctuation marks. The words are then normalized, for example by unifying letter case and removing some punctuation marks.
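A minimal sketch of rule-based tokenization and normalization with Python's `re` module is shown below. The exact regular expression and the set of punctuation marks to drop are not given in the text, so `TOKEN_RE` and `DROP` are assumptions chosen for illustration.

```python
import re

# A word is a run of letters/digits, optionally followed by an apostrophe
# suffix ("Smith's"); any other non-space character is its own token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?|[^\sA-Za-z0-9]")

# Punctuation removed during normalization (an assumed, partial list).
DROP = frozenset(",;:\"()[]")


def tokenize(text):
    """Split text into words and individual punctuation marks."""
    return TOKEN_RE.findall(text)


def normalize(tokens):
    """Lower-case every token and drop the punctuation marks in DROP."""
    return [tok.lower() for tok in tokens if tok not in DROP]


tokens = tokenize("Dr. Smith's lab, founded 1999.")
print(tokens)
print(normalize(tokens))
```

Lower-casing is convenient for dictionary lookups, but a case-sensitive named entity recognizer would normally be run on the original tokens.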
(2) Keyword matching and weight calculation
Keyword matching and weight calculation means extracting from the text the keywords that can reflect person information and calculating the associated weights. After the text has been tokenized, named entity recognition is applied to it, using the named entity recognizer developed by the Stanford NLP Group to label the text. The recognizer is trained with a four-class model: person (Person), place name (Location), organization (Organization), and other (Misc). Because this patent focuses on extracting person information from web pages, these four classes of label are given different weights: the person label has the largest weight, the organization label the second largest, the place-name label the third, and the other label the smallest.
In step G, the invention calculates the sum of the weights of the matched keywords in the text block and compares it with a keyword threshold obtained in advance through experimental statistics, in order to judge whether the text block contains person information. If it does, the person information is further structured; if not, the process exits or a new HTML document is input for processing.
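The weighting and threshold test can be sketched as follows. The input is assumed to be a list of (token, label) pairs already produced by some named entity recognizer; no NER library is called here. The numeric weights and the threshold are hypothetical: the text only fixes their ordering (person > organization > location > other) and says the threshold is obtained experimentally.

```python
# Hypothetical weights; only the ordering comes from the text.
WEIGHTS = {"PERSON": 4.0, "ORGANIZATION": 3.0, "LOCATION": 2.0, "MISC": 1.0}


def person_score(tagged_tokens):
    """Sum the label weights over (token, label) pairs; unknown labels count 0."""
    return sum(WEIGHTS.get(label, 0.0) for _, label in tagged_tokens)


def contains_person_info(tagged_tokens, threshold=8.0):
    """Compare the weighted score with a threshold (8.0 is an assumed value)."""
    return person_score(tagged_tokens) >= threshold


tagged = [
    ("Alice", "PERSON"), ("works", "O"), ("at", "O"),
    ("UESTC", "ORGANIZATION"), ("in", "O"), ("Chengdu", "LOCATION"),
]
print(person_score(tagged))          # 9.0
print(contains_person_info(tagged))  # True
```

In practice the score would probably need to be normalized by text length, otherwise long pages pass the threshold more easily; the text does not specify this.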
Structuring people information refers to after it is determined that some webpage contains people information, is further extracted from the webpage Personage's relevant information and the process that these information are expressed as to structuring people information.People information on webpage is mostly with non-knot Structure or semi-structured form occur.Due to having larger difference, therefore this between unstructured information and semi-structured information Invention proposes two kinds of Different Strategies to be handled for both of these case.
For the extraction of semi-structured people information, because this class text character attribute has had portion with personage's relevant information Relation corresponding to point, this patent employs a kind of method based on character attribute dictionary triggering matched rule, while utilizes name The result of entity mark is extracted and structuring to the people information in webpage.If table 1 below is that part attribute word corresponds to Extracting rule example under different synonyms.
Table 1. Comparison of selected person attribute words and extraction rules
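Since the contents of Table 1 are not reproduced here, the following is only a rough sketch of how a dictionary of attribute synonyms might trigger an extraction rule. The attribute names, the synonyms, and the single "key: value" rule are hypothetical, and the sketch leaves out the named entity annotations that the method also uses.

```python
import re
from typing import Dict, List

# Hypothetical attribute dictionary: canonical attribute -> trigger synonyms.
ATTRIBUTE_SYNONYMS: Dict[str, List[str]] = {
    "name": ["name", "full name"],
    "birth_date": ["born", "date of birth", "birthday"],
    "employer": ["employer", "works at", "affiliation"],
}

# Reverse index so that any synonym maps to its canonical attribute.
TRIGGERS: Dict[str, str] = {
    synonym.lower(): attribute
    for attribute, synonyms in ATTRIBUTE_SYNONYMS.items()
    for synonym in synonyms
}

# One illustrative rule: "key: value" lines, with an ASCII or full-width colon.
FIELD = re.compile(r"^\s*([^:：]+?)\s*[:：]\s*(.+?)\s*$")


def extract_semi_structured(text: str) -> Dict[str, str]:
    """Return attribute -> value for lines whose key is a known trigger word."""
    record: Dict[str, str] = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        attribute = TRIGGERS.get(match.group(1).lower())
        if attribute and attribute not in record:
            record[attribute] = match.group(2)
    return record


sample = "Name: Li Ming\nDate of birth: 1980-05-01\nHobby: chess\nAffiliation：UESTC"
profile = extract_semi_structured(sample)
```

Lines whose key is not in the dictionary (here "Hobby") are ignored, which is how the dictionary acts as the trigger for extraction.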
For the extraction of unstructured person information, the person information is embedded in a long passage of text, so the rules established for semi-structured person information cannot be used on their own. This patent adopts an extraction algorithm for unstructured text, which applies a series of processing steps to extract the person information. First, the long text is split into sentences, using the full stops in the text as sentence boundaries. Then part-of-speech tagging and syntactic analysis are performed on each sentence to determine its subject, predicate, and object. The algorithm then focuses on sentences whose subject is a synonym of a person-related attribute and whose object is a noun phrase. The predicate structure between subject and object, such as a verb, is checked to determine whether it is subordinate to the subject word; if so, the noun phrase in the object is extracted. Figure 8 is a schematic diagram of unstructured person information extraction in an embodiment of the present invention.
The present invention first parses the web page to generate the corresponding DOM tree structure, and then filters the body text out of the web page by calculating the text node rate of the tags. Because noise such as navigation bars, advertisements, and website copyright information has been removed from this body text, it can serve as a high-quality data source for various kinds of text data mining. By analyzing the body text and performing matching and weight calculation on keywords related to person attributes, the invention can better determine whether the web page contains person information. For body text that does contain person information, two different processing strategies are used for unstructured and semi-structured text, so that person information is extracted more effectively and converted into structured person information.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the present invention, and that the scope of protection of the invention is not limited to these particular statements and embodiments. Based on the technical teachings disclosed herein, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the invention, and such variations and combinations remain within the scope of protection of the invention.
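The flow above (split into sentences, find sentences whose subject is an attribute synonym, then extract the object noun phrase) can be illustrated with a deliberately simplified sketch. It does not perform part-of-speech tagging or syntactic analysis: a regular expression of the form "attribute synonym + linking verb + remainder" stands in for the subject/predicate/object analysis, and the attribute list, verbs, and function names are hypothetical.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical attribute synonyms that may appear as the sentence subject.
ATTRIBUTE_SUBJECTS: Dict[str, List[str]] = {
    "occupation": ["occupation", "profession"],
    "birthplace": ["birthplace", "hometown"],
}
# Linking verbs accepted as the predicate in this toy pattern.
COPULA_RE = "|".join(["is", "was"])


def split_sentences(text: str) -> List[str]:
    """Split on full stops (Western '.' and Chinese '。').

    Naive on purpose: abbreviations and decimals are not handled.
    """
    return [part.strip() for part in re.split(r"[.。]", text) if part.strip()]


def extract_unstructured(text: str) -> List[Tuple[str, str]]:
    """Return (attribute, object phrase) pairs found by the stand-in pattern."""
    facts: List[Tuple[str, str]] = []
    for sentence in split_sentences(text):
        for attribute, synonyms in ATTRIBUTE_SUBJECTS.items():
            for synonym in synonyms:
                pattern = rf"\b{re.escape(synonym)}\s+(?:{COPULA_RE})\s+(.+)$"
                match = re.search(pattern, sentence, flags=re.IGNORECASE)
                if match:
                    facts.append((attribute, match.group(1).strip()))
                    break  # one match per attribute per sentence
    return facts


bio = ("Li Ming studied in Chengdu. His occupation is software engineer. "
       "His hometown was Mianyang.")
facts = extract_unstructured(bio)
```

A real system would replace the regular expression with the output of a tagger and parser, so that the subject, predicate, and object are identified structurally rather than by word order.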
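As a rough illustration of this flow, the sketch below builds a small DOM tree with Python's standard `html.parser`, skips the ignorable tags listed in claim 2, computes the text node rate CNR(n) = CountText(n) / CountNode(n) from claim 4, and picks a body block. Several points are assumptions of the sketch rather than statements of the patent: a node counts itself in CountNode (so a leaf has CountNode = 1), ties on CNR are broken by character count, and the input is well-formed HTML.

```python
from html.parser import HTMLParser
from typing import List, Optional

# Ignorable tag set from claim 2; their whole subtrees are skipped.
IGNORED = {"script", "style", "br", "select", "input", "label", "comment", "nav"}
# Void elements never get an end tag, so they must not touch the stack.
VOID = {"br", "input", "img", "hr", "meta", "link"}


class Node:
    def __init__(self, tag: str, parent: Optional["Node"] = None) -> None:
        self.tag = tag
        self.parent = parent
        self.children: List["Node"] = []
        self.text = ""


class DomBuilder(HTMLParser):
    """Minimal tree builder for well-formed HTML (an assumption)."""

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside an ignored subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.skip_depth or tag in IGNORED:
            self.skip_depth += 1
            return
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if self.skip_depth:
            self.skip_depth -= 1
        elif self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        if not self.skip_depth:
            self.current.text += data.strip()


def count_text(node: Node) -> int:
    """CountText(n): text characters in the node and its descendants."""
    return len(node.text) + sum(count_text(c) for c in node.children)


def count_nodes(node: Node) -> int:
    """CountNode(n): assumed here to include the node itself."""
    return 1 + sum(count_nodes(c) for c in node.children)


def cnr(node: Node) -> float:
    return count_text(node) / count_nodes(node)


def find(node: Node, tag: str) -> Optional[Node]:
    if node.tag == tag:
        return node
    for child in node.children:
        hit = find(child, tag)
        if hit is not None:
            return hit
    return None


def pick_body_block(html: str) -> Optional[Node]:
    """Drop first-level <body> children with CNR 0, then take the best one."""
    builder = DomBuilder()
    builder.feed(html)
    body = find(builder.root, "body") or builder.root
    blocks = [child for child in body.children if cnr(child) > 0]
    return max(blocks, key=lambda b: (cnr(b), count_text(b)), default=None)


page = """<html><body>
<nav><a>Home</a><a>About</a></nav>
<div><span></span><span></span></div>
<div><p>Li Ming is a professor.</p><p>He works in Chengdu.</p></div>
<script>var x = 1;</script>
</body></html>"""
block = pick_body_block(page)
```

In the example page, the navigation bar and script are skipped, the empty `div` has a text node rate of 0 and is dropped, and the remaining `div` with two paragraphs is returned as the body block.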

Claims (10)

1. A method for extracting person-topic related information from web pages, characterized in that it comprises the following steps:
A. obtaining an HTML web page document that contains person-topic related information;
B. constructing the DOM tree corresponding to the HTML web page document of step A;
C. pre-processing the HTML web page document of step A;
D. calculating the text node rate of each DOM node according to the DOM tree of step B, and performing text segmentation on the HTML web page document;
E. screening the web page body text block according to the text node rates and text segments obtained in step D;
F. performing person information extraction on the body text block obtained in step E;
G. judging whether the information extracted in step F contains person information; if so, structuring the person information extracted in step F; if not, returning to step A.
2. The method for extracting person-topic related information from web pages according to claim 1, characterized in that pre-processing the HTML web page document of step A in step C specifically comprises deleting an ignorable tag set from the HTML web page document, the ignorable tag set including the <script> tag, <style> tag, <br> tag, <select> tag, <input> tag, <label> tag, <comment> tag, and <nav> tag.
3. The method for extracting person-topic related information from web pages according to claim 1, characterized in that calculating the text node rate of each DOM node according to the DOM tree of step B in step D specifically comprises taking the <body> tag as the starting root node and recursively calculating the text node rate of each DOM node under the DOM tree.
4. The method for extracting person-topic related information from web pages according to claim 3, characterized in that the formula for calculating the text node rate of each DOM node is
$$\mathrm{CNR}(n) = \frac{\mathrm{CountText}(n)}{\mathrm{CountNode}(n)}$$
where CNR(n) is the text node rate of node n, CountText(n) is the number of all text characters under node n, and CountNode(n) is the number of all DOM nodes under node n.
5. The method for extracting person-topic related information from web pages according to claim 4, characterized in that the text segmentation performed on the HTML web page document in step D specifically comprises taking the first-level child nodes under the <body> tag as the parent nodes for aggregation, deleting the nodes under a parent node whose text node rate is 0, and aggregating the nodes whose text node rate is not equal to 0 under the larger node.
6. The method for extracting person-topic related information from web pages according to claim 1, characterized in that screening the web page body text block in step E according to the text node rates and text segments obtained in step D specifically comprises, based on the text node rate of the node to which each text segment belongs, choosing the text segment with the largest text node rate and the most text characters as the web page body text block.
7. The method for extracting person-topic related information from web pages according to claim 1, characterized in that performing person information extraction on the body text block obtained in step E in step F specifically comprises segmenting the body text block obtained in step E into words, normalizing the resulting words, then extracting the keywords that reflect person-topic related information, annotating them by class, and calculating the weight of each label class.
8. The method for extracting person-topic related information from web pages according to claim 1, characterized in that structuring the person information extracted in step F specifically comprises structuring unstructured person information and structuring semi-structured person information.
9. The method for extracting person-topic related information from web pages according to claim 8, characterized in that structuring the unstructured person information specifically comprises first splitting the unstructured person information into sentences, then performing part-of-speech tagging and syntactic analysis on each sentence, determining the subject, predicate, and object relations in the sentence, and extracting the noun phrase in the object to form structured person information together with the subject.
10. The method for extracting person-topic related information from web pages according to claim 8, characterized in that structuring the semi-structured person information specifically comprises extracting the person information from the semi-structured person information using the method in which a person attribute dictionary triggers matching rules, to form structured person information.
CN201710783655.4A 2017-09-04 2017-09-04 Webpage personage subject correlation message extracting method Pending CN107590219A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710783655.4A CN107590219A (en) 2017-09-04 2017-09-04 Webpage personage subject correlation message extracting method

Publications (1)

Publication Number Publication Date
CN107590219A true CN107590219A (en) 2018-01-16

Family

ID=61050702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710783655.4A Pending CN107590219A (en) 2017-09-04 2017-09-04 Webpage personage subject correlation message extracting method

Country Status (1)

Country Link
CN (1) CN107590219A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933027A (en) * 2015-06-12 2015-09-23 华东师范大学 Open Chinese entity relation extraction method using dependency analysis
CN106202259A (en) * 2016-06-29 2016-12-07 合肥民众亿兴软件开发有限公司 A kind of info web extracting method based on body thought

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
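Su Xiaolu: "Design and Implementation of a DOM-Based HTML Web Page Body Text Information Extraction Module", China Master's Theses Full-text Database, Information Science and Technology *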

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520007A (en) * 2018-03-15 2018-09-11 江河瑞通(北京)技术有限公司 Web page information extracting method, storage medium and computer equipment
CN108520007B (en) * 2018-03-15 2021-09-28 江河瑞通(北京)技术有限公司 Web page information extracting method, storage medium and computer equipment
CN108829696A (en) * 2018-04-18 2018-11-16 西安理工大学 Towards knowledge mapping node method for auto constructing in metro design code
CN108829696B (en) * 2018-04-18 2019-10-25 西安理工大学 Towards knowledge mapping node method for auto constructing in metro design code
CN108920434A (en) * 2018-06-06 2018-11-30 武汉酷犬数据科技有限公司 A kind of general Web page subject method for extracting content and system
CN108920434B (en) * 2018-06-06 2022-08-30 武汉酷犬数据科技有限公司 Universal webpage theme content extraction method and system
JP2020027649A (en) * 2018-08-15 2020-02-20 ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド Method, apparatus, device and storage medium for generating entity relationship data
US11321421B2 (en) 2018-08-15 2022-05-03 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, apparatus and device for generating entity relationship data, and storage medium
CN109325197B (en) * 2018-08-17 2022-07-15 百度在线网络技术(北京)有限公司 Method and device for extracting information
CN109325197A (en) * 2018-08-17 2019-02-12 百度在线网络技术(北京)有限公司 Method and apparatus for extracting information
CN109710833B (en) * 2018-12-29 2021-07-16 上海蜜度信息技术有限公司 Method and apparatus for determining content node
CN109710833A (en) * 2018-12-29 2019-05-03 上海蜜度信息技术有限公司 For determining the method and apparatus of content node
CN109977370B (en) * 2019-03-19 2023-06-16 河海大学常州校区 Automatic question-answer pair construction method based on document structure tree
CN109977370A (en) * 2019-03-19 2019-07-05 河海大学常州校区 It is a kind of based on the question and answer of document collection partition to method for auto constructing
CN110110193A (en) * 2019-04-24 2019-08-09 北京百炼智能科技有限公司 A kind of information processing method, device and computer readable storage medium
CN111966932A (en) * 2019-05-20 2020-11-20 富士通株式会社 Information processing method and information processing apparatus
CN110232125B (en) * 2019-06-11 2020-10-02 吉林大学 Method for extracting and aggregating academic figure information
CN110232125A (en) * 2019-06-11 2019-09-13 吉林大学 A method of it carrying out academic people information and extracts and polymerize
CN111625749A (en) * 2020-06-01 2020-09-04 深圳市小满科技有限公司 Method, device, equipment and medium for extracting detail page information of participating company website
CN111625749B (en) * 2020-06-01 2023-08-11 深圳市小满科技有限公司 Method, device, equipment and medium for extracting website detail page information of participant company
CN111698364A (en) * 2020-06-19 2020-09-22 深圳市小满科技有限公司 Contact person information extraction method and related equipment
CN112287273A (en) * 2020-10-27 2021-01-29 中国科学院计算技术研究所 Method, system and storage medium for classifying website list pages
CN112287273B (en) * 2020-10-27 2022-09-30 中国科学院计算技术研究所 Method, system and storage medium for classifying website list pages
CN114201971A (en) * 2021-12-13 2022-03-18 海南港航控股有限公司 Method and system for extracting character attributes from webpage
CN116127079A (en) * 2023-04-20 2023-05-16 中电科大数据研究院有限公司 Text classification method
CN116127079B (en) * 2023-04-20 2023-06-20 中电科大数据研究院有限公司 Text classification method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180116