CN107590219A - Method for extracting person-related topic information from web pages
Publication number: CN107590219A
Application number: CN201710783655.4A
Authority: CN (China)
Prior art keywords: text, node, webpage, web page, label
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a method for extracting person-related topic information from web pages. The method comprises: obtaining an HTML web page document; building the DOM tree corresponding to the document; preprocessing the document; calculating the text-node ratio of each DOM node and dividing the document into text blocks; screening for the main-body text block of the web page; extracting person information; and judging whether person information is present. The invention effectively solves the problem of extracting person-related information from web pages of all kinds and yields complete, structured person information.
Description
Technical field
The invention belongs to the technical field of web information extraction, and in particular relates to a method for extracting person-related topic information from web pages.
Background art
With the rapid development of Internet technology, the number of web pages of all kinds is growing explosively. By content, these pages fall broadly into display, content, e-commerce, and portal types. Extracting key, useful information efficiently and accurately from such a wide variety of websites has become essential, and it is also a considerable challenge. Web page content is extremely rich: alongside the topic information that the user wants to browse, there is noise such as advertising banners, page navigation bars, product recommendations, links, and website copyright notices. Users generally do not want to see this noise, and it also interferes heavily with big-data processing tasks on the target pages, such as web page classification and clustering, topic detection, and knowledge mining. Removing this noise and extracting the topic information of a page as completely as possible has therefore become an important data preprocessing task in web information retrieval and even web information mining.
Extracting this topic information from web pages falls within the scope of web information extraction, which means extracting data from semi-structured web documents and converting it into a more structured, semantically clearer representation. Work on extracting web page topic information centers mainly on the following approaches. First, methods based on template matching. These rely on the template shared by the pages of a website: the template is identified and then matched against each page to identify the topic information of the page. Second, methods based on heuristic rules. This category covers a variety of techniques, which can be subdivided into methods that build heuristic rules from HTML structural features, from HTML content features, and from HTML visual features. Third, methods based on machine learning. These are suited mainly to processing large-scale web data sets: a classification model for web page topic information is first trained on manually annotated web data, and the classifier is then used to identify the topic and non-topic information in a page.
HyperText Markup Language (HTML) is the basic language for writing web pages. "Hypertext" means that a page can contain non-text elements such as pictures, links, and even music and programs. A web page is also called an HTML document; combined with other web technologies (such as scripts, CGI, and components), HTML can be used to create powerful web pages. HTML documents use .htm or .html as their file extension and consist of HTML tags and plain text. Every standard HTML document has a basic overall structure. An HTML tag is a keyword enclosed in angle brackets, such as <html>. Tags generally occur in pairs, such as <b> and </b>: the first tag of the pair is the start tag and the second is the end tag. The web page that we normally see in a browser is the result of the browser parsing the HTML document; the browser does not display the HTML tags themselves but uses them to interpret the page content.
Because the raw data obtained when extracting topic information from a web page is the HTML document itself, a clear understanding of HTML syntax is required. The important information that can be extracted from a web page generally comes from the <head> and <body> tags.
The Document Object Model (DOM) is the standard programming interface recommended by the W3C for processing extensible markup languages. In a web page, the objects that make up the page (or document) are organized in a tree structure, and the standard model used to represent the objects in a document is called the DOM. The DOM provides an interface for accessing the properties and methods of each element in a page. Every web page corresponds to a DOM tree, and each element in the page can be processed by traversing that tree. Each node of the tree is an object. The DOM not only describes the structure of the document but also defines the behavior of node objects; using the methods and properties of these objects, the nodes and content of the DOM tree can easily be accessed, modified, added, and deleted.
The core idea of template-based extraction of web page topic content is to treat content that repeats across web pages as a template and to regard it as noise. The page to be processed is then matched against the template generated from a set of training pages, and whatever in the matching result is not part of the template is taken to be the topic content of the page.
Bar-Yossef et al. define a pagelet as a region block of a web page with its own distinct layout and style. Using the DOM tree, they split the page into pagelets and detect templates: the navigation bar, the core content block, advertisements, and so on are each a pagelet, and the page is denoised by deleting the noise that exists in the form of page templates. This was a relatively early attempt at web page denoising. Shian-Hua Lin et al. propose a method based on the <TABLE> tag and information entropy: the page is partitioned using <TABLE>, the information entropy of each page block is calculated to classify blocks as content blocks or noise blocks, and noise is finally removed by comparing entropy values. One shortcoming of this method is its heavy reliance on the <TABLE> tag; as HTML has developed, web design has gradually abandoned <TABLE>-based layout. Lochovsky et al. propose the DSE (Data-rich Section Extraction) algorithm: for pages within the same website, the DOM trees of pages sharing a template are matched top-down, overlapping or identical structures in the matching result are treated as non-topic information, and the content of leaf nodes is treated as topic content and extracted. Liu et al. propose the Site Style Tree (SST), whose basic idea is similar to DSE: a page-level style tree is first built for a website, a composite importance is calculated for each node in the style tree from its content features and visual features, and noise nodes and topic information nodes are finally distinguished by their importance. The method was ultimately applied to web page classification and clustering tasks, and experimental results show that it achieves good extraction; its drawback is that a corresponding style tree must be built for each website. Gupta et al. propose using machine learning to identify web page templates automatically and developed a web proxy tool, Content Extraction, which filters the noise in web pages and lets the granularity of the filtered content be controlled by adjusting its rule set. Ou Jianwen et al. propose generating templates with a machine-learning regression algorithm: by detecting the relations between links and identifying the features of anchor text, the template and extraction rules of the page are established, and the template is finally applied to extract the body content. Chen et al. propose integrating template detection into the index-building process of a search engine: pages are first divided into blocks, the blocks are then clustered using visual information such as style, and blocks that are laid out similarly across different pages are judged to be the template of the page.
Because template-based methods for extracting web page topic information require the pages being processed to have a particular structure, or require the structural features of the target pages to be learned in advance, some researchers have also proposed extraction methods that need no preprocessing or prior learning of the target pages. For example, Weninger et al. propose a content extraction method based on the text-to-tag ratio (TR). The ratio of the number of non-tag characters to the number of tags is calculated for each line of the HTML document to build a TR histogram for the whole document, and a threshold segmentation technique is then used to determine the optimal threshold separating the topic part of the page from the non-topic part, from which the topic information is extracted.
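The per-line ratio described above can be sketched as follows. This is a simplified illustration and not Weninger et al.'s published implementation: the names `text_tag_ratios` and `content_lines`, the regex-based tag matcher, and the caller-chosen fixed threshold are all assumptions made for brevity.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")  # naive tag matcher; an assumption for brevity


def text_tag_ratios(html):
    """Return one ratio per line: non-tag characters / number of tags."""
    ratios = []
    for line in html.splitlines():
        tag_count = len(TAG_RE.findall(line))
        text_chars = len(TAG_RE.sub("", line).strip())
        # Assumption: a line without tags is divided by 1 to avoid a zero denominator.
        ratios.append(text_chars / max(tag_count, 1))
    return ratios


def content_lines(html, threshold):
    """Keep the lines whose ratio reaches a caller-chosen threshold."""
    lines = html.splitlines()
    return [line for line, r in zip(lines, text_tag_ratios(html)) if r >= threshold]


page = "<div><a href='#'>Home</a></div>\n<p>This is a long paragraph of body text.</p>"
print(text_tag_ratios(page))
print(content_lines(page, 5.0))
```

The navigation line scores low because its few characters are spread over four tags, while the paragraph line scores high; a real implementation would pick the threshold from the histogram rather than hard-code it.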
However, when the text in the HTML is sparsely distributed, or when the header or footer contains a large amount of text, the body text determined by this method is often wrong.
At present, template-based extraction of web page topic information is designed mainly for the large number of pages on the Internet that are generated automatically by reading data from a database and filling it into a uniform template. Such pages generally come from the same website and have similar HTML structure. These methods generally treat the text inside <div> tags as the main part of the page: they generate a page template by learning the HTML tag structure of these pages, manually annotate the tags that carry topic information, and, when a page with a similar template is input, extract the topic information according to the annotated tags. Clearly, when the pages come from many different websites, the differences between their templates mean that the extracted templates are not general. Moreover, because the structural information of today's web pages is no longer obtained from <div> tags but from CSS (Cascading Style Sheets), these methods fail.
Therefore, how to extract web page topic information accurately and efficiently from a wide variety of web pages has become a focus and a difficulty of current research.
Summary of the invention
The object of the invention is as follows: to solve the above problems in the prior art, and in view of the complexity of web page structure types, the invention proposes a non-template method for extracting person-related topic information from web pages.
The technical solution of the invention is a method for extracting person-related topic information from web pages, comprising the following steps:
A. Obtain an HTML web page document containing person-related topic information.
B. Build the DOM tree corresponding to the HTML web page document of step A.
C. Preprocess the HTML web page document of step A.
D. Calculate the text-node ratio of each DOM node from the DOM tree of step B, and divide the HTML web page document into text blocks.
E. Screen for the main-body text block of the web page according to the text-node ratios and text blocks obtained in step D.
F. Perform person information extraction on the main-body text block obtained in step E.
G. Judge whether the information extracted in step F contains person information; if so, structure the person information extracted in step F; if not, return to step A.
Further, the preprocessing of the HTML web page document in step C specifically deletes the set of ignorable tags from the document; the set of ignorable tags comprises the <script>, <style>, <br>, <select>, <input>, <label>, <comment>, and <nav> tags.
Further, in step D, calculating the text-node ratio of each DOM node from the DOM tree of step B specifically takes the <body> tag as the starting root node and recursively calculates the text-node ratio of each DOM node in the DOM tree.
Further, the text-node ratio of each DOM node is calculated as
$$\mathrm{CNR}(n) = \frac{\mathrm{CountText}(n)}{\mathrm{CountNode}(n)}$$
where CNR(n) is the text-node ratio of node n, CountText(n) is the number of text characters under node n, and CountNode(n) is the number of DOM nodes under node n.
Further, the text blocking of the HTML web page document in step D specifically takes the first-level child nodes under the <body> tag as aggregation parent nodes, deletes the nodes under each parent node whose text-node ratio is 0, and aggregates the nodes whose text-node ratio is not 0 into the node with the larger text-node ratio.
Further, in step E, screening for the main-body text block of the web page according to the text-node ratios and text blocks obtained in step D specifically selects, according to the text-node ratio of the node to which each text block belongs, the text block with the largest text-node ratio and the most text characters as the main-body text block of the web page.
Further, the person information extraction performed in step F on the main-body text block obtained in step E specifically tokenizes the main-body text block, normalizes the resulting words, then extracts the keywords that reflect person-related topic information, annotates them by category, and calculates the weight of each annotation category.
Further, structuring the person information extracted in step F specifically comprises structuring unstructured person information and structuring semi-structured person information.
Further, structuring unstructured person information specifically comprises first splitting the unstructured person information into sentences, then performing part-of-speech tagging and syntactic analysis on each sentence to determine its subject, predicate, and object, and extracting the noun phrase in the object, which together with the subject forms structured person information.
Further, structuring semi-structured person information specifically uses a method based on matching rules triggered by a person-attribute dictionary to extract the person information from the semi-structured person information and form structured person information.
The beneficial effects of the invention are as follows. The invention parses the HTML document into a DOM tree using DOM technology and preprocesses the HTML document with the API provided by the DOM tree; it then extracts the text information in the web page, applies natural language processing techniques such as tokenization, part-of-speech tagging, and named entity recognition to the extracted body text to extract and judge person-related information, and finally converts semi-structured or unstructured person information into structured person information by formulating rules related to person attributes. It thereby effectively solves the problem of extracting person-related information from web pages of all kinds and yields complete, structured person information.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention for extracting person-related topic information from web pages.
Fig. 2 is a schematic diagram of building the DOM tree of an HTML web page document in an embodiment of the invention.
Fig. 3 is a schematic diagram of the state transitions of tokenization in an embodiment of the invention.
Fig. 4 is a schematic diagram of the state transitions of tree construction in an embodiment of the invention.
Fig. 5 is a schematic diagram of the DOM tree structure of node n in an embodiment of the invention.
Fig. 6 is a schematic diagram of the distribution of the nodes of the DOM tree in an embodiment of the invention.
Fig. 7 is a schematic diagram of the DOM tree after text blocking in an embodiment of the invention.
Fig. 8 is a schematic diagram of the extraction of unstructured person information in an embodiment of the invention.
Detailed description of the embodiments
To make the object, technical solution, and advantages of the invention clearer, the invention is further described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Fig. 1 is a flow chart of the method of the invention for extracting person-related topic information from web pages. The method comprises the following steps:
A. Obtain an HTML web page document containing person-related topic information.
B. Build the DOM tree corresponding to the HTML web page document of step A.
C. Preprocess the HTML web page document of step A.
D. Calculate the text-node ratio of each DOM node from the DOM tree of step B, and divide the HTML web page document into text blocks.
E. Screen for the main-body text block of the web page according to the text-node ratios and text blocks obtained in step D.
F. Perform person information extraction on the main-body text block obtained in step E.
G. Judge whether the information extracted in step F contains person information; if so, structure the person information extracted in step F; if not, return to step A.
The invention mainly targets person information extraction from English web pages, and the dictionaries it uses are for English text.
In step B, the invention builds the DOM tree corresponding to the HTML web page document of step A, that is, converts the document into a DOM tree structure. The input HTML web page document is first parsed to generate the DOM tree. The Document Object Model (DOM) is a W3C standard and a set of API interfaces for browser-based programming. The invention uses the HTML DOM, which defines the objects and properties of all HTML elements and the methods for accessing them. Parsing an HTML document into an HTML DOM tree is implemented mainly by two algorithms: tokenization (Tokeniser) and tree construction (Tree Construction). Tokenization is a lexical analysis process that parses the input into tokens. HTML tokens include start tags, end tags, attribute names, and attribute values. The tokenizer recognizes a token, passes it to the tree constructor, and then consumes the next character to recognize the next token, and so on until the end of the input. Fig. 2 is a schematic diagram of building the DOM tree of an HTML web page document in an embodiment of the invention.
The output of the tokenization algorithm used by the tokenizer is a sequence of HTML tokens, and the algorithm is expressed as a state machine. The state machine has four states: the data state (Data), the tag open state (Tag open), the tag name state (Tag name), and the close tag open state (Close tag open state). Fig. 3 is a schematic diagram of the state transitions of tokenization in an embodiment of the invention.
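The four-state tokenizer described above can be sketched as below. This is a deliberately reduced illustration, assuming tags without attributes, comments, or character references; the `State` enum, the `tokenize` function, and the token tuples are hypothetical names, not the patent's or any browser's implementation.

```python
from enum import Enum, auto


class State(Enum):
    DATA = auto()
    TAG_OPEN = auto()
    TAG_NAME = auto()
    CLOSE_TAG_OPEN = auto()


def tokenize(html):
    """Emit ("start", name), ("end", name) and ("text", data) tokens."""
    tokens, buf, closing = [], "", False
    state = State.DATA
    for ch in html:
        if state is State.DATA:
            if ch == "<":                      # a tag begins: flush pending text
                if buf:
                    tokens.append(("text", buf))
                buf, state = "", State.TAG_OPEN
            else:
                buf += ch
        elif state is State.TAG_OPEN:
            if ch == "/":                      # "</" introduces an end tag
                closing, state = True, State.CLOSE_TAG_OPEN
            else:                              # first character of a start tag name
                closing, buf, state = False, ch, State.TAG_NAME
        elif state is State.CLOSE_TAG_OPEN:    # first character of an end tag name
            buf, state = ch, State.TAG_NAME
        elif state is State.TAG_NAME:
            if ch == ">":                      # tag complete: emit and return to data
                tokens.append(("end" if closing else "start", buf))
                buf, state = "", State.DATA
            else:
                buf += ch
    if state is State.DATA and buf:            # trailing text at end of input
        tokens.append(("text", buf))
    return tokens


print(tokenize("<p>Hello</p>"))
```

Each emitted token would be handed to the tree constructor as soon as it is recognized; here they are collected in a list only to keep the sketch self-contained.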
The DOM tree is built while the tokenization algorithm runs. During the tree construction stage, the DOM tree rooted at the Document node is continually modified and elements are added to it. Each token emitted by the tokenizer is processed by the tree constructor. Every token has a corresponding DOM element, which is created when the token is received. Elements in the DOM tree are also pushed onto a stack of open elements, which is used to correct nesting errors and to handle unclosed tags. The tree construction algorithm can likewise be described as a state machine, which completes the corresponding state transitions and DOM node creation as it receives HTML tokens. Fig. 4 is a schematic diagram of the state transitions of tree construction in an embodiment of the invention.
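The role of the stack of open elements can be illustrated with Python's standard `html.parser` module doing the tokenizing. This is a minimal sketch, not the full tree construction algorithm: the `Node` and `TreeBuilder` classes, the `build_dom` helper, and the short `VOID` set are assumptions for illustration, and error recovery is limited to closing unclosed descendants when an ancestor's end tag arrives.

```python
from html.parser import HTMLParser

VOID = {"br", "img", "input", "meta", "link", "hr"}  # elements with no end tag


class Node:
    def __init__(self, tag, text=""):
        self.tag = tag            # element name, or "#text" for a text node
        self.text = text
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.open_elements = [self.root]   # stack of open elements

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self.open_elements[-1].children.append(node)
        if tag not in VOID:
            self.open_elements.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, which also closes
        # any unclosed descendants; unmatched end tags are ignored.
        for i in range(len(self.open_elements) - 1, 0, -1):
            if self.open_elements[i].tag == tag:
                del self.open_elements[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.open_elements[-1].children.append(Node("#text", data.strip()))


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# <p> and <b> are never closed; the </div> end tag closes them implicitly.
root = build_dom("<html><body><div><p>Hello<b>World!</div></body></html>")
```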
In step C, the invention takes into account that some decorative tags in the HTML document can be ignored for text extraction, because they generally contain no text information. The API provided by the DOM is therefore used to preprocess the HTML web page document of step A and remove these ignorable tags in advance.
The preprocessing of the HTML web page document of step A specifically deletes the set of ignorable tags from the document; the set comprises the <script>, <style>, <br>, <select>, <input>, <label>, <comment>, and <nav> tags. These ignorable tags are described below:
(1) <script> tag: defines a client-side script, such as JavaScript;
(2) <style> tag: defines style information for the HTML document;
(3) <select> tag: creates a single-choice or multiple-choice menu;
(4) <input> tag: provides a field in which data can be entered;
(5) <label> tag: defines a label for an input element;
(6) <nav> tag: defines a section of navigation links.
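The preprocessing step can be sketched as a recursive prune over an in-memory tree. The `Node` dataclass and the `prune` function below are hypothetical stand-ins for whatever DOM API is actually used; only the tag set comes from the text above.

```python
from dataclasses import dataclass, field

# The ignorable tag set listed in the description.
IGNORABLE = {"script", "style", "br", "select", "input", "label", "comment", "nav"}


@dataclass
class Node:
    tag: str                      # element name, or "#text" for a text node
    text: str = ""
    children: list = field(default_factory=list)


def prune(node):
    """Remove, in place, every subtree rooted at an ignorable tag."""
    node.children = [c for c in node.children if c.tag not in IGNORABLE]
    for child in node.children:
        prune(child)
    return node


body = Node("body", children=[
    Node("nav", children=[Node("#text", "Home")]),
    Node("div", children=[
        Node("script", children=[Node("#text", "var x;")]),
        Node("#text", "Article text"),
    ]),
])
prune(body)
print([c.tag for c in body.children])
```

Removing the whole subtree, rather than only the tag itself, also discards script source and navigation link text, so neither reaches the later text-node ratio calculation.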
In step D, the invention calculates the text-node ratio of each DOM node from the DOM tree of step B and divides the HTML web page document into text blocks, that is, it uses the structural features of the DOM tree to mine the text in the web page and divide it into blocks. After the HTML document has been preprocessed, what remains in it is essentially container-type tags that may contain text, such as <div>, <table>, and <li>. To extract the text that may be present in these tags, this patent proposes a new method based on the DOM text-node ratio (CNR), which also achieves text blocking (whose purpose is to place the body text, advertising information, navigation bar information, footer copyright information, and other text of the page in separate blocks).
The text-node ratio of a DOM node is abbreviated CNR (chars nodes ratio). It is the ratio of the total number of text characters under the node to the total number of nodes under the node:
$$\mathrm{CNR}(n) = \frac{\mathrm{CountText}(n)}{\mathrm{CountNode}(n)}$$
where CNR(n) is the text-node ratio of node n, CountText(n) is the number of text characters under node n, and CountNode(n) is the number of DOM nodes under node n.
Fig. 5 is a schematic diagram of the DOM tree structure of node n in an embodiment of the invention. The CNR values of the DOM tree of node n and of its subtrees are calculated as follows.
Node n contains "Hello" and "World!", 11 text characters in total, and 4 nodes (n, n1, t1, t2), so
$$\mathrm{CNR}(n) = \frac{11}{4} = 2.75$$
Node n1 contains "Hello", 5 characters in total, and 2 nodes (n1, t1), so
$$\mathrm{CNR}(n_1) = \frac{5}{2} = 2.5$$
The CNR values of the remaining nodes can be obtained in the same way.
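The worked example can be checked with a short recursive sketch. The `Node` dataclass and the three functions are illustrative names; the tree shape (n with children n1 and t2, and n1 with child t1) is inferred from the node counts in the text, and counting a node itself in CountNode follows the example's count of 4 nodes for n.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                      # element name, or "#text" for a text node
    text: str = ""
    children: list = field(default_factory=list)


def count_text(node):
    """Total number of text characters in the subtree rooted at node."""
    return len(node.text) + sum(count_text(c) for c in node.children)


def count_node(node):
    """Total number of nodes in the subtree, counting node itself."""
    return 1 + sum(count_node(c) for c in node.children)


def cnr(node):
    return count_text(node) / count_node(node)


t1 = Node("#text", "Hello")
t2 = Node("#text", "World!")
n1 = Node("b", children=[t1])
n = Node("div", children=[n1, t2])

print(cnr(n))    # 2.75
print(cnr(n1))   # 2.5
```

Applying the same functions from the `<body>` node downward gives the recursive calculation over the whole DOM tree.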
The CNR value of a node n that contains text has two important properties:
(1) If n has only one child node nc, then CNR(n) < CNR(nc);
(2) If n has multiple child nodes n1, n2, n3, ..., nk and these child nodes all contain text information, then CNR(n) > CNR(ni), i ∈ {1, ..., k}.
Because every HTML document has a DOM tree structure and the text of a web page is located entirely within the <body> tag (the <head> tag defines the document header, which provides basic information about the page to search engines and is not displayed in the browser), the <body> tag is taken as the starting point of the algorithm when CNR is calculated in this part. Calculating the text-node ratio of each DOM node from the DOM tree specifically takes the <body> tag as the starting root node and recursively calculates the text-node ratio of each DOM node in the DOM tree.
Once the CNR value of each node has been obtained, the text of the text nodes that belong to the same parent node can be aggregated according to the properties of CNR given above. The specific rule is: take the first-level child nodes under the <body> tag as aggregation parent nodes, delete the nodes under them whose CNR equals 0, and aggregate the nodes whose CNR is not 0 into the node with the larger CNR. This completes the text blocking of the different regions of the web page. Fig. 6 is a schematic diagram of the distribution of the nodes of the DOM tree in an embodiment of the invention, in which circular nodes represent HTML container-type tag nodes and square nodes represent text segments in the HTML. Fig. 7 is a schematic diagram of the DOM tree after text blocking in an embodiment of the invention.
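One possible reading of the aggregation rule is sketched below: each first-level child of `<body>` becomes one block, zero-CNR subtrees are dropped, and the remaining text is concatenated under that child. How the original merges nodes "into the node with the larger CNR" is not specified in detail, so this flattening, along with the `Node`, `collect_text`, and `text_blocks` names, is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str                      # element name, or "#text" for a text node
    text: str = ""
    children: list = field(default_factory=list)


def count_text(node):
    return len(node.text) + sum(count_text(c) for c in node.children)


def count_node(node):
    return 1 + sum(count_node(c) for c in node.children)


def cnr(node):
    return count_text(node) / count_node(node)


def collect_text(node, out):
    """Append the text of every node in the subtree whose CNR is non-zero."""
    if cnr(node) == 0:
        return                    # zero-CNR subtrees are dropped
    if node.text:
        out.append(node.text)
    for child in node.children:
        collect_text(child, out)


def text_blocks(body):
    blocks = []
    for parent in body.children:  # first-level children are the aggregation parents
        texts = []
        collect_text(parent, texts)
        if texts:
            blocks.append({"cnr": cnr(parent), "text": " ".join(texts)})
    return blocks


body = Node("body", children=[
    Node("div", children=[Node("a", children=[Node("#text", "Home")]),
                          Node("a", children=[Node("#text", "About")])]),
    Node("div", children=[Node("p", children=[
        Node("#text", "Alan Turing was a mathematician.")])]),
    Node("div", children=[Node("img")]),   # no text, so CNR is 0
])
for block in text_blocks(body):
    print(block)
```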
In step E, the invention screens for the main-body text block of the web page according to the text-node ratios and text blocks obtained in step D, that is, it further selects the block containing the body text from the text blocks into which the page has been divided. After text blocking, the text in the web page is divided into four main text blocks: block 1, block 2, block 3, and block 4. These blocks belong to the nodes <div>1, <div>2, <div>3, and <div>4 respectively. The main-body text block is identified by the CNR values of these nodes: the block with the largest CNR value and the most text characters is selected as the main-body text block of the web page. The text in block 1 and block 2 is usually mostly the text of the navigation bar and search bar, and the text in block 4 is mostly footer text such as the copyright notice and the website contact address. By the above criterion, the body text block in this web page is determined to be block 3. With this method, the body text information in a web page can be extracted well.
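The selection rule can be sketched as a single `max` over the blocks. The source does not say how the two criteria are combined when they disagree, so ranking by CNR first and using character count as a tie-breaker is an assumption, as are the `select_main_block` name and the sample values.

```python
def select_main_block(blocks):
    """Pick the block with the largest CNR; character count breaks ties."""
    return max(blocks, key=lambda b: (b["cnr"], len(b["text"])))


# Hypothetical blocks standing in for block 1 to block 4 of the description.
blocks = [
    {"name": "block 1", "cnr": 1.8, "text": "Home About Contact"},
    {"name": "block 2", "cnr": 1.5, "text": "Search"},
    {"name": "block 3", "cnr": 21.0,
     "text": "Alan Turing was an English mathematician and computer scientist."},
    {"name": "block 4", "cnr": 6.0, "text": "Copyright Example Site"},
]
print(select_main_block(blocks)["name"])   # block 3
```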
In step F, the invention performs person information extraction on the main-body text block obtained in step E, that is, it extracts person-related information from the body text obtained and judges whether person information is present in the web page; if it is, the unstructured or semi-structured person information is expressed as structured person information. Specifically, the main-body text block obtained in step E is tokenized, the resulting words are normalized, and the keywords that reflect person-related topic information are then extracted and annotated by category, and the weight of each annotation category is calculated.
(1) Tokenization and normalization
Tokenization and normalization means splitting the body text of the web page into words and normalizing those words at the same time. Tokenization uses a rule-based method: Python regular expressions split the text into words and individual punctuation marks. The words are then normalized, for example by unifying letter case and removing some punctuation marks.
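A regular-expression tokenizer of the kind described might look like the sketch below. The pattern, the `DROP` set of punctuation to remove, and the function names are assumptions; the source states only that Python regular expressions are used and that case is unified and some punctuation removed.

```python
import re

# A word (optionally with an apostrophe suffix such as "'s"), or one punctuation mark.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

# Assumption: which punctuation marks count as removable is a design choice.
DROP = set(",;:\"()[]")


def tokenize(text):
    return TOKEN_RE.findall(text)


def normalize(tokens):
    """Lower-case every token and drop the punctuation marks listed in DROP."""
    return [t.lower() for t in tokens if t not in DROP]


tokens = tokenize("Dr. Smith's lab, in Boston.")
print(tokens)
print(normalize(tokens))
```

Because named entity recognizers often rely on capitalization, a pipeline may prefer to keep the original tokens for tagging and use the normalized forms only for dictionary matching.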
(2) Keyword matching and weight calculation
Keyword matching and weight calculation means extracting from the text the keywords that can reflect person information and calculating their associated weights. After tokenization, named entity recognition is applied to the text: the named entity recognizer developed by the Stanford NLP Group is used to annotate it. The recognizer is trained with a four-class model: person (Person), location (Location), organization (Organization), and miscellaneous (Misc). Because this patent focuses on extracting person information from web pages, the four label classes are assigned different weights: the person label has the largest weight, the organization label the second largest, the location label the third, and the miscellaneous label the smallest.
In step G, the invention calculates the sum of the keyword matches and weights in these text blocks and compares it with a keyword threshold obtained in advance from experimental statistics to judge whether the text block contains person information. If it does, the person information is further structured; if it does not, processing ends or a new HTML document is input for processing.
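The weighted score and threshold test can be sketched as follows, taking as input (token, label) pairs such as any named entity tagger would produce. The numeric weights and the threshold are hypothetical: the source gives only the ordering person > organization > location > miscellaneous and says the threshold comes from experimental statistics.

```python
# Hypothetical weights that respect the ordering given in the description.
WEIGHTS = {"PERSON": 4.0, "ORGANIZATION": 3.0, "LOCATION": 2.0, "MISC": 1.0}


def person_score(tagged_tokens):
    """Sum the weights of all labelled tokens; unlabelled tokens count as 0."""
    return sum(WEIGHTS.get(label, 0.0) for _, label in tagged_tokens)


def contains_person_info(tagged_tokens, threshold=8.0):
    # Assumption: the default threshold is a placeholder, not a value from the source.
    return person_score(tagged_tokens) >= threshold


tagged = [("Alan", "PERSON"), ("Turing", "PERSON"), ("worked", "O"),
          ("at", "O"), ("Bletchley", "LOCATION"), ("Park", "LOCATION")]
print(person_score(tagged))          # 12.0
print(contains_person_info(tagged))  # True
```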
Structuring person information is the process of further extracting person-related information from a web page that has been determined to contain person information and expressing that information as structured person information. Most person information on web pages appears in unstructured or semi-structured form. Because unstructured and semi-structured information differ considerably, the invention proposes two different strategies to handle the two cases.
For semi-structured person information, the person attributes in this kind of text already correspond partly to the person-related information, so this patent adopts a method based on matching rules triggered by a person-attribute dictionary and uses the named entity annotation results to extract and structure the person information in the web page. Table 1 gives examples of the extraction rules for some attribute words and their synonyms.
Table 1. Some person attribute words and the corresponding extraction rules
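A dictionary-triggered rule for "attribute: value" lines might be sketched as below. The dictionary entries, the colon-separated line format, and the names `ATTRIBUTE_SYNONYMS` and `extract_semi_structured` are hypothetical illustrations; they are not the contents of Table 1, and the use of named entity labels to validate values is omitted.

```python
import re

# Hypothetical attribute dictionary: canonical attribute -> trigger synonyms.
ATTRIBUTE_SYNONYMS = {
    "name": {"name", "full name"},
    "birth_date": {"born", "date of birth", "birth date"},
    "occupation": {"occupation", "profession", "job"},
}
# Reverse lookup: trigger word -> canonical attribute.
TRIGGER = {syn: attr for attr, syns in ATTRIBUTE_SYNONYMS.items() for syn in syns}

LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def extract_semi_structured(text):
    """Map each 'key: value' line with a known trigger word to its attribute."""
    record = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue
        attribute = TRIGGER.get(match.group(1).lower())
        if attribute:
            record[attribute] = match.group(2)
    return record


profile = "Name: Alan Turing\nBorn: 23 June 1912\nProfession: Mathematician\nWebsite: example.org"
print(extract_semi_structured(profile))
```

Lines whose key is not in the dictionary, such as the website line here, are simply skipped.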
For the extraction of unstructured person information, the information is embedded in long passages of text, so the rules established for semi-structured person information are not sufficient on their own. This patent therefore uses an unstructured extraction algorithm that comprises a series of processing steps for extracting the person information from the unstructured text. First, the long text is split into sentences, using the full stops in the text as sentence boundaries. Next, part-of-speech tagging and syntactic analysis are performed on each sentence to determine its subject, predicate, and object. Attention is then focused on sentences whose subject is a synonym of a person-related attribute and whose object is a noun phrase. The predicate structure between subject and object, such as the verb, is checked to determine whether it is subordinate to the subject word; if it is, the noun phrase of the object is extracted. Figure 8 is a schematic diagram of unstructured person information extraction in an embodiment of the present invention.
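The sentence-level flow described above can be sketched roughly as follows. This is a simplified stand-in, not the patent's algorithm: a real implementation would use a part-of-speech tagger and a syntactic parser, whereas this sketch assumes a hypothetical synonym list and treats a small set of linking verbs as the predicate. Splitting on every full stop also mishandles abbreviations and decimals.

```python
import re

# Hypothetical attribute synonyms that may act as sentence subjects.
SUBJECT_SYNONYMS = {
    "occupation": ["occupation", "profession"],
    "birthplace": ["birthplace", "place of birth"],
}
# Stand-in for predicate analysis: a small set of linking verbs (assumption).
LINKING_VERBS = ["is", "was"]


def split_sentences(text):
    """Split on full stops ('.' or the Chinese '。'), dropping empty pieces."""
    return [s.strip() for s in re.split(r"[.。]", text) if s.strip()]


def extract_unstructured(text):
    """Return {attribute: object phrase} for subject-verb-object matches."""
    verbs = "|".join(LINKING_VERBS)
    record = {}
    for sentence in split_sentences(text):
        for attribute, words in SUBJECT_SYNONYMS.items():
            subjects = "|".join(re.escape(w) for w in words)
            # Subject synonym, then a linking verb, then the object phrase.
            pattern = rf"\b(?:{subjects})\s+(?:{verbs})\s+(?P<object>.+)$"
            match = re.search(pattern, sentence, re.IGNORECASE)
            if match:
                record.setdefault(attribute, match.group("object").strip())
    return record
```

Only sentences whose subject is one of the listed attribute words contribute to the result; other sentences are skipped, mirroring the filtering step in the description.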
The present invention first parses the web page to generate the corresponding DOM tree structure, and then filters the body text out of the web page by calculating the text node rate of the tags. Because noise such as navigation bars, advertisements, and site copyright notices has been removed from it, the body text can serve as a high-quality data source for various text data mining tasks. By analyzing the body text, matching keywords related to person attributes, and calculating their weights, the method can better determine whether the web page contains person information. For body text that contains person information, two different processing strategies are used for unstructured text and semi-structured text, so that person information is extracted more effectively and converted into structured person information.
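The parsing, preprocessing, and text node rate steps summarized above can be sketched in Python with the standard-library `html.parser` module. The sketch rests on several assumptions that the text does not settle: text nodes are counted as DOM nodes, "under node n" excludes n itself, whitespace-only text is ignored, a node with no descendants gets a rate of 0, and the `<comment>` tag is treated as covering HTML comments. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Ignorable tags listed in the claims. HTML comments are dropped as well,
# because HTMLParser.handle_comment is left as a no-op (an assumption).
IGNORABLE = {"script", "style", "br", "select", "input", "label",
             "comment", "nav"}
# HTML void elements never receive a closing tag, so they are never pushed.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, text=""):
        self.tag = tag        # element name, or "#text" for a text node
        self.text = text      # characters held by a text node
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree while skipping ignorable subtrees."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]
        self.skip_depth = 0   # > 0 while inside an ignorable element

    def handle_starttag(self, tag, attrs):
        if tag in IGNORABLE:
            if tag not in VOID:
                self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in IGNORABLE:
            if tag not in VOID and self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth:
            return
        # Pop back to the matching open element, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and not self.skip_depth:
            self.stack[-1].children.append(Node("#text", text))


def count_text(node):
    """CountText(n): text characters in n and everything below it."""
    if node.tag == "#text":
        return len(node.text)
    return sum(count_text(child) for child in node.children)


def count_nodes(node):
    """CountNode(n): all descendant nodes of n, text nodes included."""
    return sum(1 + count_nodes(child) for child in node.children)


def cnr(node):
    """Text node rate CNR(n) = CountText(n) / CountNode(n)."""
    nodes = count_nodes(node)
    return count_text(node) / nodes if nodes else 0.0


def find(node, tag):
    """Depth-first search for the first node with the given tag."""
    if node.tag == tag:
        return node
    for child in node.children:
        hit = find(child, tag)
        if hit is not None:
            return hit
    return None


def first_level_rates(html):
    """Return (tag, CNR) for each first-level child of <body>."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    body = find(builder.root, "body") or builder.root
    return [(child.tag, cnr(child)) for child in body.children]
```

Under these assumptions, a navigation block disappears entirely during preprocessing, a `<div>` holding a paragraph of prose receives a high rate, and a `<div>` holding only an image receives a rate of 0.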
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these particular statements and embodiments. Based on the technical teachings disclosed by the invention, those of ordinary skill in the art can make various other specific variations and combinations that do not depart from the essence of the invention, and these variations and combinations remain within the scope of protection of the invention.
Claims (10)
1. A method for extracting person-topic-related information from web pages, characterized in that it comprises the following steps:
A. obtaining an HTML web page document containing person-topic-related information;
B. constructing the DOM tree corresponding to the HTML web page document of step A;
C. preprocessing the HTML web page document of step A;
D. calculating the text node rate of each DOM node from the DOM tree of step B, and performing text blocking on the HTML web page document;
E. screening the web page body text block according to the text node rates and text blocks obtained in step D;
F. performing person information extraction on the body text block obtained in step E;
G. judging whether the information extracted in step F contains person information; if so, structuring the person information extracted in step F; if not, returning to step A.
2. The method for extracting person-topic-related information from web pages according to claim 1, characterized in that preprocessing the HTML web page document of step A in step C specifically comprises deleting the set of ignorable tags from the HTML web page document, the set of ignorable tags comprising the <script> tag, the <style> tag, the <br> tag, the <select> tag, the <input> tag, the <label> tag, the <comment> tag, and the <nav> tag.
3. The method for extracting person-topic-related information from web pages according to claim 1, characterized in that calculating the text node rate of each DOM node from the DOM tree of step B in step D specifically comprises taking the <body> tag as the starting root node and recursively calculating the text node rate of each DOM node under the DOM tree.
4. The method for extracting person-topic-related information from web pages according to claim 3, characterized in that the formula for calculating the text node rate of each DOM node is
$$\mathrm{CNR}(n) = \frac{\mathrm{CountText}(n)}{\mathrm{CountNode}(n)}$$
where CNR(n) is the text node rate of node n, CountText(n) is the number of all text characters under node n, and CountNode(n) is the number of all DOM nodes under node n.
5. The method for extracting person-topic-related information from web pages according to claim 4, characterized in that step E, based on the text blocking performed on the HTML web page document in step D, specifically comprises taking the first-level child nodes under the <body> tag as the parent nodes for aggregation, deleting the nodes under a parent node whose text node rate is 0, and aggregating the nodes whose text node rate is not 0 under the larger node.
6. The method for extracting person-topic-related information from web pages according to claim 1, characterized in that screening the web page body text block in step E according to the text node rates and text blocks obtained in step D specifically comprises, according to the text node rate of the node to which each text block belongs, choosing the text block with the largest text node rate and the most text characters as the web page body text block.
7. The method for extracting person-topic-related information from web pages according to claim 1, characterized in that performing person information extraction in step F on the body text block obtained in step E specifically comprises segmenting the body text block obtained in step E into words, normalizing the resulting words, extracting the keywords that reflect person-topic-related information and annotating them by category, and calculating the weight of each annotation category.
8. The method for extracting person-topic-related information from web pages according to claim 1, characterized in that structuring the person information extracted in step F specifically comprises structuring unstructured person information and structuring semi-structured person information.
9. The method for extracting person-topic-related information from web pages according to claim 8, characterized in that structuring the unstructured person information specifically comprises first splitting the unstructured person information into sentences, then performing part-of-speech tagging and syntactic analysis on each sentence, determining the subject, predicate, and object relations in the sentence, and extracting the noun phrase in the object and combining it with the subject to form structured person information.
10. The method for extracting person-topic-related information from web pages according to claim 8, characterized in that structuring the semi-structured person information specifically comprises extracting the person information from the semi-structured person information by a method in which a person attribute dictionary triggers matching rules, thereby forming structured person information.
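The block screening of claims 5 and 6 can be sketched as below, under assumptions the claims leave open: blocks with a text node rate of 0 are discarded first, and when the block with the largest text node rate is not also the one with the most characters, the rate is compared first and the character count breaks ties. The `TextBlock` type and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextBlock:
    label: str             # hypothetical identifier for the block
    text_node_rate: float  # CNR of the node the block belongs to
    char_count: int        # number of text characters in the block


def select_body_block(blocks: List[TextBlock]) -> Optional[TextBlock]:
    """Pick the body text block, or None if every block has rate 0."""
    # Blocks whose text node rate is 0 carry no text and are dropped.
    candidates = [b for b in blocks if b.text_node_rate > 0]
    if not candidates:
        return None
    # Assumed ordering: highest rate first, character count as tie-breaker.
    return max(candidates, key=lambda b: (b.text_node_rate, b.char_count))
```

A different tie-breaking order, for example weighting character count above the rate, would be an equally plausible reading of claim 6.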
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710783655.4A CN107590219A (en) | 2017-09-04 | 2017-09-04 | Webpage personage subject correlation message extracting method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710783655.4A CN107590219A (en) | 2017-09-04 | 2017-09-04 | Webpage personage subject correlation message extracting method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107590219A (en) | 2018-01-16 |
Family
ID=61050702
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710783655.4A Pending CN107590219A (en) | 2017-09-04 | 2017-09-04 | Webpage personage subject correlation message extracting method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107590219A (en) |
Cited By (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108520007A (en) * | 2018-03-15 | 2018-09-11 | 江河瑞通(北京)技术有限公司 | Web page information extracting method, storage medium and computer equipment |
CN108829696A (en) * | 2018-04-18 | 2018-11-16 | 西安理工大学 | Towards knowledge mapping node method for auto constructing in metro design code |
CN108920434A (en) * | 2018-06-06 | 2018-11-30 | 武汉酷犬数据科技有限公司 | A kind of general Web page subject method for extracting content and system |
CN109325197A (en) * | 2018-08-17 | 2019-02-12 | 百度在线网络技术(北京)有限公司 | Method and apparatus for extracting information |
CN109710833A (en) * | 2018-12-29 | 2019-05-03 | 上海蜜度信息技术有限公司 | For determining the method and apparatus of content node |
CN109977370A (en) * | 2019-03-19 | 2019-07-05 | 河海大学常州校区 | It is a kind of based on the question and answer of document collection partition to method for auto constructing |
CN110110193A (en) * | 2019-04-24 | 2019-08-09 | 北京百炼智能科技有限公司 | A kind of information processing method, device and computer readable storage medium |
CN110232125A (en) * | 2019-06-11 | 2019-09-13 | 吉林大学 | A method of it carrying out academic people information and extracts and polymerize |
JP2020027649A (en) * | 2018-08-15 | 2020-02-20 | ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド | Method, apparatus, device and storage medium for generating entity relationship data |
CN111625749A (en) * | 2020-06-01 | 2020-09-04 | 深圳市小满科技有限公司 | Method, device, equipment and medium for extracting detail page information of participating company website |
CN111698364A (en) * | 2020-06-19 | 2020-09-22 | 深圳市小满科技有限公司 | Contact person information extraction method and related equipment |
CN111966932A (en) * | 2019-05-20 | 2020-11-20 | 富士通株式会社 | Information processing method and information processing apparatus |
CN112287273A (en) * | 2020-10-27 | 2021-01-29 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
CN114201971A (en) * | 2021-12-13 | 2022-03-18 | 海南港航控股有限公司 | Method and system for extracting character attributes from webpage |
CN116127079A (en) * | 2023-04-20 | 2023-05-16 | 中电科大数据研究院有限公司 | Text classification method |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104933027A (en) * | 2015-06-12 | 2015-09-23 | 华东师范大学 | Open Chinese entity relation extraction method using dependency analysis |
CN106202259A (en) * | 2016-06-29 | 2016-12-07 | 合肥民众亿兴软件开发有限公司 | A kind of info web extracting method based on body thought |
Non-Patent Citations (1)
Title |
---|
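苏小鲁 (Su Xiaolu): "基于DOM的HTML网页正文信息抽取模块的设计与实现" (Design and Implementation of a DOM-Based HTML Web Page Body Text Extraction Module), 《中国优秀硕士学位论文全文数据库信息科技辑》 (China Master's Theses Full-text Database, Information Science and Technology) * |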
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108520007A (en) * | 2018-03-15 | 2018-09-11 | 江河瑞通(北京)技术有限公司 | Web page information extracting method, storage medium and computer equipment |
CN108520007B (en) * | 2018-03-15 | 2021-09-28 | 江河瑞通(北京)技术有限公司 | Web page information extracting method, storage medium and computer equipment |
CN108829696A (en) * | 2018-04-18 | 2018-11-16 | 西安理工大学 | Towards knowledge mapping node method for auto constructing in metro design code |
CN108829696B (en) * | 2018-04-18 | 2019-10-25 | 西安理工大学 | Towards knowledge mapping node method for auto constructing in metro design code |
CN108920434A (en) * | 2018-06-06 | 2018-11-30 | 武汉酷犬数据科技有限公司 | A kind of general Web page subject method for extracting content and system |
CN108920434B (en) * | 2018-06-06 | 2022-08-30 | 武汉酷犬数据科技有限公司 | Universal webpage theme content extraction method and system |
JP2020027649A (en) * | 2018-08-15 | 2020-02-20 | ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド | Method, apparatus, device and storage medium for generating entity relationship data |
US11321421B2 (en) | 2018-08-15 | 2022-05-03 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, apparatus and device for generating entity relationship data, and storage medium |
CN109325197B (en) * | 2018-08-17 | 2022-07-15 | 百度在线网络技术(北京)有限公司 | Method and device for extracting information |
CN109325197A (en) * | 2018-08-17 | 2019-02-12 | 百度在线网络技术(北京)有限公司 | Method and apparatus for extracting information |
CN109710833B (en) * | 2018-12-29 | 2021-07-16 | 上海蜜度信息技术有限公司 | Method and apparatus for determining content node |
CN109710833A (en) * | 2018-12-29 | 2019-05-03 | 上海蜜度信息技术有限公司 | For determining the method and apparatus of content node |
CN109977370B (en) * | 2019-03-19 | 2023-06-16 | 河海大学常州校区 | Automatic question-answer pair construction method based on document structure tree |
CN109977370A (en) * | 2019-03-19 | 2019-07-05 | 河海大学常州校区 | It is a kind of based on the question and answer of document collection partition to method for auto constructing |
CN110110193A (en) * | 2019-04-24 | 2019-08-09 | 北京百炼智能科技有限公司 | A kind of information processing method, device and computer readable storage medium |
CN111966932A (en) * | 2019-05-20 | 2020-11-20 | 富士通株式会社 | Information processing method and information processing apparatus |
CN110232125B (en) * | 2019-06-11 | 2020-10-02 | 吉林大学 | Method for extracting and aggregating academic figure information |
CN110232125A (en) * | 2019-06-11 | 2019-09-13 | 吉林大学 | A method of it carrying out academic people information and extracts and polymerize |
CN111625749A (en) * | 2020-06-01 | 2020-09-04 | 深圳市小满科技有限公司 | Method, device, equipment and medium for extracting detail page information of participating company website |
CN111625749B (en) * | 2020-06-01 | 2023-08-11 | 深圳市小满科技有限公司 | Method, device, equipment and medium for extracting website detail page information of participant company |
CN111698364A (en) * | 2020-06-19 | 2020-09-22 | 深圳市小满科技有限公司 | Contact person information extraction method and related equipment |
CN112287273A (en) * | 2020-10-27 | 2021-01-29 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
CN112287273B (en) * | 2020-10-27 | 2022-09-30 | 中国科学院计算技术研究所 | Method, system and storage medium for classifying website list pages |
CN114201971A (en) * | 2021-12-13 | 2022-03-18 | 海南港航控股有限公司 | Method and system for extracting character attributes from webpage |
CN116127079A (en) * | 2023-04-20 | 2023-05-16 | 中电科大数据研究院有限公司 | Text classification method |
CN116127079B (en) * | 2023-04-20 | 2023-06-20 | 中电科大数据研究院有限公司 | Text classification method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107590219A (en) | Webpage personage subject correlation message extracting method | |
Gatterbauer et al. | Towards domain-independent information extraction from web tables | |
CN102737013B (en) | Equipment and the method for statement emotion is identified based on dependence | |
Zheng et al. | Template-independent news extraction based on visual consistency | |
CN101727498A (en) | Automatic extraction method of web page information based on WEB structure | |
CN105975454A (en) | Chinese word segmentation method and device of webpage text | |
US8577887B2 (en) | Content grouping systems and methods | |
CN102609427A (en) | Public opinion vertical search analysis system and method | |
CN107577671A (en) | A kind of key phrases extraction method based on multi-feature fusion | |
CN112101004B (en) | General webpage character information extraction method based on conditional random field and syntactic analysis | |
Cardoso et al. | An efficient language-independent method to extract content from news webpages | |
CN113196278A (en) | Method for training a natural language search system, search system and corresponding use | |
CN114997288A (en) | Design resource association method | |
CN115017903A (en) | Method and system for extracting key phrases by combining document hierarchical structure with global local information | |
JP2007047974A (en) | Information extraction device and information extraction method | |
CN106372232B (en) | Information mining method and device based on artificial intelligence | |
CN112711666B (en) | Futures label extraction method and device | |
CN107145591A (en) | A kind of effective content metadata extracting method of webpage based on title | |
JP2006309347A (en) | Method, system, and program for extracting keyword from object document | |
CN110020024B (en) | Method, system and equipment for classifying link resources in scientific and technological literature | |
Souza et al. | ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF | |
Pembe et al. | A tree-based learning approach for document structure analysis and its application to web search | |
Bauer et al. | Fiasco: Filtering the internet by automatic subtree classification, osnabruck | |
Kim et al. | Annotated Bibliographical Reference Corpora in Digital Humanities. | |
CN112347353A (en) | Webpage denoising method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20180116 |