CN103473285A - Web information extraction method and device based on location markers - Google Patents

Web information extraction method and device based on location markers

Info

Publication number
CN103473285A
Authority
CN
China
Prior art keywords
attribute
marked
mark
label
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2013103853730A
Other languages
Chinese (zh)
Other versions
CN103473285B (en
Inventor
徐锐波
付赟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd and Qizhi Software Beijing Co Ltd
Priority to CN201310385373.0A
Publication of CN103473285A
Application granted
Publication of CN103473285B
Legal status: Active (current)
Anticipated expiration

Abstract

The invention discloses a web page information extraction method and device based on location markers. The method includes: acquiring a training page in which at least one attribute is marked, the content of the marked attribute corresponding to the text content to be extracted from a web page; acquiring the prefix tags of multiple attributes in the training page, where the prefix tags of an attribute are all tags between that attribute and the previous attribute, and the multiple attributes include the marked attribute; selecting a start marker for the marked attribute from the prefix tags of the marked attribute; selecting an end marker from the tags after the marked attribute; and searching for the start marker and the end marker in the web page to be extracted, and extracting the attribute content between them to obtain the information it contains. This technical scheme avoids the low efficiency and high error rate of manually written extraction rules and improves the efficiency of web page information extraction.

Description

Web page information extraction method and device based on location markers
Technical field
The present invention relates to the Internet field, and in particular to a web page information extraction method and device based on location markers.
Background
Web page information extraction is the process of extracting target information from the text of a web page and organizing it into structured data.
Because a web page is structured to some degree, extracting information from it differs from extracting information from plain text. This structure also brings drawbacks. In a web page, data is usually split up by tags, and a complete sentence is often interspersed with tags that contribute nothing to the sentence itself, so the original meaning of the sentence cannot be recovered directly. As a result, traditional text information extraction techniques based on natural language processing cannot be applied directly to web page information extraction.
Existing web page information extraction techniques rely mainly on manually written extraction rules: a programmer analyzes a web page and its source code, identifies regularities, and then writes code that extracts the target data according to those regularities. Manual rule extraction has the following drawbacks:
1. When a large number of websites must be crawled, writing extraction rules for each website is a heavy workload, and the programming work is affected by subjective factors and carries a certain error rate;
2. When a web page is redesigned, its page structure may change, invalidating the previously written rules, so the rule-writing work must be repeated, which seriously affects efficiency.
No effective solution has yet been proposed in the prior art for the heavy workload and low efficiency of extracting web page information with manually written rules.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a web page information extraction device, and a corresponding web page information extraction method, that overcome or at least partially solve the above problems.
According to one aspect of the present invention, a web page information extraction method based on location markers is provided. The method comprises the following steps: acquiring a training page in which at least one attribute is marked, the content of the marked attribute corresponding to the text content to be extracted from a web page; acquiring the prefix tags of a plurality of attributes in the training page, where the prefix tags of an attribute comprise all tags between that attribute and the previous attribute, and the plurality of attributes include the marked attribute; selecting a start marker for the marked attribute from the prefix tags of the marked attribute; selecting an end marker from the tags after the marked attribute; and searching for the start marker and the end marker in the web page to be extracted, and extracting the attribute content between the start marker and the end marker to obtain the information contained in the attribute content.
Further, the start marker is a tag or tag combination in the prefix tags of the marked attribute that satisfies the following condition: among the prefix tags of the plurality of attributes, the tag or tag combination appears only in the prefix tags of the marked attribute.
Optionally, selecting the start marker for the marked attribute from the prefix tags of the marked attribute comprises: taking the tag immediately preceding the marked attribute as a candidate marker; determining whether the candidate marker is unique among the prefix tags of the plurality of attributes; if so, using the candidate marker as the start marker; if not, combining the tag immediately preceding the candidate marker with the candidate marker and taking the result as the new candidate marker, until a start marker is selected or every tag combination in the prefix tags of the marked attribute has been found not to be unique.
Further, before the tag immediately preceding the marked attribute is taken as the candidate marker, the method also comprises: sorting the prefix tags of the marked attribute by their distance from the marked attribute, so that the tag immediately preceding the marked attribute is ranked first.
Optionally, the step of acquiring the training page comprises: obtaining the text content to be extracted and using it as the target attribute value; selecting a web page from the target website as the training page; and searching the training page for an attribute identical or similar to the target attribute value, and using that attribute as the marked attribute.
Further, searching the training page for an attribute identical or similar to the target attribute value comprises: determining whether the training page contains an attribute whose value is identical to the target attribute value; if so, using that attribute as the marked attribute; if not, segmenting the target attribute value and deriving the attribute similar to the target attribute value from the similarity between the text lines of the training page and the segmented attribute value.
Optionally, the step of segmenting the target attribute value comprises: removing all tags from the training page to obtain an array of the text lines of the page; calculating the average length of the text lines in the array; and segmenting the target attribute value according to the average length.
Optionally, deriving the attribute similar to the target attribute value from the similarity between the text lines of the training page and the segmented attribute value comprises: calculating the similarity between each text line of the training page and the segmented attribute value; selecting the one or more text lines with the highest similarity; determining, for each selected text line, whether its similarity is greater than a preset threshold, and combining each selected text line with its adjacent text lines and determining whether the similarity of the combined text to the segmented attribute value increases; and selecting the text lines for which both determinations are affirmative as the attribute similar to the target attribute value.
According to another aspect of the present invention, a web page information extraction device is provided. The device comprises: a first acquisition module, configured to acquire a training page in which at least one attribute is marked, the content of the marked attribute corresponding to the text content to be extracted from a web page; a second acquisition module, configured to acquire the prefix tags of a plurality of attributes in the training page, where the prefix tags of an attribute comprise all tags between that attribute and the previous attribute, and the plurality of attributes include the marked attribute; a first selection module, configured to select a start marker for the marked attribute from the prefix tags of the marked attribute; a second selection module, configured to select an end marker from the tags after the marked attribute; and an information extraction module, configured to search for the start marker and the end marker in the web page to be extracted and to extract the attribute content between the start marker and the end marker, so as to obtain the information contained in the attribute content.
Optionally, the first selection module comprises: a first definition submodule, configured to take the tag immediately preceding the marked attribute as a candidate marker; a judgment submodule, configured to determine whether the candidate marker is unique among the prefix tags of the plurality of attributes; a second definition submodule, configured to, when the output of the judgment submodule is negative, combine the tag immediately preceding the candidate marker with the candidate marker and take the result as the new candidate marker; and a marking submodule, configured to, when the output of the judgment submodule is affirmative, use the candidate marker as the start marker.
Further, the first acquisition module comprises: a target attribute value acquisition submodule, configured to obtain the text content to be extracted and use it as the target attribute value; a training page acquisition submodule, configured to select a web page from the target website as the training page; and a query submodule, configured to search the training page for an attribute identical or similar to the target attribute value and use that attribute as the marked attribute.
With the technical scheme of the present invention, a start marker and an end marker are selected from the tags before and after the marked attribute respectively, and an extraction rule for web page information based on location markers is generated automatically from the resulting start and end markers. This avoids the low efficiency and high error rate of manually written extraction rules and improves the efficiency of web page information extraction.
In addition, as a further improvement, the training page is searched for an attribute identical or similar to the original target attribute value, which is used as the marked attribute. In this way the original attribute value can be used to mark the position of the new attribute value in a new page structure, and training can be repeated to derive an extraction rule that fits the new page structure. When the page structure changes, the rule can therefore be revised automatically, without manual intervention, to produce a new extraction rule. This further reduces labor cost and the loss caused by extracting erroneous information.
The above is only an overview of the technical scheme of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 is a schematic diagram of a web page information extraction device based on location markers according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a web page information extraction device based on location markers according to another embodiment of the present invention;
Fig. 3 is a schematic diagram of a web page information extraction method based on location markers according to an embodiment of the present invention;
Fig. 4 is a flowchart of searching for the start marker in a web page information extraction method based on location markers according to another embodiment of the present invention; and
Fig. 5 is a flowchart of calculating the marked attribute in a web page information extraction method based on location markers according to yet another embodiment of the present invention.
Detailed description
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. The structure required to construct such a system is apparent from the description above. Furthermore, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in various programming languages, and that any description of a specific language above is given to disclose the best mode of the invention.
The web page information extraction device based on location markers provided by this embodiment generates web page information extraction rules from a small number of manually marked target attributes. Fig. 1 is a schematic diagram of a web page information extraction device based on location markers according to an embodiment of the present invention. The device comprises: a first acquisition module 110, configured to acquire a training page in which at least one attribute is marked, the content of the marked attribute corresponding to the text content to be extracted from a web page; a second acquisition module 120, configured to acquire the prefix tags of a plurality of attributes in the training page, where the prefix tags of an attribute comprise all tags between that attribute and the previous attribute, and the plurality of attributes include the marked attribute; a first selection module 130, configured to select a start marker for the marked attribute from the prefix tags of the marked attribute; a second selection module 140, configured to select an end marker from the tags after the marked attribute; and an information extraction module 150, configured to search for the start marker and the end marker in the web page to be extracted and to extract the attribute content between the start marker and the end marker, so as to obtain the information contained in the attribute content.
In the device of the above embodiment, the training page acquired by the first acquisition module 110 has at least one attribute marked manually, and the marked attribute content corresponds to the text content to be extracted. The first selection module 130 and the second selection module 140 select the start marker and the end marker respectively as location markers, from which a web page extraction rule based on location markers can be generated. Using this rule, the information extraction module 150 extracts the attribute content between the start marker and the end marker in the web page to be extracted and obtains the information it contains. The device in this embodiment requires manual marking of only a small number of training pages; for a given website, for example, only one page needs to be marked. Compared with writing web page extraction rules by hand, marking content involves far less work and demands less expertise.
Fig. 2 is a schematic diagram of a web page information extraction device based on location markers according to another embodiment of the present invention. In the device of this embodiment, the first selection module 130 specifically comprises: a first definition submodule 131, configured to take the tag immediately preceding the marked attribute as a candidate marker; a judgment submodule 132, configured to determine whether the candidate marker is unique among the prefix tags of the plurality of attributes; a second definition submodule 133, configured to, when the output of the judgment submodule 132 is negative, combine the tag immediately preceding the candidate marker with the candidate marker and take the result as the new candidate marker; and a marking submodule 134, configured to, when the output of the judgment submodule 132 is affirmative, use the candidate marker as the start marker. The start marker selected by the first selection module 130 is a tag or tag combination in the prefix tags of the marked attribute that satisfies the following condition: among the prefix tags of the plurality of attributes, the tag or tag combination used as the start marker appears only in the prefix tags of the marked attribute. Because of this uniqueness, the start marker alone is enough to locate the content to be extracted within a web page.
To make it easier for the second definition submodule 133 to combine tags, the first selection module 130 may perform the following preparatory operation before the tag immediately preceding the marked attribute is taken as the candidate marker: sorting the prefix tags of the marked attribute by their distance from the marked attribute, so that the tag immediately preceding the marked attribute is ranked first. After sorting, all prefix tags are arranged in an array by their distance from the attribute, and tags can be conveniently combined from this array to define candidate markers.
In addition, the first acquisition module 110 may specifically comprise: a target attribute value acquisition submodule 111, configured to obtain the text content to be extracted and use it as the target attribute value; a training page acquisition submodule 112, configured to select a web page from the target website as the training page; and a query submodule 113, configured to search the training page for an attribute identical or similar to the target attribute value and use that attribute as the marked attribute.
When the page structure of a website changes, the position of an attribute value, or what precedes and follows the attribute, may change. In that case, the training page acquisition submodule 112 can obtain a new training page, and the query submodule 113 searches the attributes of the new training page for a match to the target attribute value to obtain the new marked attribute. Thus, even when the original target attribute value and the attribute to be marked are not completely identical, the above web page extraction device can still be used to extract web page text content. This solves the problem that, in the new training page, the attribute value corresponding to the original target attribute value is not exactly the same, or the original attribute value is split by multiple tags.
The specific procedure by which the query submodule 113 searches the training page for an attribute identical or similar to the target attribute value may comprise: determining whether the training page contains an attribute whose value is identical to the target attribute value; if so, using that attribute as the marked attribute; if not, segmenting the target attribute value and deriving the attribute similar to the target attribute value from the similarity between the text lines of the training page and the segmented attribute value.
The above segmentation may be performed according to the average length of the text lines in the web page. The step of deriving the attribute similar to the target attribute value from the similarity between the text lines of the training page and the segmented attribute value may then proceed as follows: calculating the similarity between each text line of the training page and the segmented attribute value; selecting the one or more text lines with the highest similarity; determining, for each selected text line, whether its similarity is greater than a preset threshold, and combining each selected text line with its adjacent text lines and determining whether the similarity of the combined text to the segmented attribute value increases; and selecting the text lines for which both determinations are affirmative as the attribute similar to the target attribute value. The corresponding marked attribute in the new training page is thereby obtained.
The workflow of the web page information extraction device of the above embodiments is further described below in connection with the web page information extraction method based on location markers. The method introduced in the following embodiments may be executed by the device of any of the above embodiments. It should be noted that, during execution, the steps of the method in the following embodiments may be flexibly assigned to the above modules, or executed after the above modules have been adjusted, and do not depend on a fixed module structure.
Fig. 3 is a schematic diagram of a web page information extraction method based on location markers according to an embodiment of the present invention. As shown in the figure, the method comprises the following steps:
Step S301, acquire a training page in which at least one attribute is marked, the content of the marked attribute corresponding to the text content to be extracted from a web page;
Step S303, acquire the prefix tags of a plurality of attributes in the training page, where the prefix tags of an attribute comprise all tags between that attribute and the previous attribute, and the plurality of attributes include the marked attribute;
Step S305, select a start marker for the marked attribute from the prefix tags of the marked attribute;
Step S307, select an end marker from the tags after the marked attribute;
Step S309, search for the start marker and the end marker in the web page to be extracted, and extract the attribute content between the start marker and the end marker to obtain the information contained in the attribute content.
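The final extraction step (S309) can be sketched as a plain string search. This is a minimal illustration and not the patented implementation: the function name `extract_between`, the sample page and the chosen markers are all assumptions made for the example.

```python
from typing import Optional


def extract_between(html: str, start_marker: str, end_marker: str) -> Optional[str]:
    """Return the text between the first start_marker and the next end_marker."""
    start = html.find(start_marker)
    if start == -1:
        return None  # start marker absent: the rule does not apply to this page
    content_begin = start + len(start_marker)
    end = html.find(end_marker, content_begin)
    if end == -1:
        return None  # no end marker after the start marker
    return html[content_begin:end]


# Hypothetical page with the same structure as the training page.
page = "<ul><li><p>name:360safe</p></li><li><span>size:10MB</span></li></ul>"
result = extract_between(page, "<li><p>", "</p>")
```

Returning `None` when either marker is missing gives the caller a simple signal that the page structure may have changed and the rule needs retraining.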
In this embodiment, the method for position-based mark is a plurality of attributes or all in the prefix label of attributes, pick out the start mark that the mark attribute has uniqueness from the training page, thereby when the to be extracted webpage identical with the training page structure carried out to information extraction, can just can obtain the position of attribute in webpage that comprises Extracting Information by start mark, thereby extract the property value of this attribute, the text message that obtains comprising in property content.
Owing to obtaining the prefix label of each attribute in step S303, the prefix label of next adjacent attribute is exactly the label after current attribute, therefore in step S307, the label after being marked attribute also can obtain from step S303, and this end mark carried out mark to the ending of attribute, guaranteed to be marked the integrality of attribute or attribute to be extracted and without unnecessary content.
From above explanation, can find out, start mark can be preferably label or the tag combination in the prefix label that is marked attribute, this label or tag combination meet the following conditions as start mark: in the prefix label of a plurality of attributes, label or tag combination only appear in the prefix label that is marked attribute.
In the prefix label, find the idiographic flow of the start mark that is marked attribute can adopt the mode of iteration to carry out, for example, adjacent label before being marked attribute is designated as to alternative mark; Judge alternative mark whether in the prefix label of a plurality of attributes unique, if, using alternative mark as start mark, if not, label and alternative mark that will be before adjacent with alternative mark be combined, combined result is designated as new alternative mark, until pick out start mark or the tag combination that is marked in the prefix label of attribute all not unique.
In order in iterative process, can promptly to combine new tag combination, before will being marked attribute, adjacent label can also comprise before being designated as alternative mark: the prefix label that will be marked attribute is sorted according to the proximity relations distance with being marked attribute, wherein is marked the label that adjacent label before attribute is designated as sequence first.Thereby the prefix label is carried out to pre-service, the prefix label is formed to queue, be attached to successively in order in alternative mark and just can obtain new alternative mark.
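The iterative search described above can be sketched as follows. The sketch assumes that each attribute's prefix tags are already available as a list of tag strings in document order, and it tests uniqueness by substring search over the concatenated tags of the other attributes; both the data layout and the name `choose_start_marker` are illustrative assumptions.

```python
from typing import List, Optional


def choose_start_marker(prefix_tags: List[List[str]], marked: int) -> Optional[str]:
    """Grow a candidate marker backwards from the marked attribute until it is unique."""
    own = prefix_tags[marked]
    # Concatenated prefix tags of every other attribute, used for the uniqueness test.
    others = ["".join(tags) for i, tags in enumerate(prefix_tags) if i != marked]
    # Queue of prefix tags with the tag nearest to the marked attribute first.
    nearest_first = list(reversed(own))
    for n in range(1, len(nearest_first) + 1):
        # Take the n nearest tags and restore document order.
        candidate = "".join(reversed(nearest_first[:n]))
        if all(candidate not in other for other in others):
            return candidate
    return None  # every combination also occurs elsewhere: no marker found


# Hypothetical prefix tags for three attributes; the third one is marked.
prefix = [
    ["<ul>", "<li>", "<p>"],
    ["</p>", "<p>"],
    ["</p>", "</li>", "<li>", "<p>"],
]
marker = choose_start_marker(prefix, 2)
```

In this example `<p>` and `<li><p>` both occur in the prefix tags of other attributes, so the candidate grows to three tags before it becomes unique.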
If the page structure of a website changes, the position of an attribute value or the tags before and after it may change, so that an attribute identical to the manually marked attribute value can no longer be found directly in the changed page structure. To address this, the method of this embodiment can be further optimized: after a change in page structure is detected, the new page is trained on the basis of the original historical attribute value to produce a new extraction rule.
In this optimized web page extraction method, step S301 above may specifically comprise: obtaining the text content to be extracted and using it as the target attribute value; selecting a web page from the target website as the training page; and searching the training page for an attribute identical or similar to the target attribute value, and using that attribute as the marked attribute.
The above target attribute value may be an attribute value from the page structure before the redesign, that is, the original target attribute marked manually. By finding the identical or similar attribute in the training page after the redesign and using it as the marked attribute, and then continuing with steps S303 to S309 above, the corresponding text information can be extracted from the new page.
Searching the training page for an attribute identical or similar to the target attribute value may specifically comprise: determining whether the training page contains an attribute whose value is identical to the target attribute value; if so, using that attribute as the marked attribute; if not, segmenting the target attribute value and deriving the attribute similar to the target attribute value from the similarity between the text lines of the training page and the segmented attribute value.
The above segmentation may use the average length of the text lines in the web page. Under this approach, the step of segmenting the target attribute value comprises: removing all tags from the training page to obtain an array of the text lines of the page; calculating the average length of the text lines in the array; and segmenting the target attribute value according to the average length.
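A minimal sketch of this segmentation step is given below. It assumes that a "text line" is any non-empty run of text left after tags are removed, and that the average is rounded down to an integer; the function names and the sample page are hypothetical.

```python
import re
from typing import List


def text_lines(html: str) -> List[str]:
    """Strip all tags and return the non-empty text lines of the page."""
    stripped = re.sub(r"<[^>]+>", "\n", html)
    return [line.strip() for line in stripped.splitlines() if line.strip()]


def split_by_average_length(target: str, html: str) -> List[str]:
    """Cut the target attribute value into pieces of the page's average line length."""
    lines = text_lines(html)
    if not lines:
        return [target]  # nothing to measure against: keep the value whole
    average = max(1, sum(len(line) for line in lines) // len(lines))
    return [target[i:i + average] for i in range(0, len(target), average)]


# Hypothetical redesigned page in which the old value is split across two tags.
page = "<div><p>Free antivirus</p><p>for your PC</p></div>"
pieces = split_by_average_length("Free antivirus for your PC", page)
```

Cutting the old value to roughly the length of a typical line lets each piece be compared against individual lines of the new page, which is useful when the redesign has split the value across several tags.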
Optionally, deriving the attribute similar to the target attribute value from the similarity between the text lines of the training page and the segmented attribute value comprises: calculating the similarity between each text line of the training page and the segmented attribute value; selecting the one or more text lines with the highest similarity; determining, for each selected text line, whether its similarity is greater than a preset threshold, and combining each selected text line with its adjacent text lines and determining whether the similarity of the combined text to the segmented attribute value increases; and selecting the text lines for which both determinations are affirmative as the attribute similar to the target attribute value.
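One possible reading of this matching step is sketched below. The source does not name a similarity measure, so the sketch substitutes `difflib.SequenceMatcher.ratio` from the Python standard library; the threshold of 0.5, the merging of neighbouring lines by plain concatenation, and the function names are likewise assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in measure."""
    return SequenceMatcher(None, a, b).ratio()


def find_close_attribute(lines: List[str], target: str, pieces: List[str],
                         threshold: float = 0.5) -> Optional[str]:
    """Pick the best-matching text line and merge neighbours while that helps."""
    if not lines or not pieces:
        return None
    # Score each line by its best similarity to any piece of the target value.
    scores = [max(similarity(line, piece) for piece in pieces) for line in lines]
    best = max(range(len(lines)), key=scores.__getitem__)
    if scores[best] <= threshold:
        return None  # nothing on the page is close enough
    lo, hi, current = best, best, lines[best]
    improved = True
    while improved:
        improved = False
        # Try absorbing the following line, then the preceding line.
        if hi + 1 < len(lines):
            merged = current + lines[hi + 1]
            if similarity(merged, target) > similarity(current, target):
                current, hi, improved = merged, hi + 1, True
        if lo > 0:
            merged = lines[lo - 1] + current
            if similarity(merged, target) > similarity(current, target):
                current, lo, improved = merged, lo - 1, True
    return current


# Hypothetical text lines of a redesigned page that splits the old value in two.
lines = ["Downloads", "name:", "360safe", "size:10MB"]
found = find_close_attribute(lines, "name:360safe", ["name:3", "60safe"])
```

The merged text returned here would then serve as the marked attribute of the new training page, after which start and end markers can be selected as before.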
Method for abstracting web page information below in conjunction with two concrete application examples to the position-based mark of above embodiment is remarked additionally.
In the first application example, need the handmarking to go out property value, the position mark that comprises start mark and end mark can be automatically generated, thereby the property location that needs Extracting Information can be in webpage, oriented.
For example in the html source code of the webpage of the website that needs Extracting Information, exist with next section:<li<p > name:360safe</p ></li >.In this section source code, mark<li >,<p > or<li ><p > can be as a mark.And so-called decimation rule refers to and respectively comprises a series of mark in the front and back of property value name:360safe, and can, from original web page, obtain needing the property value name:360safe extracted according to these marks.In the present embodiment, the mark be positioned at before the attribute of content to be extracted is called beginning label; The attribute mark afterwards that is positioned at content to be extracted becomes end mark.
Just need by the property value of artificial mark in order to extract the information needed, find out beginning label and the end mark that can locate this property value.
At first, at random from the webpage of handmarking's property value, select a page as the training page, the all labels that obtain in this training page form a plurality of prefix label arrays, the number of array is consistent with the attribute number in webpage, and wherein prefix label array is comprised of all labels between adjacent attribute before an attribute and this attribute.
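The construction of the prefix-tag arrays can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `prefix_tag_arrays`, the regular expression used to recognise tags, and the choice to start the first attribute's prefix at the beginning of the page are all assumptions made for the example.

```python
import re
from typing import List

# Assumption: a "tag" is any run of characters of the form <...>.
TAG = re.compile(r"<[^>]+>")


def prefix_tag_arrays(page: str, attribute_values: List[str]) -> List[List[str]]:
    """Return one prefix-tag array per attribute value.

    attribute_values must be given in the order in which they occur in the page.
    The prefix of the first attribute starts at the beginning of the page
    (an assumption; the text only defines the prefix between adjacent attributes).
    """
    arrays = []
    cursor = 0  # end position of the previous attribute value
    for value in attribute_values:
        pos = page.index(value, cursor)  # raises ValueError if the value is missing
        arrays.append(TAG.findall(page[cursor:pos]))
        cursor = pos + len(value)
    return arrays


page = "<ul><li><p>name:360safe</p></li><li><p>size:12MB</p></li></ul>"
arrays = prefix_tag_arrays(page, ["name:360safe", "size:12MB"])
```

For the two-attribute page above, the first array holds the tags from the start of the page up to name:360safe, and the second holds the tags between the two attribute values.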
Next, a tag or tag combination that is unique in the page must be selected from the prefix-tag array of the marked attribute to serve as the start mark. Fig. 4 is a flowchart of searching for the start mark in the web page information extraction method based on position marks according to another embodiment of the invention. As shown in the flowchart, step S401 is performed first: the tags in the prefix tags of the marked attribute are sorted in reverse order by their distance from the marked attribute, so that the nearest tag comes first; then step S402 is performed;
Step S402: initialise the parameters by setting the sort index N = 1, and take the tag with sequence number 1 as the initial candidate mark; then perform step S403;
Step S403: judge whether the candidate mark is unique among the prefix tags of all attributes; if yes, perform step S404; if no, perform step S405;
Step S404: take the current candidate mark as the start mark and end the search;
Step S405: increment the sort index, N = N + 1; then perform step S406;
Step S406: judge whether the current sort index exceeds the number of prefix tags of the marked attribute, that is, whether every prefix-tag combination of the marked attribute is non-unique in the page. If so, no suitable start mark can be found with the marked attribute; the search ends and a message that the algorithm has failed is output. If not, perform step S407;
Step S407: combine the N-th tag with the candidate mark and record the result as the new candidate mark, that is, extend the current candidate mark into a new one; then return to step S403 for the next iteration.
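Steps S401 to S407 can be sketched in a few lines. The function name `find_start_mark` and the representation of each attribute's prefix tags as a list of strings are assumptions for illustration; uniqueness is tested here against the concatenated prefix tags of all attributes, which is one reading of step S403.

```python
from typing import List, Optional


def find_start_mark(prefix_tags: List[List[str]], marked: int) -> Optional[str]:
    """Return the shortest tag combination adjacent to the marked attribute
    that occurs nowhere else in the prefix tags, or None if none exists."""
    # S401: reverse the order so the tag nearest the marked attribute comes first.
    ordered = list(reversed(prefix_tags[marked]))
    # Each attribute's prefix tags, concatenated in document order.
    joined = ["".join(tags) for tags in prefix_tags]
    own = joined[marked]

    candidate = ""
    for tag in ordered:  # S402, S405 and S407: extend the candidate one tag at a time
        candidate = tag + candidate
        # S403: the candidate must not occur in any other attribute's prefix tags,
        # and its first occurrence in its own prefix must be the trailing one.
        in_others = any(candidate in j for i, j in enumerate(joined) if i != marked)
        only_at_end = own.find(candidate) == len(own) - len(candidate)
        if not in_others and only_at_end:
            return candidate  # S404: unique, so use it as the start mark
    return None  # S406: every combination was non-unique; the algorithm fails


prefix = [["<ul>", "<li>", "<p>"], ["</p>", "</li>", "<li>", "<p>"]]
first = find_start_mark(prefix, 0)
second = find_start_mark(prefix, 1)
```

In this example neither <p> nor <li><p> is unique, so the candidate for each attribute grows until it includes a tag that the other attribute's prefix lacks.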
After the start mark has been obtained, the first tag immediately following the attribute value can be used as the end mark. The start mark and the end mark together form the position-mark-based web page information extraction rule. When information is extracted with this rule, the start mark and the end mark are looked up in the web page, and the content between the start mark and the end mark is taken as the extracted content, which yields the required text.
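Choosing the end mark and applying the resulting rule can be sketched as follows. The helper names `find_end_mark` and `extract_between`, the tag-matching regular expression, and the sample pages are assumptions for illustration only.

```python
import re
from typing import Optional

# Assumption: a "tag" is any run of characters of the form <...>.
TAG = re.compile(r"<[^>]+>")


def find_end_mark(training_page: str, value: str) -> Optional[str]:
    """Return the first tag that follows the marked attribute value."""
    pos = training_page.find(value)
    if pos == -1:
        return None
    match = TAG.search(training_page, pos + len(value))
    return match.group(0) if match else None


def extract_between(page: str, start_mark: str, end_mark: str) -> Optional[str]:
    """Return the text between the start mark and the next end mark, or None."""
    start = page.find(start_mark)
    if start == -1:
        return None
    begin = start + len(start_mark)
    end = page.find(end_mark, begin)
    if end == -1:
        return None
    return page[begin:end].strip()


training_page = "<ul><li><p>name:360safe</p></li></ul>"
end_mark = find_end_mark(training_page, "name:360safe")

# Apply the rule (start mark "<li><p>", end mark from training) to another page.
other_page = "<ul><li><p>name:WinRAR</p></li></ul>"
extracted = extract_between(other_page, "<li><p>", end_mark)
```

The rule learned from one training page is applied unchanged to other pages of the same site, which is why both marks must be stable across pages with the same structure.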
The example <li><p>name:360safe</p></li> above illustrates the procedure. The prefix tags of the attribute name:360safe are <li><p>. After the reverse sorting of step S401, the <p> tag has sequence number 1 and the <li> tag has sequence number 2. In step S402, the <p> tag is taken as the initial candidate mark. Step S403 judges whether <p> is unique in the HTML document of the whole page; if it is, <p> is used as the start mark. If it is not, the candidate mark is extended to <li><p>, and it is judged whether the combination <li><p> is unique in the HTML document of the whole page; if it is, <li><p> is used as the start mark. If it is not, then because <li><p> already comprises all of the prefix tags, the iteration ends and the algorithm fails. The algorithm fails after two rounds here only because the example has just two prefix tags. In practice, an attribute whose information is to be extracted usually has many more prefix tags, so the loop may need more iterations.
The steps above generate the rule from manually marked attributes. Existing websites, however, often change their page layout, and marking by hand again after every redesign is a considerable workload. The web page information extraction method of another embodiment of the invention can therefore use the attribute marked before the redesign to compute the corresponding attribute in the page with the new structure. That is, the marked attribute in the new page is obtained by matching, and the position-mark search is repeated with this marked attribute to obtain the extraction rule for the new page.
When the page structure of a website changes, the position of the attribute and the tags before and after it may all change. The attribute that corresponds to the user-marked attribute must be found by matching in the new training page. That attribute is then used as the marked attribute, and the method described in the above embodiment is carried out. Since the steps for searching for position marks have already been described, only the process of finding the corresponding attribute in the new page, that is, matching to obtain the marked attribute, is described here.
The difficulties in matching the marked attribute are:
1. The originally marked target attribute value may not be identical to the new attribute value. For example, the original description fragment is "this software applies to the ios and android platforms", while the new description fragment is "this software applies to the ios, android and winphone platforms".
2. The original attribute value may be split by several tags. For example, the new description fragment is "this software applies to the<br>ios<br>android<br>winphone platforms".
In both cases, if the attribute values were required to match exactly, the new attribute value could not be marked from the old one.
Fig. 5 is a flowchart of computing the marked attribute in the web page information extraction method based on position marks according to another embodiment of the invention. As shown in the flowchart:
Step S501 is performed first: receive the target attribute value, which is the attribute value marked in the original page; then perform step S502;
Step S502: choose a training page from the target website; then perform the judgment of step S503;
Step S503: judge whether the training page contains an attribute identical to the target attribute value, that is, whether the redesign has left the original attribute value and tags unaffected. If it does, go directly to step S517; if not, perform steps S504 to S509 in sequence;
Step S504: remove all tags from the training page to obtain an array of the text lines in the page;
Step S505: segment the target attribute value according to the average length of the text lines in the array, giving P segments; initialise the segment index p = 1;
Step S506: calculate the similarity between the p-th segment of the target attribute value and each text line in the array. The similarity can be measured with parameters such as the Hamming distance or the edit distance; in general, the larger the edit distance or Hamming distance, the lower the similarity;
Step S507: choose the Q text lines with the highest similarity, whose similarities are S_1, S_2, ..., S_Q in turn, and initialise the text-line index q = 1. Typically the three most similar text lines are selected, so Q is generally 3;
Step S508: calculate the similarity between the p-th segment and the q-th text line combined with its preceding adjacent text line, denoted S_q';
Step S509: calculate the similarity between the p-th segment and the q-th text line combined with its following adjacent text line, denoted S_q'';
Step S510: judge whether S_q' < S_q and S_q'' < S_q; if so, perform step S511; if not, perform step S513;
Step S511: judge whether S_q is greater than the preset threshold, that is, whether the text line is similar to the target attribute value; if so, perform step S512; if not, perform step S513;
Step S512: take the q-th text line as the marked attribute, which completes the matching of the marked attribute;
Step S513: judge whether q + 1 >= Q. If so, all Q most similar text lines have been iterated over and the next segment of the target attribute value must be processed, so perform step S514; if not, perform step S516 to check whether the next candidate text line meets the requirements;
Step S514: judge whether p + 1 >= P. If so, all segments of the target attribute value have been iterated over; if no matching marked attribute has been found by this point, the algorithm fails and the computation ends. If not, perform step S515 to begin the matching computation for the next segment;
Step S515: increment p by 1 and return to step S506;
Step S516: increment q by 1 and return to step S508;
Step S517: take the attribute identical to the target attribute value as the marked attribute.
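The matching flow of Fig. 5 can be approximated by the sketch below. It is an illustration under several assumptions: the function names are hypothetical, the exact-match test of step S503 is simplified to a substring check, similarity is taken as 1 minus the edit distance divided by the longer length (the text names edit distance but gives no normalisation), and the threshold of 0.6 is an arbitrary placeholder.

```python
import re
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, lower as distance grows."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def text_lines(html: str) -> List[str]:
    """S504: strip all tags and keep the non-empty text lines."""
    stripped = re.sub(r"<[^>]+>", "\n", html)
    return [line.strip() for line in stripped.splitlines() if line.strip()]


def match_marked_attribute(html: str, target: str, q_max: int = 3,
                           threshold: float = 0.6) -> Optional[str]:
    if target in html:  # S503 and S517 (simplified to a substring test)
        return target
    lines = text_lines(html)
    if not lines:
        return None
    # S505: cut the target into segments of the average text-line length.
    avg = max(1, round(sum(len(line) for line in lines) / len(lines)))
    segments = [target[i:i + avg] for i in range(0, len(target), avg)]
    for seg in segments:  # loop over p (S514, S515)
        # S506 and S507: rank the lines by similarity and keep the best q_max.
        ranked = sorted(range(len(lines)),
                        key=lambda i: similarity(seg, lines[i]), reverse=True)[:q_max]
        for i in ranked:  # loop over q (S513, S516)
            s = similarity(seg, lines[i])
            # S508 and S509: similarity after joining with each neighbouring line.
            s_prev = similarity(seg, lines[i - 1] + lines[i]) if i > 0 else 0.0
            s_next = similarity(seg, lines[i] + lines[i + 1]) if i + 1 < len(lines) else 0.0
            if s_prev < s and s_next < s and s > threshold:  # S510 and S511
                return lines[i]  # S512
    return None  # no segment matched; the algorithm fails


new_page = "<ul><li>version 1.2.3</li><li>runs on ios, android</li><li>free download</li></ul>"
matched = match_marked_attribute(new_page, "runs on ios and android")
```

A production version would need a more careful exact-match test and a tuned threshold; the sketch only shows how segmentation, ranking and the neighbour-combination check fit together.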
With the above method, the position of the marked attribute in the new page structure can be marked from the original target attribute value. Position marks are then selected for the newly marked page according to the flow described earlier, and training on that page yields the new extraction rule.
In the device according to an embodiment of the invention, the first selection module comprises:
A first definition submodule, configured to record the tag immediately preceding the marked attribute as the candidate mark;
A judging submodule, configured to judge whether the candidate mark is unique among the prefix tags of the plurality of attributes;
A second definition submodule, configured to, when the output of the judging submodule is no, combine the tag immediately preceding the candidate mark with the candidate mark and record the combined result as the new candidate mark;
A marking submodule, configured to, when the output of the judging submodule is yes, take the candidate mark as the start mark.
In the device according to an embodiment of the invention, the first acquisition module comprises:
A target attribute value acquisition submodule, configured to obtain the text content to be extracted and take it as the target attribute value;
A training page acquisition submodule, configured to choose a web page from the target website as the training page;
A query submodule, configured to query the training page for an attribute identical or close to the target attribute value and take the identical or close attribute as the marked attribute.
With the technical solution of the invention, the workload of writing extraction rules by hand can be greatly reduced, and the errors that arise in hand-written extraction rules can be avoided. In addition, when the page structure changes, the rules can be revised automatically to produce new extraction rules without human involvement, which further reduces labour cost and the losses caused by extracting wrong information.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the invention may be practised without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adapted and arranged in one or more devices different from that embodiment. The modules, units or components of an embodiment can be combined into one module, unit or component, and can furthermore be divided into a plurality of submodules, subunits or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative feature serving the same, an equivalent or a similar purpose.
In addition, those skilled in the art will understand that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the web page extraction device according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A web page information extraction method based on position marks, comprising the following steps:
Obtaining a training page in which at least one attribute is marked, the content of the marked attribute corresponding to the text content to be extracted from the web page;
Obtaining the prefix tags of a plurality of attributes in the training page, wherein the prefix tags comprise all tags between the current attribute and the preceding attribute, and the plurality of attributes include the marked attribute;
Selecting the start mark of the marked attribute from the prefix tags of the marked attribute;
Selecting an end mark from the tags following the marked attribute;
Querying the start mark and the end mark in the web page to be extracted, and extracting the attribute content between the start mark and the end mark, so as to obtain the information contained in the attribute content.
2. The method according to claim 1, wherein the start mark is a tag or tag combination in the prefix tags of the marked attribute, and the tag or tag combination satisfies the following condition: among the prefix tags of the plurality of attributes, the tag or tag combination appears only in the prefix tags of the marked attribute.
3. The method according to claim 2, wherein selecting the start mark of the marked attribute from the prefix tags of the marked attribute comprises:
Recording the tag immediately preceding the marked attribute as the candidate mark;
Judging whether the candidate mark is unique among the prefix tags of the plurality of attributes, and if so, taking the candidate mark as the start mark;
If not, combining the tag immediately preceding the candidate mark with the candidate mark and recording the combined result as the new candidate mark, until the start mark has been selected or all tag combinations in the prefix tags of the marked attribute have been found to be non-unique.
4. The method according to claim 2, wherein, before the tag immediately preceding the marked attribute is recorded as the candidate mark, the method further comprises:
Sorting the prefix tags of the marked attribute by their proximity to the marked attribute, wherein the tag immediately preceding the marked attribute is the tag ranked first.
5. The method according to claim 1, wherein selecting an end mark from the tags following the marked attribute comprises:
Selecting the tag immediately following the marked attribute as its end mark.
6. The method according to any one of claims 1 to 5, wherein obtaining the training page comprises:
Obtaining the text content to be extracted, and taking the text content to be extracted as the target attribute value;
Choosing a web page from the target website as the training page;
Querying the training page for an attribute identical or close to the target attribute value, and taking the identical or close attribute as the marked attribute.
7. The method according to claim 6, wherein querying the training page for an attribute identical or close to the target attribute value comprises:
Judging whether the training page contains an attribute whose attribute value is identical to the target attribute value, and if so, taking the attribute identical to the target attribute value as the marked attribute;
If not, segmenting the target attribute value, and deriving the attribute close to the target attribute value according to the similarity between the text lines in the training page and the segmented attribute value.
8. The method according to claim 7, wherein segmenting the target attribute value comprises:
Removing all tags from the training page to obtain an array of the text lines in the page;
Calculating the average length of the text lines in the array;
Segmenting the target attribute value according to the average length.
9. The method according to claim 8, wherein deriving the attribute close to the target attribute value according to the similarity between the text lines in the training page and the segmented attribute value comprises:
Calculating, for each text line in the training page, its similarity to the segmented attribute value;
Selecting the one or more text lines with the highest similarity;
Judging, for each of the one or more text lines, whether its similarity is greater than a preset threshold, combining each of the one or more text lines with its adjacent text lines, and judging whether the similarity of the combined text to the segmented attribute value increases;
Selecting the text line for which the above judgment results are yes as the attribute close to the target attribute value.
10. A web page information extraction device, comprising the following modules:
A first acquisition module, configured to obtain a training page in which at least one attribute is marked, the content of the marked attribute corresponding to the text content to be extracted from the web page;
A second acquisition module, configured to obtain the prefix tags of a plurality of attributes in the training page, wherein the prefix tags comprise all tags between the current attribute and the preceding attribute, and the plurality of attributes include the marked attribute;
A first selection module, configured to select the start mark of the marked attribute from the prefix tags of the marked attribute;
A second selection module, configured to select an end mark from the tags following the marked attribute;
An information extraction module, configured to query the start mark and the end mark in the web page to be extracted, and to extract the attribute content between the start mark and the end mark, so as to obtain the information contained in the attribute content.
CN201310385373.0A 2013-08-29 2013-08-29 Web information extraction method and device based on location markers Active CN103473285B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310385373.0A CN103473285B (en) 2013-08-29 2013-08-29 Web information extraction method and device based on location markers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310385373.0A CN103473285B (en) 2013-08-29 2013-08-29 Web information extraction method and device based on location markers

Publications (2)

Publication Number Publication Date
CN103473285A true CN103473285A (en) 2013-12-25
CN103473285B CN103473285B (en) 2017-04-12

Family

ID=49798133

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310385373.0A Active CN103473285B (en) 2013-08-29 2013-08-29 Web information extraction method and device based on location markers

Country Status (1)

Country Link
CN (1) CN103473285B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
CN104965929A (en) * 2015-07-24 2015-10-07 网易传媒科技(北京)有限公司 Method and device for data processing
CN107729480A (en) * 2017-10-16 2018-02-23 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device of limited area
CN108228676A (en) * 2016-12-22 2018-06-29 腾讯科技(深圳)有限公司 Information extraction method and system
CN108959204A (en) * 2018-06-22 2018-12-07 中国科学院计算技术研究所 Internet monetary items information extraction method and system
CN109145305A (en) * 2018-09-10 2019-01-04 北京神州泰岳软件股份有限公司 A kind of information extracting method, device and server
CN110968989A (en) * 2018-09-27 2020-04-07 北京国双科技有限公司 Method and device for displaying error correction information on front-end page

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1410918A (en) * 2002-05-31 2003-04-16 浙江大学 Searching engine based on information extraction technique

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1410918A (en) * 2002-05-31 2003-04-16 浙江大学 Searching engine based on information extraction technique

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李文立等: "《基于HTML 树和模板的文献信息提取方法研究》", 《计算机应用研究》 *
王晓飞: "《基于语义和规则的Web网页细粒度信息抽取方法》", 《BLOG.CSDN.NET/ZHANGFEI 2008/ARTICLE/DETAILS/8739674》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462532B (en) * 2014-12-23 2017-07-07 北京奇虎科技有限公司 The method and apparatus that Web page text is extracted
CN104462532A (en) * 2014-12-23 2015-03-25 北京奇虎科技有限公司 Method and device for extracting webpage text
CN104965929B (en) * 2015-07-24 2019-07-02 网易传媒科技(北京)有限公司 A kind of data processing method and device
CN104965929A (en) * 2015-07-24 2015-10-07 网易传媒科技(北京)有限公司 Method and device for data processing
US11093520B2 (en) 2016-12-22 2021-08-17 Tencent Technology (Shenzhen) Company Limited Information extraction method and system
CN108228676A (en) * 2016-12-22 2018-06-29 腾讯科技(深圳)有限公司 Information extraction method and system
CN108228676B (en) * 2016-12-22 2021-08-13 腾讯科技(深圳)有限公司 Information extraction method and system
CN107729480B (en) * 2017-10-16 2020-06-26 中科鼎富(北京)科技发展有限公司 Text information extraction method and device for limited area
CN107729480A (en) * 2017-10-16 2018-02-23 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device of limited area
CN108959204B (en) * 2018-06-22 2021-03-05 中国科学院计算技术研究所 Internet financial project information extraction method and system
CN108959204A (en) * 2018-06-22 2018-12-07 中国科学院计算技术研究所 Internet monetary items information extraction method and system
CN109145305A (en) * 2018-09-10 2019-01-04 北京神州泰岳软件股份有限公司 A kind of information extracting method, device and server
CN109145305B (en) * 2018-09-10 2022-12-16 鼎富智能科技有限公司 Information extraction method and device and server
CN110968989A (en) * 2018-09-27 2020-04-07 北京国双科技有限公司 Method and device for displaying error correction information on front-end page
CN110968989B (en) * 2018-09-27 2023-03-31 北京国双科技有限公司 Method and device for displaying error correction information on front-end page

Also Published As

Publication number Publication date
CN103473285B (en) 2017-04-12

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220711

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co., Ltd

TR01 Transfer of patent right