CN101344889B

CN101344889B - Method and system for network information extraction

Info

Publication number: CN101344889B
Application number: CN2008101175173A
Authority: CN
Inventors: 张小栓; 傅泽田; 张健; 胡亮; 穆维松; 赵明; 刘丽欣; 王传义; 冯亚北; 刘雪; 田东; 张领先
Original assignee: China Agricultural University; Beijing Information Science and Technology University
Current assignee: China Agricultural University; Beijing Information Science and Technology University
Priority date: 2008-07-31
Filing date: 2008-07-31
Publication date: 2011-04-13
Anticipated expiration: 2028-07-31
Also published as: CN101344889A

Abstract

The invention relates to the field of internet information processing, and in particular to a method and a system for network information extraction. The technical solution comprises the following steps: S1: information extraction rules are retrieved from a rule database; S2: the information extraction rules are used as query conditions for collecting network information, and web pages that conform to the information extraction rules are used to build a primary database; S3: a user query string is obtained; and S4: the web pages in the primary database that match the query string are output. Unlike traditional search engines, the invention can provide high-precision search results and serves as an effective and accurate tool for obtaining information, thereby substantially improving the efficiency with which people obtain business information.

Description

Method and system for network information extraction

Technical field

The present invention relates to the field of internet information processing, and in particular to a method and system for network information extraction.

Background technology

With the rapid development of information retrieval technology, the field has reached a relatively mature stage. It has progressed from primitive keyword matching to today's approaches based on contextual analysis, pattern matching, example matching, and applied statistical strategies, forming a fairly complete body of ideas and well-developed algorithms that are widely applied in all kinds of search engines.

A search engine is an information retrieval system that uses specific computer programs to collect information from the internet according to a certain strategy, organizes and processes that information, and then provides a query service to users. From the user's perspective, a search engine provides a page containing a search box; after the user enters words in the search box and submits them to the search engine through the browser, the search engine returns a list of information related to the user's input.

Information on the internet is growing explosively, and people's demand for information retrieval has greatly promoted the development of search engine applications. Meanwhile, the strong growth of e-commerce has accumulated an almost unimaginable quantity of commodity supply information. Traditional search engines suffer from problems such as imprecise retrieval results, an excessive volume of information, and non-standardized structure. People increasingly need a more intelligent retrieval tool to extract information such as commodity prices from the massive number of commodity web pages on the portals of e-commerce companies and electronic wholesale markets.

Information extraction learns from a set of document samples by machine learning methods and thereby generates information extraction rules. The information contained in documents can then be structured, organized into a tabular form, output in a fixed format, and stored in a unified form. Because the concrete implementation of information extraction depends on the domain in which it is to be used, the more closely the extraction rules are tied to the domain, the more complete they are and the higher the extraction precision.

Summary of the invention

The problem to be solved by the present invention is to provide a method and system that can search network information precisely, so as to overcome prior-art problems such as imprecise retrieval results, an excessive volume of information, and non-standardized structure.

To achieve the above object, the present invention provides a method for network information extraction, comprising the following steps:

S11: select training sample web pages;

S12: construct a Web file structure tree from the selected training sample web pages;

S13: using machine learning methods, construct corresponding training samples according to different information features and domains, annotate the samples manually, have the learning machine learn from the samples, adjust the induced rule set, and extract information matching patterns from the Web file structure tree;

S14: perform variable replacement on the information matching patterns;

S15: input the variable-replaced information matching patterns into a decision tree trainer. Starting from the topmost root node and traversing the decision tree from top to bottom, each node represents a classification question, and different answers to the question at each node lead to different branches until a leaf node is finally reached; the corresponding information extraction rule can be output from the set of nodes on that path;

S16: store the information extraction rules in a rule database;

S17: retrieve the information extraction rules from the rule database;

S18: collect network information using the information extraction rules as query conditions, and build a primary database from the web pages that conform to the information extraction rules;

S19: obtain the user's query string;

S20: output the web pages in the primary database that match the query string.
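The top-down traversal described in step S15 can be illustrated with a small Python sketch. It shows only the walk from the root to a leaf and the collection of the nodes on that path; it does not train a tree. The `Node` class, the two questions (`has_price`, `has_date`), and the leaf names are hypothetical, since the source does not say which classification questions the decision tree trainer uses.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Node:
    # Internal nodes carry a question; leaf nodes have no question and no children.
    name: str
    question: Optional[Callable[[str], str]] = None
    children: Dict[str, "Node"] = field(default_factory=dict)


def walk(root: Node, pattern: str) -> List[Tuple[str, str]]:
    """Return the (node name, answer) pairs on the root-to-leaf path for a pattern."""
    path = []
    node = root
    while node.question is not None:
        answer = node.question(pattern)      # the answer selects the branch
        path.append((node.name, answer))
        node = node.children[answer]
    path.append((node.name, "leaf"))
    return path


# Hypothetical classification questions (not taken from the source).
def has_price(pattern: str) -> str:
    return "yes" if "{PRICE}" in pattern else "no"


def has_date(pattern: str) -> str:
    return "yes" if "{DATE}" in pattern else "no"


tree = Node("has_price", has_price, {
    "yes": Node("has_date", has_date, {
        "yes": Node("price_and_date_rule"),
        "no": Node("price_only_rule"),
    }),
    "no": Node("no_rule"),
})

path = walk(tree, "{NAME} price: {PRICE} {DATE}")
```

In this sketch the node set on the path stands in for the extraction rule; how the path is turned into a stored rule is left open, as in the source.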

Selecting the training sample web pages in step S11 comprises the following steps:

S111: build a dictionary related to the content to be searched;

S112: use the dictionary to filter out web pages unrelated to the dictionary content, and screen out a batch of web pages with different layout structures;

S113: delete the interfering information from the screened web pages;

S114: annotate the information points in the web pages according to the dictionary content;

S115: store the web pages with annotated information points in a training sample database;

S116: select the training sample web pages from the training sample database.

Constructing the Web file structure tree from the selected training sample web pages in step S12 comprises the following steps:

1) if the file being read has not reached end-of-file, read a tag from the file and go to 2); if end-of-file has been reached, the algorithm ends;

2) if the tag is a start tag: if the root node is empty, create the root node and make it the current node; otherwise, if the root node is not empty, create a new node from the tag obtained, make it a child of the current node, mark the newly created node as a matched node, and go to 1);

If the tag is not a start tag: if the tag obtained differs from that of the current node, create a new node from the tag obtained, make it a child of the current node, make the newly created node the current node, and obtain the content of the current node; otherwise, if the tag obtained is the same as that of the current node, create a new node from the tag obtained, make it a child of the current node, mark the current node as a matched node, make the newly created node the current node, and obtain the content of the current node; go to 3);

3) if the tag is the matching end tag of the current node's tag, mark this node as a matched node, make its parent the current node, and go to 1); otherwise go to 4);

4) if no node matching this end tag is found, backtrack to the first unmatched ancestor of the current node; otherwise, if that ancestor matches the end tag, mark the ancestor as a matched node, make its parent the current node, and go to 1).
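The tag-matching procedure above can be sketched in Python. This is a minimal sketch under stated assumptions rather than the patented implementation: it reads tags with the standard-library `html.parser`, the names `Node`, `TreeBuilder`, and `build_tree` are hypothetical, and treating `img` and `br` as tags that never receive an end tag is an assumption made for illustration.

```python
from html.parser import HTMLParser


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""
        self.matched = False  # True once the node's end tag has been seen


class TreeBuilder(HTMLParser):
    VOID = {"img", "br"}  # assumption: tags with no end tag

    def __init__(self):
        super().__init__()
        self.root = None
        self.current = None

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if self.root is None:
            # First start tag: create the root and make it the current node.
            self.root = node
            self.current = node
            return
        self.current.children.append(node)
        if tag in self.VOID:
            node.matched = True   # matched immediately; current node unchanged
        else:
            self.current = node   # descend into the new node

    def handle_endtag(self, tag):
        # Backtrack to the nearest unmatched ancestor with the same tag.
        node = self.current
        while node is not None and (node.matched or node.tag != tag):
            node = node.parent
        if node is None:
            return                # stray end tag: ignore it
        node.matched = True
        self.current = node.parent if node.parent is not None else node

    def handle_data(self, data):
        if self.current is not None and data.strip():
            self.current.text += data.strip()


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

For example, `build_tree("<table><tr><td>garlic</td></tr></table>")` returns a `table` node whose only child is a `tr` node containing one `td` node with the text `garlic`. Nodes skipped during backtracking stay unmatched, which is one way to tolerate pages with missing end tags.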

Building the primary database in step S18 comprises the following steps:

S181: delete the interfering information from the screened web pages;

S182: use an information extraction algorithm to extract the information points from the web pages processed in step S181;

S183: store the web pages whose information points have been extracted in the primary database.

The present invention also provides a network information extraction system, comprising:

An information extraction rule extracting device, which extracts the rule information for the content to be retrieved and uses this rule information to search network information;

A primary database building device, which builds the primary database from the web pages that conform to the information extraction rules;

A web page search device, which searches the primary database according to the character string entered by the user and outputs the web pages whose information matches.

The information extraction rule extracting device comprises:

A training sample web page selection unit, which searches for training sample web pages using words related to the content to be searched as search conditions;

A Web file structure tree construction unit, which constructs a Web file structure tree from the training web pages found;

An information matching pattern refining unit, which extracts information matching patterns from the Web file structure tree;

A variable conversion unit, which performs variable replacement on the information matching patterns;

An information extraction rule database building unit, which outputs the information extraction rules generated from the information matching patterns processed by the decision tree trainer and builds the information extraction rule database from those rules;

An information extraction rule retrieval unit, which retrieves the information extraction rules from the information extraction rule database.

The primary database building device comprises:

An information extraction rule recognition unit, which identifies the information extraction rules;

An original web page acquiring unit, which searches for qualifying network information according to the identified information extraction rules;

An information extraction unit, which extracts the information points from the web pages;

An information saving unit, which builds the database of web pages whose information points have been extracted.

The web page search device comprises:

A user query acquiring unit, which obtains the query string entered by the user;

A lookup unit, which performs a matching search in the primary database according to the string entered by the user;

An output unit, which outputs the web pages found in the form of a web page collection.

The training sample web page selection unit further comprises:

A dictionary building subunit, which builds a relevant dictionary according to the content to be searched;

A web page filtering subunit, which searches for qualifying web pages according to the words of the relevant dictionary and deletes the interfering information from those web pages;

A training sample database building subunit, which annotates the information points in the web pages according to the dictionary content and builds a database from the processed web pages;

A training sample web page selection subunit, which extracts web pages from the training sample database.

Compared with the prior art, the present invention has the following advantages:

Unlike traditional search engines, the present invention can provide high-precision retrieval results; it is an effective and accurate tool for obtaining information and greatly improves the efficiency with which people obtain business information.

Description of drawings

Fig. 1 is a flowchart of the method for network information extraction of the present invention.

Embodiment

The specific embodiments of the present invention are described in further detail below with reference to the drawing and an embodiment. The following embodiment illustrates the present invention but does not limit its scope.

A detailed implementation in the agricultural commodity domain is given below. As shown in Fig. 1, the present invention comprises the following steps:

Step s101: select training samples

1. Manually build a dictionary related to agricultural products, mainly covering content such as product varieties, places of origin, specifications, and brief descriptions; the more data it contains, the higher the training precision;

2. Use the dictionary to filter out web pages unrelated to the topic, and screen out a batch of web pages with different layout structures that contain agricultural product name, price, selling time, and selling location information;

3. Further process the screened web pages by deleting interfering information such as pictures and copy;

4. Manually annotate the agricultural product name, price, selling time, and selling location in the processed web pages, for example "{garlic}", "{0.95}", "{2007-10-1}", "{Shouguang, Shandong wholesale vegetable market}";

5. Store the annotated web pages in the training sample database.
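The sample preparation in step s101 can be sketched in Python as follows. This is an illustrative sketch only: the dictionary entries, the function names (`is_relevant`, `strip_interference`, `annotate`), the `min_hits` threshold, and the choice to treat only `<img>` tags as interfering information are all assumptions. In the source the annotation is done by hand; `annotate` merely shows the brace notation that results.

```python
import re

# Hypothetical dictionary of agricultural terms (varieties, places of origin).
DICTIONARY = {"garlic", "cabbage", "Shouguang"}


def is_relevant(text, dictionary=DICTIONARY, min_hits=1):
    """Keep a page only if it mentions at least min_hits dictionary terms."""
    lowered = text.lower()
    hits = sum(1 for term in dictionary if term.lower() in lowered)
    return hits >= min_hits


def strip_interference(html):
    """Remove picture tags; other kinds of interference are not handled here."""
    return re.sub(r"(?i)<img\b[^>]*>", "", html)


def annotate(text, values):
    """Wrap each known information point in braces, e.g. garlic -> {garlic}."""
    for value in values:
        text = text.replace(value, "{" + value + "}")
    return text


sample = "garlic from Shouguang, Shandong wholesale vegetable market price: 0.95 2007-10-1"
marked = annotate(
    sample,
    ["garlic", "0.95", "2007-10-1", "Shouguang, Shandong wholesale vegetable market"],
)
```

The simple string replacement in `annotate` assumes that no information point is a substring of another, which holds for the example values but would need care on real pages.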

Step s102: build extraction rules

1. Process the training sample web pages and construct the Web file structure tree by matching HTML tags. The detailed steps are as follows:

1) while the file being read has not reached end-of-file, do the following:

2) if a tag is obtained successfully;

3) if it is a start tag and the root node is empty, create the root node and make it the current node;

4) if it is a start tag and the root node is not empty;

5) if the tag obtained is "img";

6) create a new node from the tag obtained, make it a child of the current node, and mark the newly created node as a matched node;

7) if the tag obtained differs from that of the current node, create a new node from the tag obtained, make it a child of the current node, make the newly created node the current node, and obtain the content of the current node;

8) if the tag obtained is the same as that of the current node, create a new node from the tag obtained, make it a child of the current node, mark the current node as a matched node, make the newly created node the current node, and obtain the content of the current node;

9) if it is an end tag;

10) if it is the matching end tag of the current node's tag, mark this node as a matched node and make its parent the current node;

11) if no node matching this end tag is found, backtrack to the first unmatched ancestor of the current node;

12) if that ancestor matches the end tag, mark the ancestor as a matched node and make its parent the current node.

2. utilize the method for machine learning from Web file structure tree, to extract the information matches pattern, as " { garlic } is from { Shouguang, Shandong wholesale vegetable market } price: 0.95}{2007-10-1} ", wherein { garlic }, { Shouguang, Shandong wholesale vegetable market }, 0.95}, 2007-10-1} is the information point mark in the training sample, and with these pattern storage in database;

3. draw decimation rule from the information matches library, step is as follows:

1) pattern being carried out variable replaces, NAME represents trade name, and SOURCE represents the selling spot, and PRICE represents selling price, DATE represents selling time, %s{0, n} represent 0 space or invisible character to n quantity, and it is infinitely great that wherein n gets-1 expression, then can obtain as " NAME}%s{2; and 2} from: SOURCE}%s{1, the 10} price: PRICE}%s{1 ,-1}{2007-10-1} " and more higher leveled pattern;

2) the pattern input decision tree trainer that will handle, thereby the corresponding decimation rule of output.

4. decimation rule is stored in rule database.
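One way to make a variable-replaced pattern executable is to compile it into a regular expression. The Python sketch below is an assumption-laden illustration, not the method of the source: the `compile_pattern` function and the sub-patterns in `VARIABLES` are hypothetical, and `%s{a,b}` is interpreted as a whitespace run of length a to b, with b = -1 meaning no upper bound.

```python
import re

# Hypothetical sub-patterns for each variable; the source does not define them.
VARIABLES = {
    "NAME": r"(?P<NAME>\S+)",
    "SOURCE": r"(?P<SOURCE>\S.*?)",
    "PRICE": r"(?P<PRICE>\d+(?:\.\d+)?)",
    "DATE": r"(?P<DATE>\d{4}-\d{1,2}-\d{1,2})",
}

# Matches either a {VARIABLE} placeholder or a %s{lo,hi} whitespace run.
TOKEN = re.compile(r"\{(NAME|SOURCE|PRICE|DATE)\}|%s\{(\d+),(-?\d+)\}")


def compile_pattern(pattern):
    """Translate a generalized matching pattern into a compiled regex."""
    parts = []
    pos = 0
    for m in TOKEN.finditer(pattern):
        parts.append(re.escape(pattern[pos:m.start()]))  # literal text
        if m.group(1):
            parts.append(VARIABLES[m.group(1)])
        else:
            lo, hi = m.group(2), m.group(3)
            upper = "" if hi == "-1" else hi             # -1 means unbounded
            parts.append(r"\s{%s,%s}" % (lo, upper))
        pos = m.end()
    parts.append(re.escape(pattern[pos:]))
    return re.compile("".join(parts))


rule = compile_pattern(
    "{NAME}%s{0,2}from:%s{0,2}{SOURCE}%s{1,10}price:%s{0,2}{PRICE}%s{1,-1}{DATE}"
)
text = "garlic from: Shouguang, Shandong wholesale vegetable market  price: 0.95   2007-10-1"
match = rule.search(text)
fields = match.groupdict() if match else {}
```

Here `fields` holds the four extracted information points keyed by variable name. The example pattern differs slightly from the one quoted in the text so that it matches a line with ordinary single spaces.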

Step s103: collect HTML web pages

1. The data collector of the search engine has a built-in stack list for storing website URL addresses, such as http://www.gov.cn;

2. Pop the URL at the top of the stack list; the data collector then sends an HTTP-POST request to that URL address and obtains the web page content corresponding to the URL via HTTP-GET;

3. Extract all related URL addresses from the web page obtained, push them onto the end of the stack list, and then submit the web page to the information extractor of the search engine.
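The stack-driven collection loop of step s103 can be sketched in Python. This is a sketch under assumptions: `crawl`, `LinkParser`, `fetch`, and `handle_page` are hypothetical names; page retrieval is passed in as a function so that the loop can run without network access; and the `seen` set and `max_pages` limit are additions not mentioned in the source, included so that the loop terminates on cyclic links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, handle_page, max_pages=100):
    """Pop URLs from a stack, fetch them, push new links, hand pages on."""
    stack = list(seeds)
    seen = set(seeds)          # assumption: skip URLs already queued
    visited = []
    while stack and len(visited) < max_pages:
        url = stack.pop()      # take the URL at the top of the stack
        html = fetch(url)      # returns the page text, or None on failure
        if html is None:
            continue
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            if link.startswith("http") and link not in seen:
                seen.add(link)
                stack.append(link)
        handle_page(url, html)  # e.g. submit to the information extractor
    return visited
```

In a live setting `fetch` could wrap `urllib.request.urlopen`; for experimentation it can simply be the `get` method of a dictionary that maps URLs to page text.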

Step s104: extract commodity information

1. After the information extractor of the search engine receives a collected web page, it deletes interfering information such as pictures and copy from the web page and corrects erroneous HTML tags;

2. Apply the information extraction algorithm to the processed web page to extract the information points. The detailed steps are as follows:

If the child node of the web page structure tree node being processed is NULL and the first two characters of the node's tag string are "td";

If the value of dataKind is UnKnown or Cost;

Revise patternText according to the learned extraction rules;

Match the information to be extracted;

If the match succeeds;

matchString records the matched information;

Find the lowest "table" ancestor node of the node containing the matched information;

Assign matchLocation a reference to this "table" ancestor node;

Assign dataKind the value Kind;

If the value of dataKind is Kind;

Match the other information to be extracted;

If the match succeeds;

If the lowest "table" ancestor node of the node containing the matched information is the same as matchLocation;

Extract the information and store it in the database;

The global variables used in this algorithm have the following meanings:

dataKind: records the current data type during data extraction;

matchLocation: indicates the lowest "table" ancestor node of the tree node being extracted during data extraction;

matchString: stores the information extracted during data extraction;

patternText: the information matching pattern string used for data extraction;

3. Store the extracted information in the commodity database.
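The extraction walk of step s104 can be sketched in Python. This is a simplified sketch, not the algorithm as patented: the `Node` class, the `KIND` and `OTHER` patterns, and the record layout are hypothetical, and the sketch retries the product-name match on every `td` leaf because the source does not state when dataKind returns to Cost. What it does preserve is the central check that further fields are accepted only from cells under the same lowest "table" ancestor as the matched name.

```python
import re


class Node:
    def __init__(self, tag, text="", children=None):
        self.tag, self.text, self.parent = tag, text, None
        self.children = children or []
        for child in self.children:
            child.parent = self


def nearest_table(node):
    """Return the lowest ancestor (or the node itself) whose tag is 'table'."""
    while node is not None and node.tag != "table":
        node = node.parent
    return node


def leaves(node):
    """Yield the nodes that have no children, in document order."""
    if not node.children:
        yield node
    for child in node.children:
        yield from leaves(child)


KIND = re.compile(r"garlic|cabbage")            # hypothetical name pattern
OTHER = {                                       # hypothetical field patterns
    "price": re.compile(r"^\d+\.\d+$"),
    "date": re.compile(r"^\d{4}-\d{1,2}-\d{1,2}$"),
}


def extract(root):
    records = []
    data_kind = "Unknown"
    match_location = None
    current = None
    for node in leaves(root):
        if not node.tag.startswith("td"):
            continue
        m = KIND.search(node.text)
        if m:
            current = {"name": m.group(0)}      # plays the role of matchString
            records.append(current)
            match_location = nearest_table(node)
            data_kind = "Kind"
        elif data_kind == "Kind" and nearest_table(node) is match_location:
            for field_name, pattern in OTHER.items():
                if pattern.match(node.text):
                    current[field_name] = node.text
    return records
```

Cells that lie in a different table from the most recently matched product name are ignored, which keeps unrelated numbers elsewhere on the page out of the record.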

Step s105: provide the search function

1. The retrieval interface of the search engine accepts the keywords entered by the user, deletes the function words among them, and submits the rest to the searcher. Function words are words that have no complete lexical meaning but do have grammatical or functional meaning, including adverbs, prepositions, conjunctions, auxiliary words, interjections, and onomatopoeia, for example "and" and "but";

2. The searcher matches the data containing the keywords in the commodity database and returns it to the retrieval interface in the form of a web page collection, that is, a text that stores the collected HTML web pages numbered in order;

3. The retrieval interface paginates and sorts the returned results and then outputs them to the user.
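The three search steps can be sketched in Python as follows. The function-word list, the record layout, and the ranking by number of matched keywords are all assumptions made for illustration; the source says only that results are sorted and paginated, without naming a sort criterion.

```python
# Hypothetical function-word list; the source gives "and" and "but" as examples.
FUNCTION_WORDS = {"and", "but", "of", "the", "in"}


def clean_query(query):
    """Lowercase the query and drop function words."""
    return [w for w in query.lower().split() if w not in FUNCTION_WORDS]


def search(records, query, page=1, page_size=10):
    """Return one page of records that contain at least one query keyword."""
    terms = clean_query(query)
    hits = []
    for record in records:
        text = " ".join(str(v) for v in record.values()).lower()
        score = sum(1 for term in terms if term in text)
        if score:
            hits.append((score, record))
    # Assumed ranking: more matched keywords first; ties keep database order.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    start = (page - 1) * page_size
    return [record for _, record in hits[start:start + page_size]]


records = [
    {"name": "garlic", "market": "Shouguang", "price": "0.95"},
    {"name": "cabbage", "market": "Shouguang", "price": "0.40"},
    {"name": "garlic", "market": "Beijing", "price": "1.10"},
]
first_page = search(records, "garlic and Shouguang", page=1, page_size=2)
```

With this data the first page contains the Shouguang garlic record followed by the Shouguang cabbage record, and the second page contains the Beijing garlic record.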

The above is only a preferred embodiment of the present invention. It should be noted that those skilled in the art may make further improvements and modifications without departing from the technical principles of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. A method for network information extraction, characterized in that it comprises the following steps:

S11: select training sample web pages;

S14: perform variable replacement on the information matching patterns, the variable replacement being replacement with variables that contain wildcards;

S16: store the information extraction rules in a rule database;

S17: retrieve the information extraction rules from the rule database;

S19: obtain the user's query string;

2. the method for network information extraction according to claim 1 is characterized in that, chooses the training sample webpage and comprise the steps: in step S11

S111: set up a dictionary relevant with content to be searched;

S113: the interfere information in the webpage that deletion filters out;

S116: from the training sample database, choose the training sample webpage.

3. the method for network information extraction according to claim 1 is characterized in that, the processing of in step S12 the training sample webpage of choosing being constructed web file structure tree comprises the steps:

4. the method for network information extraction according to claim 1 is characterized in that, in step S18, sets up the one-level database and comprises the steps:

S181: the interfere information in the webpage that deletion filters out;

5. A network information extraction system, characterized in that it comprises:

A web page search device, which searches the primary database according to the character string entered by the user and outputs the web pages whose information matches;

wherein the information extraction rule extracting device comprises:

6. The network information extraction system according to claim 5, characterized in that the primary database building device comprises:

An information extraction unit, which extracts the information points from the web pages;

7. The network information extraction system according to claim 5, characterized in that the web page search device comprises:

8. The network information extraction system according to claim 5, characterized in that the training sample web page selection unit further comprises: