CN103838732A

CN103838732A - Vertical search engine in life service field

Info

Publication number: CN103838732A
Application number: CN201210475513.9A
Authority: CN
Inventors: 梅昱婷; 刘博�
Original assignee: DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Current assignee: DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority date: 2012-11-21
Filing date: 2012-11-21
Publication date: 2014-06-04

Abstract

The invention relates to a vertical search engine for the life service field. The vertical search engine comprises the following steps: collecting information with a specialized web spider, extracting information, building an index, and retrieving information. Using web spider technology, the engine traverses the portal websites of the life service field, collects and stores topic-related web pages, and performs link analysis and extraction on them. Structured information is extracted by combining DOM topic block extraction with regular expression extraction. A full-text search service is provided to users by building an index over the structured information in the database; weights are set per field so that search results are ranked reasonably. Finally, in view of the characteristics of life services, data presentation is not limited to the Web: the search service is also available anytime and anywhere through mobile phone WAP.

Description

A vertical search engine for the life service field

Technical field

The present invention relates to search engine technology, and in particular to a vertical search engine for the life service field.

Background art

With the rapid development of the Internet, the amount of information on the network is growing sharply, and how to retrieve the required information quickly and accurately from massive network data is a problem in urgent need of a solution. Search engines are the tools we use most often to obtain information from the network. However, most general-purpose search engines are queried by keyword, and their results lean toward general knowledge: the volume of returned information is large, the results are not accurate enough, and the depth is insufficient. Vertical search engines arose to address this. Their search scope is no longer hundreds or even tens of millions of loosely related web pages; instead, they search the domain knowledge of one specific industry, and are a subdivision and extension of search engines. Although a vertical search engine also provides keyword search, the keywords are normally interpreted in the context of the domain knowledge, and the returned results consist mostly of messages and entries. Unlike a general-purpose search engine, a vertical search engine collects web page information only for a particular topic, converts unstructured web page information into structured data, and uses structured data as the minimum unit of search. The data are then stored in a database, segmented into words, and indexed, so that user needs are met through search.

Depending on the professional domain, a vertical search engine provides users with the corresponding information. Search engines for the life service field are among the most important applications of vertical search. Catering, entertainment, shopping, real estate, and other matters that people encounter in daily life can be retrieved quickly and accurately through such a search engine. People no longer need to sift through large amounts of information to find what is useful to them: a vertical search engine for the life service field provides a large amount of valuable information about food, clothing, housing, and transportation, which greatly facilitates daily life.

Compared with other search engines, a vertical search engine for the life service field needs to: use a specialized web spider to traverse the portal websites of the life service field, collect and store topic-related web pages, and perform link analysis and extraction on them; extract structured data from web pages accurately and save it to a database; define weights, segment words, and build an index according to the structured fields; and satisfy users' search requests anytime and anywhere through two presentation modes, WEB and WAP.

Summary of the invention

To better meet users' requirements, the present invention designs and implements a vertical search engine for the life service field. At present, the search engine mainly targets three domains: catering, entertainment, and yellow pages.

To achieve the above object, the technical solution of the present invention is as follows: a vertical search engine for the life service field, comprising the following steps:

A. Collecting information with a specialized web spider

The specialized web spider selects a few dozen relatively authoritative portal websites in a given domain as the initial seed URLs and extracts topic features. For each collected web page, it analyzes the page and extracts as many links as possible, inserting them into the URL queue. It also analyzes the text of the page, extracts its feature terms, and calls the relevance analysis module to compute the relevance between the page and the topic; pages whose result meets the requirement are stored in the web page library.
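The collection loop described above can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: the function name `focused_crawl`, the injected callables `fetch`, `extract_links`, and `relevance`, and the `threshold` and `max_pages` parameters are all assumptions introduced here, since the source does not specify them.

```python
from collections import deque


def focused_crawl(seeds, fetch, extract_links, relevance,
                  threshold=0.3, max_pages=100):
    """Breadth-first crawl that stores only pages judged relevant to the topic.

    fetch(url) -> page text, extract_links(text) -> list of URLs, and
    relevance(text) -> float are hypothetical callables supplied by the caller.
    """
    queue = deque(seeds)      # URL queue, initialised with the seed URLs
    seen = set(queue)         # avoids enqueueing the same URL twice
    page_library = {}         # url -> page text for pages that pass the check
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        # Extract as many links as possible and enqueue the unseen ones.
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
        # Store the page only if its topic relevance meets the threshold.
        if relevance(html) >= threshold:
            page_library[url] = html
    return page_library
```

Passing the fetcher and the relevance function in as parameters keeps the sketch free of network code, so the queueing and filtering logic can be exercised against an in-memory set of pages.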

Topic relevance analysis uses the vector space model. The basic idea is to regard the topic dictionary of the domain (T_1, T_2, ..., T_n) as an n-dimensional coordinate system. For any term T_i, if the web page contains the term, it is given a weight W_i according to its importance; otherwise W_i is 0. Each web page can therefore be converted into a term vector (W_1, W_2, ..., W_n). In the system, W_i is assigned using the TF-IDF model; the TF-IDF value of term T_j in web page D_i is defined by the following formula:

W_{i, j} = {TF}_{i} \cdot \log (\frac{N}{{DF}_{i}})

where TF_i is the number of times term T_j occurs in the web page, DF_i is the number of web pages in the whole page collection D that contain term T_j, and N is the total number of web pages.

Because a web page contains various tags, links, text, and so on, feature words appearing at different positions differ in importance, so the term frequency should be computed as a weighted sum:

{TF}_{i} = a \cdot {TF}_{M} + b \cdot {TF}_{T} + c \cdot {TF}_{K} + d \cdot {TF}_{D} + e \cdot {TF}_{A}

where TF_M, TF_T, TF_K, TF_D, and TF_A are the frequencies of the feature word counted in the body text, the title, the page keywords, the page description, and the anchor text, respectively, and a, b, c, d, and e are the corresponding weighting coefficients.
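As a rough illustration of the two formulas above, the following sketch computes the position-weighted term frequency and the resulting TF-IDF weight. The coefficient values in `POSITION_WEIGHTS`, the position names, and the representation of a page as a dictionary of token lists are assumptions made for this example; the source gives no concrete values for a, b, c, d, and e.

```python
import math

# Hypothetical values for the coefficients a, b, c, d, e (not from the source).
POSITION_WEIGHTS = {
    "body": 1.0,         # a, applied to TF_M
    "title": 3.0,        # b, applied to TF_T
    "keywords": 2.0,     # c, applied to TF_K
    "description": 2.0,  # d, applied to TF_D
    "anchor": 1.5,       # e, applied to TF_A
}


def weighted_tf(term, page):
    """Weighted term frequency: a*TF_M + b*TF_T + c*TF_K + d*TF_D + e*TF_A.

    `page` maps a position name to the list of tokens found at that position.
    """
    return sum(weight * page.get(position, []).count(term)
               for position, weight in POSITION_WEIGHTS.items())


def tfidf_weight(term, page, pages):
    """W = TF * log(N / DF); returns 0.0 when no page contains the term."""
    n = len(pages)
    df = sum(1 for p in pages
             if any(term in tokens for tokens in p.values()))
    if df == 0:
        return 0.0
    return weighted_tf(term, page) * math.log(n / df)
```

Note that with this form of IDF, a term that occurs in every page gets weight 0, which is the expected behaviour of log(N / DF).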

The cosine of the angle between the vectors represents the relevance between a web page and the topic. The result lies between 0 and 1; the larger the value, the better the page matches the topic. The formula is as follows:

Sim (D) = \cos θ = \frac{Σ_{i = 1}^{n} D_{i} \times T_{i}}{\sqrt{(Σ_{i = 1}^{n} D_{i}^{2}) \times (Σ_{i = 1}^{n} T_{i}^{2})}}
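The similarity formula can be checked with a few lines of code. The sketch below assumes the page vector D and the topic vector T are plain lists of non-negative weights of equal length; the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions of this example.

```python
import math


def cosine_similarity(page_vec, topic_vec):
    """Cosine of the angle between a page weight vector and the topic vector."""
    if len(page_vec) != len(topic_vec):
        raise ValueError("vectors must have the same dimension")
    dot = sum(d * t for d, t in zip(page_vec, topic_vec))
    norm = math.sqrt(sum(d * d for d in page_vec) *
                     sum(t * t for t in topic_vec))
    # An all-zero vector has no direction; treat it as unrelated to the topic.
    return dot / norm if norm else 0.0
```

Because TF-IDF weights are non-negative, the result stays within the 0 to 1 range stated in the text.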

B. Structured information extraction

The system completes the structured, accurate extraction of life service information in two steps: first, the DOM tree is used to automatically filter out blocks unrelated to the topic information of the web page; then regular expressions are used for precise extraction, and the results are saved in the database.

B1. Extraction based on DOM topic blocks

Using the STU-DOM tree model, tag nodes such as <table>, <tr>, <div>, and <tbody> of the web page are taken as block nodes. Whether a block is kept or discarded is decided by its local correlativity (Local Correlativity) and contextual correlativity (Contextual Correlativity). Local correlativity is determined by the links and the content within the block, and can be expressed as:

LinkCount ({STU}_{i}) = Σ_{j = 1}^{N} LinkCount ({STUC}_{ij})

ContentLength ({STU}_{i}) = Σ_{j = 1}^{N} ContentLength ({STUC}_{ij})

LocalCorrelativity ({STU}_{i}) = \frac{LinkCount ({STU}_{i})}{ContentLength ({STU}_{i})}

where ContentLength and LinkCount are the number of words and the number of links in a block, respectively, and STUC_{ij} denotes the j-th sub-block of STU_i.

Contextual correlativity is determined by the links within the block and the content of its parent block, and can be expressed as:

ContextualCorrelativity ({STU}_{i}) = \frac{LinkCount ({STU}_{i})}{ContentLength ({STU}_{Pi})}

where STU_{Pi} denotes the parent node of STU_i.
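The two block measures can be illustrated on a small hand-built tree. In this sketch the `Block` class, its `own_links` and `own_words` fields (counts attached directly to a block, needed so that leaf blocks have values to sum), and the choice to return 0.0 for empty blocks and for the root are assumptions for illustration; the source only gives the summation over sub-blocks and the two ratios.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:
    """One STU block of the page (hypothetical structure for illustration)."""
    tag: str
    own_links: int = 0   # links directly inside this block
    own_words: int = 0   # words directly inside this block
    children: List["Block"] = field(default_factory=list)
    parent: Optional["Block"] = field(default=None, repr=False, compare=False)

    def add(self, child):
        child.parent = self
        self.children.append(child)
        return child


def link_count(block):
    """LinkCount(STU_i): own links plus the links of all sub-blocks."""
    return block.own_links + sum(link_count(c) for c in block.children)


def content_length(block):
    """ContentLength(STU_i): own words plus the words of all sub-blocks."""
    return block.own_words + sum(content_length(c) for c in block.children)


def local_correlativity(block):
    length = content_length(block)
    return link_count(block) / length if length else 0.0


def contextual_correlativity(block):
    if block.parent is None:
        return 0.0
    length = content_length(block.parent)
    return link_count(block) / length if length else 0.0
```

A navigation bar typically has many links and little text, so both ratios come out high for it, while a content block with long text and few links scores low. How these values are thresholded to keep or discard a block is not stated in the source.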

B2. Regular expression extraction

For the topic-related blocks retained in the web page, regular expressions are written by hand according to the characteristics of the metadata and serve as the corresponding extraction rules; each rule must guarantee that the data it matches is unique. First, the HTML source fragments corresponding to the information to be extracted are selected and marked. Then, for each marked piece of information, a general matching pattern string is built with a regular expression. Next, the useless HTML tags contained in the successfully matched string are filtered out to obtain plain text. Finally, the correctness of the matching pattern string is verified against sample web pages; if it is incorrect, the pattern is rebuilt.

The completed rules are stored in an XML configuration file, independent of the system code, which makes them easy to modify and maintain. At run time, the system automatically loads the regular expression rules for the selected website according to the website and the web page library path, and completes the structured information extraction.
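A rule file of this kind and its use might look like the sketch below. The XML layout (`rules`, `rule`, the `field` and `site` attributes), the sample patterns, and the sample HTML are all hypothetical; the source states only that rules are regular expressions kept in an XML configuration file and that leftover HTML tags are stripped from each match.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule file for one site; in practice it would be read from disk.
RULES_XML = r"""
<rules site="example-food-site">
  <rule field="name"><![CDATA[<h1 class="shop">(.*?)</h1>]]></rule>
  <rule field="phone"><![CDATA[<span class="tel">\s*([\d-]+)\s*</span>]]></rule>
</rules>
"""

TAG_RE = re.compile(r"<[^>]+>")  # matches any leftover HTML tag


def load_rules(xml_text):
    """Parse the rule file into a mapping of field name -> compiled pattern."""
    root = ET.fromstring(xml_text.strip())
    return {rule.get("field"): re.compile(rule.text.strip(), re.S)
            for rule in root.findall("rule")}


def extract(html, rules):
    """Apply each rule, then strip leftover tags to obtain plain text."""
    record = {}
    for field_name, pattern in rules.items():
        match = pattern.search(html)
        if match:
            record[field_name] = TAG_RE.sub("", match.group(1)).strip()
    return record
```

Keeping the patterns in a separate file means that a layout change on a source website can be handled by editing one rule rather than the extraction code, which matches the maintenance argument made in the text.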

C. Index building

C1. Index module

The system uses Lucene to generate index files for the structured data in the database. Lucene provides a very simple way to build an index: when a Document object is created, the fields (Field) of the document correspond to the structure of the database table or view. Retrieval weights can therefore be controlled by metadata category, and it is also possible to specify which fields need to be indexed, which fields need to be segmented, and so on.

C2. Chinese word segmentation

The system adopts a dictionary-based forward maximum matching segmentation algorithm and a double-character hash index dictionary mechanism. First, the dictionary is loaded and hash tables are built on the first two characters of each dictionary entry, forming a three-level index structure. Then, for the string Str to be segmented, its first character is read. If the character cannot be found in the first-level hash table, it is cut off as a single character, and the pointer moves forward by one character to continue matching. If the first-level hash table does contain the character, the system checks whether the following character is in the second-level hash table. If not, the first character is still cut off as a single character. If so, the lookup maps to the ordered array of strings beginning with these two characters, and the array is traversed to find the longest match; if the match succeeds, the matched string is cut off from Str, and processing continues on the rest of Str until it is empty.
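The segmentation procedure above can be sketched with nested dictionaries standing in for the two hash levels. The function names, the tiny sample dictionary, and the decision to keep each word list sorted longest-first (so that the first hit is the longest match) are assumptions of this example; the source says only that the array is ordered and is traversed to find the longest match.

```python
def build_index(words):
    """Three-level index: first char -> second char -> words, longest first."""
    index = {}
    for word in words:
        if len(word) < 2:
            continue  # single characters are handled by the fallback below
        index.setdefault(word[0], {}).setdefault(word[1], []).append(word)
    for second_level in index.values():
        for word_list in second_level.values():
            word_list.sort(key=len, reverse=True)
    return index


def segment(text, index):
    """Forward maximum matching using the double-character hash index."""
    tokens = []
    i = 0
    while i < len(text):
        candidates = []
        if i + 1 < len(text):
            # Level 1 lookup on text[i], level 2 lookup on text[i + 1].
            candidates = index.get(text[i], {}).get(text[i + 1], [])
        for word in candidates:
            if text.startswith(word, i):
                tokens.append(word)      # longest dictionary word at position i
                i += len(word)
                break
        else:
            tokens.append(text[i])       # no match: cut a single character
            i += 1
    return tokens
```

For example, with the dictionary entries 大连, 大连市, 火锅, and 火锅店, the input 大连市的火锅店 is split into 大连市, 的, and 火锅店, because the longer entry is tried before its prefix.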

D. Information retrieval

First, the search keywords entered by the user are segmented into words. The documents containing each of the resulting words are then looked up in the index file, and these document sets are merged to obtain the final result set. The system computes each document's score with a TF-IDF scoring function and sorts the documents by score, from high to low. The formula is as follows:

Score = \sum_{t \in q} tf (d . t) \cdot idf (t) \cdot boost (d . t . field) \cdot lengthNorm (d . t . field)

where the tf factor is the frequency with which an index term occurs in a document; the idf factor reflects the number of documents containing the index term, and the more such documents there are, the smaller the factor's value; the boost factor can be used to control the importance of a field within a document, as well as the importance of a document among all documents; and the lengthNorm factor reflects the size of the document: the larger the document, the lower the value.
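The scoring formula can be approximated in a few lines. This sketch is not Lucene's implementation: the specific forms chosen here for idf (1 + log(N / df)) and lengthNorm (1 / sqrt(field length)), the raw count used for tf, and the representation of a document as a dictionary of token lists are simplifying assumptions. It also reads tf(d.t) as the frequency of term t in document d.

```python
import math


def score(query_terms, doc, doc_freq, num_docs, field_boost):
    """Sum of tf * idf * boost * lengthNorm over query terms and fields.

    doc:         field name -> list of tokens in that field
    doc_freq:    term -> number of documents containing the term
    field_boost: field name -> boost factor (defaults to 1.0)
    """
    total = 0.0
    for term in query_terms:
        df = doc_freq.get(term, 0)
        if df == 0:
            continue  # term is not in the index at all
        idf = 1.0 + math.log(num_docs / df)  # fewer matching docs -> larger idf
        for field_name, tokens in doc.items():
            tf = tokens.count(term)
            if tf == 0:
                continue
            length_norm = 1.0 / math.sqrt(len(tokens))  # longer field -> lower
            total += tf * idf * field_boost.get(field_name, 1.0) * length_norm
    return total
```

With a boost of 3.0 on a hypothetical `name` field, a shop whose name contains the query term ranks above one that only mentions it in a long description, which is the kind of per-field weighting the text describes.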

Compared with the prior art, the present invention has the following beneficial effects:

1. Because the present invention uses a specialized web spider to collect web page information in a given domain, the collected information has wide coverage and high credibility.

2. The present invention extracts structured information by combining DOM topic block extraction with regular expression extraction. DOM topic block extraction filters out content unrelated to the topic, which greatly improves the efficiency of information extraction; regular expression extraction matches precisely, which improves the accuracy of information extraction.

Brief description of the drawings

The present invention has 4 drawings in total, wherein:

Fig. 1 is a basic function diagram of the present invention;

Fig. 2 is the workflow of the specialized spider in the present invention;

Fig. 3 is the word segmentation workflow in the present invention;

Fig. 4 is a framework diagram of the present invention.

Detailed description of the embodiments

The basic functions of the life service vertical search engine system developed in the present invention are shown in Fig. 1. As shown in Fig. 4, the system can be divided into two parts, a front end and a back end. The back end runs on a server and provides data support for the front-end retrieval service; it comprises three parts: information collection, information extraction, and index building. The front end adopts the B/S (browser/server) model. The system has currently launched vertical search for three domains, catering, entertainment, and yellow pages, covering 37 major cities nationwide. The system also supports WAP search requests through www.zhaocha.mobi, so that users can conveniently access relevant life service information from their mobile phones anytime and anywhere. Information collection, information extraction, index building, and information retrieval are implemented as follows:

A. Collecting information with a specialized web spider

The specialized web spider selects a few dozen relatively authoritative portal websites in a given domain as the initial seed URLs and extracts topic features. For each collected web page, it analyzes the page and extracts as many links as possible, inserting them into the URL queue. It also analyzes the text of the page, extracts its feature terms, and calls the relevance analysis module to compute the relevance between the page and the topic; pages whose result meets the requirement are stored in the web page library. The detailed workflow is shown in Fig. 2.

W_{i, j} = {TF}_{i} \cdot \log (\frac{N}{{DF}_{i}})

{TF}_{i} = a \cdot {TF}_{M} + b \cdot {TF}_{T} + c \cdot {TF}_{K} + d \cdot {TF}_{D} + e \cdot {TF}_{A}

Sim (D) = \cos θ = \frac{Σ_{i = 1}^{n} D_{i} \times T_{i}}{\sqrt{(Σ_{i = 1}^{n} D_{i}^{2}) \times (Σ_{i = 1}^{n} T_{i}^{2})}}

B. Structured information extraction

B1. Extraction based on DOM topic blocks

LinkCount ({STU}_{i}) = Σ_{j = 1}^{N} LinkCount ({STUC}_{ij})

ContentLength ({STU}_{i}) = Σ_{j = 1}^{N} ContentLength ({STUC}_{ij})

LocalCorrelativity ({STU}_{i}) = \frac{LinkCount ({STU}_{i})}{ContentLength ({STU}_{i})}

ContextualCorrelativity ({STU}_{i}) = \frac{LinkCount ({STU}_{i})}{ContentLength ({STU}_{Pi})}

where STU_{Pi} denotes the parent node of STU_i.

B2. Regular expression extraction

C. Index building

C1. Index module

C2. Chinese word segmentation

The system adopts a dictionary-based forward maximum matching segmentation algorithm and a double-character hash index dictionary mechanism. As shown in Fig. 3, the dictionary is first loaded and hash tables are built on the first two characters of each dictionary entry, forming a three-level index structure. Then, for the string Str to be segmented, its first character is read. If the character cannot be found in the first-level hash table, it is cut off as a single character, and the pointer moves forward by one character to continue matching. If the first-level hash table does contain the character, the system checks whether the following character is in the second-level hash table. If not, the first character is still cut off as a single character. If so, the lookup maps to the ordered array of strings beginning with these two characters, and the array is traversed to find the longest match; if the match succeeds, the matched string is cut off from Str, and processing continues on the rest of Str until it is empty.

D. Information retrieval

Score = \sum_{t \in q} tf (d . t) \cdot idf (t) \cdot boost (d . t . field) \cdot lengthNorm (d . t . field)

Claims

1. A vertical search engine for the life service field, characterized by comprising the following steps:

A. Collecting information with a specialized web spider

The specialized web spider selects a few dozen relatively authoritative portal websites in a given domain as the initial seed URLs and extracts topic features; for each collected web page, it analyzes the page and extracts as many links as possible, inserting them into the URL queue; it also analyzes the text of the page, extracts its feature terms, and calls the relevance analysis module to compute the relevance between the page and the topic, and pages whose result meets the requirement are stored in the web page library;

Topic relevance analysis uses the vector space model; the basic idea is to regard the topic dictionary of the domain (T_1, T_2, ..., T_n) as an n-dimensional coordinate system; for any term T_i, if the web page contains the term, it is given a weight W_i according to its importance, otherwise W_i is 0; each web page can therefore be converted into a term vector (W_1, W_2, ..., W_n); in the system, W_i is assigned using the TF-IDF model, and the TF-IDF value of term T_j in web page D_i is defined by the following formula:

W_{i, j} = {TF}_{i} \cdot \log (\frac{N}{{DF}_{i}})

where TF_i is the number of times term T_j occurs in the web page, DF_i is the number of web pages in the whole page collection D that contain term T_j, and N is the total number of web pages;

{TF}_{i} = a \cdot {TF}_{M} + b \cdot {TF}_{T} + c \cdot {TF}_{K} + d \cdot {TF}_{D} + e \cdot {TF}_{A}

where TF_M, TF_T, TF_K, TF_D, and TF_A are the frequencies of the feature word counted in the body text, the title, the page keywords, the page description, and the anchor text, respectively, and a, b, c, d, and e are the corresponding weighting coefficients;

Sim (D) = \cos θ = \frac{Σ_{i = 1}^{n} D_{i} \times T_{i}}{\sqrt{(Σ_{i = 1}^{n} D_{i}^{2}) \times (Σ_{i = 1}^{n} T_{i}^{2})}}

B. Structured information extraction

The system completes the structured, accurate extraction of life service information in two steps: first, the DOM tree is used to automatically filter out blocks unrelated to the topic information of the web page; then regular expressions are used for precise extraction, and the results are saved in the database;

B1. Extraction based on DOM topic blocks

Using the STU-DOM tree model, tag nodes such as <table>, <tr>, <div>, and <tbody> of the web page are taken as block nodes; whether a block is kept or discarded is decided by its local correlativity and contextual correlativity; local correlativity is determined by the links and the content within the block, and can be expressed as:

LinkCount ({STU}_{i}) = Σ_{j = 1}^{N} LinkCount ({STUC}_{ij})

ContentLength ({STU}_{i}) = Σ_{j = 1}^{N} ContentLength ({STUC}_{ij})

LocalCorrelativity ({STU}_{i}) = \frac{LinkCount ({STU}_{i})}{ContentLength ({STU}_{i})}

where ContentLength and LinkCount are the number of words and the number of links in a block, respectively, and STUC_{ij} denotes the j-th sub-block of STU_i;

ContextualCorrelativity ({STU}_{i}) = \frac{LinkCount ({STU}_{i})}{ContentLength ({STU}_{Pi})}

where STU_{Pi} denotes the parent node of STU_i;

B2. Regular expression extraction

For the topic-related blocks retained in the web page, regular expressions are written by hand according to the characteristics of the metadata and serve as the corresponding extraction rules, and each rule must guarantee that the data it matches is unique; first, the HTML source fragments corresponding to the information to be extracted are selected and marked; then, for each marked piece of information, a general matching pattern string is built with a regular expression; next, the useless HTML tags contained in the successfully matched string are filtered out to obtain plain text; finally, the correctness of the matching pattern string is verified against sample web pages, and if it is incorrect, the pattern is rebuilt;

The completed rules are stored in an XML configuration file, independent of the system code, which makes them easy to modify and maintain; at run time, the system automatically loads the regular expression rules for the selected website according to the website and the web page library path, and completes the structured information extraction;

C. Index building

C1. Index module

The system uses Lucene to generate index files for the structured data in the database; Lucene provides a very simple way to build an index: when a Document object is created, the fields (Field) of the document correspond to the structure of the database table or view, so retrieval weights can be controlled by metadata category, and it is also possible to specify which fields need to be indexed, which fields need to be segmented, and so on;

C2. Chinese word segmentation

The system adopts a dictionary-based forward maximum matching segmentation algorithm and a double-character hash index dictionary mechanism; first, the dictionary is loaded and hash tables are built on the first two characters of each dictionary entry, forming a three-level index structure; then, for the string Str to be segmented, its first character is read, and if the character cannot be found in the first-level hash table, it is cut off as a single character and the pointer moves forward by one character to continue matching; if the first-level hash table does contain the character, the system checks whether the following character is in the second-level hash table; if not, the first character is still cut off as a single character; if so, the lookup maps to the ordered array of strings beginning with these two characters, and the array is traversed to find the longest match; if the match succeeds, the matched string is cut off from Str, and processing continues on the rest of Str until it is empty;

D. Information retrieval

First, the search keywords entered by the user are segmented into words; the documents containing each of the resulting words are then looked up in the index file, and these document sets are merged to obtain the final result set; the system computes each document's score with a TF-IDF scoring function and sorts the documents by score, from high to low, using the following formula:

Score = \sum_{t \in q} tf (d . t) \cdot idf (t) \cdot boost (d . t . field) \cdot lengthNorm (d . t . field)