CN105404677A

CN105404677A - Tree structure based retrieval method

Info

Publication number: CN105404677A
Application number: CN201510818902.0A
Authority: CN
Inventors: 陈虹宇; 罗阳; 苗宁
Original assignee: SICHUAN SHENHU TECHNOLOGY Co Ltd
Current assignee: Sichuan Shenhu Technology Co.,Ltd.
Priority date: 2015-11-20
Filing date: 2015-11-20
Publication date: 2016-03-16
Anticipated expiration: 2035-11-20
Also published as: CN105404677B

Abstract

The present invention proposes a tree structure based retrieval method. The method is used for processing Chinese web page data in Chinese searching engines. The method comprises: S100, pre-processing web page data; S200, establishing a web page data index file; and S300, receiving a query character string input by a user and carrying out retrieval according to a web page data index. According to the tree structure based retrieval method disclosed by the present invention, an intra-binary correlated subsequent tree model is adopted to establish the index for the web page data, and the advantages and disadvantages of a character index and a word index are simultaneously considered, so that when the index space is reduced, the retrieving efficiency is enhanced.

Description

A kind of search method based on tree structure

Technical field

The present invention relates to data processing field, be specifically related to a kind of search method based on tree structure.

Background technology

Along with the develop rapidly of internet, the exponential increase of information, the diversity of data mode, people are difficult in the information of Hai Liang, find the part meeting oneself demand rapidly.The appearance of full-text database, substantially improves this present situation.Full-text database, also referred to as text database, it is the system of management mass text.The work that full-text database will complete remains two large functions of traditional database: storing and retrieval, is exactly specifically the storage of text data and the retrieval of arbitrary string.Character string as search condition can be constant ocra font ocr string, also can be the string assemble that regular expression (or other modes, such as distance limit etc.) a group of representing has common trait.

More common and popular Full-Text Index Model has following several model at present: documents signed (DS) (SignatureFiles), bitmap (BitMap), inverted list (InvertedList), Σ ²matrix Pat tree and Pat array etc.These models are under the effort of brainstrust, and quite maturation is also used widely in practice.

The method extended out from title index is exactly present most widely used inverted list model.It has the advantages that to create index speed, widespread use in network search engines.But the storage space needed for it is comparatively large, and inquiry velocity is slower.Although documents signed (DS) realizes simple, be to locate the vector that a suitable hash function and width be applicable to very difficult, and different because of object.If do not chosen, then Query Result just there will be suitable uncertainty.Bitmap file index structure thinking is simple, and easy to use, time efficiency is high, especially efficient in Boolean retrieval, but its space efficiency is very low, even if employ bits compression algorithm, is still difficult to accept.The great advantage of Pat tree-model is that recall precision is very high, and especially special to model retrieval is as higher in the recall precision such as prefix search, range retrieval.But the same with bitmap models, space efficiency is extremely low, and in constructive process, space expense is larger, creates efficiency also very low.Pat array is the amendment to Pat tree, and the leaf node serialization that Pat sets just is obtained Pat array by it.Although Pat array have compressed the expense in constructive process to a great extent, because adopt the storage mode of array, it creates and merges needs mobile a large amount of data, and dynamic is difficult to satisfactory.

Summary of the invention

At least part of solution problems of the prior art, the present invention proposes a kind of search method based on tree structure, in Chinese search engine to the process of Chinese web page data, comprising:

Step S100, web data pre-service;

Step S200, sets up web data index file;

Step S300, receives the inquiry string of user's input, retrieves according to web data index.

The described search method based on tree structure, wherein, the described web data index file in step S200 be to process after web data set up web data index composition file.

The described search method based on tree structure, wherein, described web data index is word indexing.

The described search method based on tree structure, wherein, described web data index is glossarial index.

The described search method based on tree structure, wherein, described web data index is word indexing and glossarial index.

The described search method based on tree structure, wherein, described web data index is the index created based on relevant subsequent tree in binary.

The described search method based on tree structure, wherein, step S100 comprises further:

First the original web page captured is classified, and then extract the text message in webpage respectively according to classification, obtain sorted text message; Web page index file is set up in each classification that the process of generating web page index file comprises for original web page respectively.

The described search method based on tree structure, wherein, in step s 200, set up web data index file and comprise further:

First, judge the capacity of the text message of each classification, when the capacity of described classification is less than 1GB, for the text message of described classification sets up word indexing, when the capacity of described classification is more than or equal to 1GB, for the text message of described classification sets up glossarial index.

The described search method based on tree structure, comprises further:

Inquiry string is decomposed into word and word respectively, is the situation of word indexing for web data index, retrieves according to described word indexing by radical; For the situation that web data index is glossarial index, retrieve according to described glossarial index by participle.

The described search method based on tree structure, concrete retrieving is:

First stage, the situation being word indexing for web data index is retrieved;

First each word after inquiry string decomposition is sequentially read in, get first character A, for word indexing, finding with A in relevant subsequent tree in binary is the tree of root, then the next word B of branch ground matching inquiry character string one by one in the leaf of tree A, the next number of B is then added queue by the words matching B, until end is all mated in whole branches of A; Forwarding to B is the tree of root, the branch number of B tree is taken out from queue, search the next word C that corresponding leafy node comes in matched character string, circulation like this is not until once to match in matching process or inquiry string all mates end, if the match is successful, then mean the original text that have found and comprise inquiry string;

Subordinate phase, the situation being glossarial index for web data index is retrieved;

First each word after inquiry string decomposition is sequentially read in, get first word A, for glossarial index, finding with A in relevant subsequent tree in binary is the tree of root, then the next word B of branch ground matching inquiry character string one by one in the leaf of tree A, the next number of B is then added queue by the words matching B, until end is all mated in whole branches of A; Forwarding to B is the tree of root, the branch number of B tree is taken out from queue, search the next word C that corresponding leafy node comes in matched character string, circulation like this is not until once to match in matching process or inquiry string all mates end, if the match is successful, then mean the original text that have found and comprise inquiry string.

The present invention adopts relevant subsequent tree-model in binary to be that web data creates index, considers the relative merits of word indexing and glossarial index simultaneously, while minimizing index space, improves recall precision.

Accompanying drawing explanation

Fig. 1 is the process flow diagram of the search method that the present invention is based on tree structure;

Embodiment

Below in conjunction with accompanying drawing of the present invention, technical scheme of the present invention is clearly and completely described.Here will be described exemplary embodiment in detail, its sample table shows in the accompanying drawings.When description below relates to accompanying drawing, unless otherwise indicated, the same numbers in different accompanying drawing represents same or analogous key element.Embodiment described in following exemplary embodiment does not represent all embodiments consistent with the present invention.On the contrary, they only with as in appended claims describe in detail, the example of apparatus and method that aspects more of the present invention are consistent.

First, be described below using some terms in the present invention:

(1) source document storehouse: source document storehouse refers to the original web page file set captured from internet by web crawlers, being used for setting up index is that user search uses, this set is not static, according to the strategy captured or regular batch updating, or incremental update, main to ensure the timeliness n of system, capture timely by emerging webpage as far as possible and come to set up index, meet the Search Requirement of user;

(2) pre-service: pre-service refers to the process processed the web page files captured, comprise: set up index web page library, info web extracts, setting up index web page library is exactly the URL that will realize a given webpage, can find the webpage corresponding to this URL in index web page library; It is exactly from webpage, extract the text message set up required for index that info web extracts, and comprises title, text etc.;

(3) text: the set of the text message formed is extracted through the info web of pre-treatment step in source document storehouse, is the direct object setting up index;

(4) dictionary for word segmentation: dictionary for word segmentation is the basis of participle operation, is the set of Chinese vocabulary, is used for distinguishing when participle and being syncopated as the word in text;

(5) participle operation: participle operation is exactly process text dividing being become word combination, the inquiry string of pretreated text and user's input is all its operand, it correctly can match vocabulary according to dictionary for word segmentation in text-string, weed out stop words, export the set of index term or term;

(6) interior relevant subsequent tree index is set up: this operation is exactly the process of relevant subsequent tree in setting up the index term set formed after text participle;

(7) glossarial index file: glossarial index file is exactly the index file generated in previous action, in glossarial index file, each index term has a follow-up tree being root with it, contains numbering and its follow-up of index term place text in the node of tree;

(8) inquiry string: the query demand of user normally represents with the form of character string, the word that the simplest inquiry string is single often, for longer inquiry string, as phrase or sentence, carry out word segmentation processing, the segmentation methods adopted should be consistent with the segmentation methods used text, to ensure the accuracy of result for retrieval;

(9) search operaqtion: for the set of term according to the lookup algorithm of interior relevant subsequent tree-model, search in glossarial index file, the text matched number is outputted to result set;

(10) result set: result set is the text number generally referring to the inquiry string place obtained by search operaqtion, the text obtained may be a series of document, this just needs according to certain information retrieval model collection of document process, draw the collection of document that degree of correlation is maximum, present to user.

The related definition of interior relevant subsequent tree:

(1) follow-up: to the character string a in text T ₁a ₂, a ₂be called a ₁follow-up, the follow-up of last character of text T is called end mark, represents by " # ".Follow-up definition is the basis of interior relevant subsequent tree-model, always have identical character in text to occur, be exactly identical word or word specifically, if some index entry a have occurred k time, so a has just had k follow-up (a is not the ending of text), is denoted as a [s], s=1,2 ..., k.

(2) the follow-up expression formula of unitary and the follow-up tree of unitary: suppose that T is by character string a in full ₁, a ₂..., a _n, # composition, if a wherein _i1=a _i2=...=a _ikbe identical character, be designated as a, and a _i1+1, a _i2+1..., a _ik+1be the follow-up of them respectively, then all a and the follow-up of it just constitute a follow-up expression formula a (a of unitary _i1+1, a _i2+1..., a _ik+1), describe this expression formula with one tree, a is tree root, a _i1+1, a _i2+1..., a _ik+1be its follow-up node, this tree just becomes the follow-up tree of unitary of a.

(3) the follow-up expression formula of binary and the follow-up tree of binary: the follow-up expression formula of unitary is expanded, if having identical character string a in original text T _i1a _i1+1=a _i2a _i2+1=...=a _ika _ik+1, be denoted as ab, then all ab and its follow-uply just constitute a follow-up expression formula of binary, be denoted as a (b (a _i1+2, a _i2+2..., a _ik+2)).

Citing: (text in the present invention's citing is all for this text in text abcbacabacc#, called after T), ab character string has occurred 2 times, then its text library exists a follow-up expression formula of binary and is: a (b (c, a)).This expression formula is represented, by a with tree _i1+2, a _i2+2..., a _ik+2these are follow-up represents by its sequence number of their place branches in the follow-up tree of unitary taking b as root, can change into: a (b (1,3)) for the follow-up expression formula of binary upper example.The follow-up tree representation of binary of a is: a is tree root, a _i1+1, a _i2+1..., a _ik+1the follow-up of a, (a _i1+1, tag ₁), (a _i2+1, tag ₂) ..., (a _ik+1, tag _k) then as the follow-up node of a, wherein, tag ₁, tag ₂, tag _kbe divided into is with a _i1+1, a _i2+1..., a _ik+1for a in the follow-up tree of unitary of root _i1+1, a _i2+1..., a _ik+1the sequence number of follow-up place branch.Be followed successively by b, c, b, c for above-mentioned the follow-up of text T, a, corresponding follow-up node is followed successively by (b, 1), (c, 2), (b, 3), (c, 3).

The definition of interior relevant subsequent tree: set the forest formed by the follow-up of all index entries of document whole in a source document storehouse, be called the interior relevant subsequent tree in this source document storehouse, when described follow-up tree is the follow-up tree of binary, this interior relevant subsequent tree is relevant subsequent tree in binary.

See Fig. 1, the present invention propose a kind of search method based on tree structure, in Chinese search engine to the process of Chinese web page data, comprising:

Step S100, web data pre-service;

(1) extract the text message in webpage, generate corresponding text and text is numbered.This process groundwork is extracted by the text message in web page files, generates the document being used for networking page data index.

(2) generating web page index file.

A web page index file is needed to carry out the relevant information of recording text numbering and its source page, as information such as web page title, URL and webpage sizes, its effect has two aspects: one is setting up the follow-up web data index stage, the numbering recording document is only needed in follow-up tree node, instead of tediously long filename; Two is the URL information can finding out source web page in retrieval phase according to text numbering from web page index file, thus outputs in a particular manner on result display page.

(3) punctuation mark in text is removed, make text become the set of short character strings.

The input of this step is the web page files grabbed, and output is the text of word to be cut.

Step S200, sets up web data index file

For the web data after step S100 process sets up web data index file.Web data index file be to process after web data set up index composition file.Word indexing can be set up for described web data, also can set up glossarial index for described web data.Described index is the index created based on relevant subsequent tree in binary.Described web data index file can provide service for the search of network search engines.

In Chinese search system, the linguistic unit of index can be divided into word and word two kinds, has occurred that two kinds different index storehouse strategy thus: building storehouse strategy and building storehouse strategy based on vocabulary based on word table.Storehouse strategy is built based on word table, refer to each Chinese character in source document all as the elementary cell of index, for a word table set up in each different word, have recorded the occur position of this word in different document in word table, when retrieving, the record that the individual character in inquiry string retrieves is carried out to logical multiply computing thus obtains final result set.The advantage of building storehouse based on word table mainly contains: word table small scale, recall ratio are high, and shortcoming is that the storage space that index file takies when source document quantity is large is very large.

Storehouse strategy of building based on vocabulary is exactly for index entry sets up index database with the word expressing certain language meaning.This storehouse thought of building generally needs to carry out participle operation to source document, namely the character set in source document is decomposed into the set of word, need searching character string participle equally when retrieval, then search record in each word indexed file, after asking friendship, finally obtain retrieval set.The advantage of building storehouse based on vocabulary mainly contains: retrieval rate is fast, precision ratio is high, and shortcoming is that recall ratio is not high.

One embodiment of the invention take into account above-mentioned two kinds of relative merits setting up index database.Have employed the method that two kinds are set up index database simultaneously.

On the basis of the above embodiment of the present invention, in the step s 100, first web data pre-service comprises further classifies to the original web page captured, and then extracts the text message in webpage respectively according to classification, obtains sorted text message; Web page index file is set up in each classification that the process of generating web page index file comprises for original web page respectively.In step s 200, set up web data index file to comprise further: first, judge the capacity of the text message of each classification, when the capacity of described classification is less than 1GB, for the text message of described classification sets up word indexing, when the capacity of described classification is more than or equal to 1GB, for the text message of described classification sets up glossarial index.

In above-described embodiment, in described binary, relevant subsequent tree is defined as follows:

(1) follow-up: to the character string a in text T ₁a ₂, a ₂be called a ₁follow-up, the follow-up of last character of text T is called end mark, represents by " # "; Always have identical character in text to occur, be exactly word identical specifically or word, if some index entry a have occurred k time, a has not been the ending of text, and so a has k individual follow-up, is denoted as a [s], s=1,2 ..., k;

(2) the follow-up expression formula of unitary and the follow-up tree of unitary: suppose that T is by character string a in full ₁, a ₂..., a _n, # composition, if a wherein _i1=a _i2=...=a _ikbe identical character, be designated as a, and a _i1+1, a _i2+1..., a _ik+1be the follow-up of them respectively, then all a and the follow-up of it just constitute a follow-up expression formula a (a of unitary _i1+1, a _i2+1..., a _ik+1), describe this expression formula with one tree, a is tree root, a _i1+1, a _i2+1..., a _ik+1be its follow-up node, this tree is exactly the follow-up tree of unitary of a;

(3) the follow-up expression formula of binary and the follow-up tree of binary: the follow-up expression formula of unitary is expanded, if having identical character string a in original text T _i1a _i1+1=a _i2a _i2+1=...=a _ika _ik+1, be denoted as ab, then all ab and its follow-uply just constitute a follow-up expression formula of binary, be denoted as a (b (a _i1+2, a _i2+2..., a _ik+2));

(4) the follow-up tree representation of the binary of a is: a is tree root, a _i1+1, a _i2+1..., a _ik+1the follow-up of a, (a _i1+1, tag ₁), (a _i2+1, tag ₂) ..., (a _ik+1, tag _k) then as the follow-up node of a, wherein, tag ₁, tag ₂, tag _kbe divided into is with a _i1+1, a _i2+1..., a _ik+1for a in the follow-up tree of unitary of root _i1+1, a _i2+1..., a _ik+the sequence number of the follow-up place branch of 1;

(5) definition of interior relevant subsequent tree: set the forest formed by the follow-up of all index entries of document whole in a source document storehouse, be called the interior relevant subsequent tree in this source document storehouse, when described follow-up tree is the follow-up tree of binary, this interior relevant subsequent tree is relevant subsequent tree in binary;

In above-described embodiment, the process creating index based on relevant subsequent tree in binary specifically comprises:

First stage is scan text, statistics double word frequency, namely add up the number that each character is follow-up, for when initialization is set, allocation space provides reference, concrete steps are: beginning two character A and B reading in text, if B is not the ending of text, then frequency Freq [A] [B] of double word AB is added 1, a word will be moved after word string pointer, frequency Freq [B] [C] of statistics BC, C is the follow-up of B, and circulation carries out until text ending, then in text, the frequency statistics of all double words completes, and generates the double word frequency table of the text;

The frequency being this double word of 2, ac for the frequency of this double word of text T, ab be the frequency of this double word of 2, bc for 1,

Subordinate phase is the process of establishing of relevant subsequent tree in binary, relevant subsequent tree in initialization binary, for each index entry sets up follow-up tree, scan text again, follow-up and the next number of characters all in text is inserted in its follow-up tree, until textual scan is complete, relevant subsequent tree-model in the binary built up is outputted to text and form index file, concrete steps are: first to relevant subsequent tree initialization in binary, for each character memory allocated space, next beginning two character A and B of text are read in, if B is not the ending of text, point number set by A adds one, then the positional information Pos of B and B (B) is write in A tree, backspace character string position, read next double word B and C, perform the step above a time again, until the ending of text, finally relevant subsequent tree in the binary built up is outputted in text.

Citing: for text T, the initial logic position of a, b, c is respectively: Pos (a)=1; Pos (b)=1; Pos (c)=1 (because at the beginning every tree in all only have root node node).Read in ab, the position of a is added one, namely Pos (a)=2 are (because a tree adds a branch, adding after one is the branch number obtaining next branch), then the positional information Pos (b) of b and b is write (Pos (a)-1 branch is exactly the current branch of the value to be written of a tree) in Pos (a)-1 branch of the follow-up tree of a, namely the 1st branch of (b, 1) write tree a; Next, read in character late c, then position Pos (the b)+1=2 of b, the positional information Pos (c) of c and c is write in Pos (b)-1 branch of the follow-up tree of b, namely first branch of (c, 1) write tree b; Such circulation execution is gone down, until read in the end of text ' # ' time, by (#, 0) write c follow-up tree in last branch, in the binary of the text relevant subsequent tree index set up complete.

Set up glossarial index for for text message, before setting up glossarial index, also comprise the step of text message being carried out to participle, possess and comprise: utilize dictionary for word segmentation to carry out participle operation to text and form index term set.

Dictionary for word segmentation in the present invention adopts the dictionary based on PATRICIA tree construction, and the cutting of concrete employing increment maximum matching method, namely for a word string, from the substring that length is 2, to A _ia _i+1, whether be word, if not then by A if searching in dictionary _iindependent one-tenth word, then moves a word, continues to search A after substring _i+1a _i+2; If word, then need substring length to add one, judge A _i+1a _i+2a _i+3whether be word, if not, then by A _i+1a _i+2cut out, if it is continue to add one, until search less than word.

The frequency occurred due to long word is lower, therefore this method in matching times comparatively forward subtract word maximum matching method and want much less, time performance is better.In the process of increment coupling, structure due to dictionary is PATRICIA tree construction, there is the function of prefix lookups, so when increasing one is mated, position can be compared to the binary code of added word from last of a upper word that the match is successful then to compare, without the need to again searching from tree root, so just can find most long word by the process that once travels through, the efficiency of searching has had obvious improvement again.

After participle completes, stop words is removed to the entry be syncopated as, then add up the frequency of adjacent word in entry set, generate a two word frequency meter, for the establishment of index is prepared.

There is the index of text message, just can carry out the retrieval of text based on index, the search method of text and corresponding index model tight association.

The present invention also comprises step S300 further, receives the inquiry string of user's input, retrieves according to web data index.For different index models, concrete retrieving can be different.

In one embodiment of the invention, inquiry string is decomposed into word and word respectively, is the situation of word indexing for web data index, retrieves according to described word indexing by radical; For the situation that web data index is glossarial index, retrieve according to described glossarial index by participle.

For the situation that index is relevant subsequent tree-model in binary, concrete retrieving is:

First stage, the situation being word indexing for web data index is retrieved;

The present invention also comprises step S400 further, sorts, ranking results is showed user to the result retrieved.

The present invention does not limit sort method, and various existing ordering techniques can be used to sort to ranking results.

In one embodiment of the invention, in order to improve recall precision, further, for the situation that web data index is glossarial index, using the index terms in glossarial index file as text message, based on described interior relevant subsequent tree-model, again index is set up to described index terms, thus improve effectiveness of retrieval further.

Those skilled in the art, at consideration instructions and after putting into practice invention disclosed herein, will easily expect other embodiment of the present invention.The application is intended to contain any modification of the present invention, purposes or adaptations, and these modification, purposes or adaptations are followed general principle of the present invention and comprised the undocumented common practise in the art of the present invention or conventional techniques means.

Should be understood that, the present invention is not limited to precision architecture described above and illustrated in the accompanying drawings, and can carry out various amendment and change not departing from its scope.Scope of the present invention is only limited by appended claim.

Claims

1. based on a search method for tree structure, in Chinese search engine to the process of Chinese web page data, comprising:

Step S100, web data pre-service;

Step S200, sets up web data index file;

2. as claimed in claim 1 based on the search method of tree structure, wherein, the described web data index file in step S200 be to process after the file of web data index composition set up of web data.

3., as claimed in claim 2 based on the search method of tree structure, wherein, described web data index is word indexing.

4., as claimed in claim 2 based on the search method of tree structure, wherein, described web data index is glossarial index.

5., as claimed in claim 2 based on the search method of tree structure, wherein, described web data index is word indexing and glossarial index.

6. as claimed in claim 2 based on the search method of tree structure, wherein, described web data index is the index created based on relevant subsequent tree in binary.

7., as claimed in claim 6 based on the search method of tree structure, wherein, step S100 comprises further:

8., as claimed in claim 7 based on the search method of tree structure, wherein, in step s 200, set up web data index file and comprise further:

9., as claimed in claim 8 based on the search method of tree structure, comprise further:

10., as claimed in claim 9 based on the search method of tree structure, concrete retrieving is:

First stage, the situation being word indexing for web data index is retrieved;