CN105404677A - Tree structure based retrieval method - Google Patents

Tree structure based retrieval method Download PDF

Info

Publication number
CN105404677A
CN105404677A CN201510818902.0A CN201510818902A CN105404677A CN 105404677 A CN105404677 A CN 105404677A CN 201510818902 A CN201510818902 A CN 201510818902A CN 105404677 A CN105404677 A CN 105404677A
Authority
CN
China
Prior art keywords
index
tree
word
web data
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510818902.0A
Other languages
Chinese (zh)
Other versions
CN105404677B (en
Inventor
陈虹宇
罗阳
苗宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Shenhu Technology Co.,Ltd.
Original Assignee
SICHUAN SHENHU TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SICHUAN SHENHU TECHNOLOGY Co Ltd filed Critical SICHUAN SHENHU TECHNOLOGY Co Ltd
Priority to CN201510818902.0A priority Critical patent/CN105404677B/en
Publication of CN105404677A publication Critical patent/CN105404677A/en
Application granted granted Critical
Publication of CN105404677B publication Critical patent/CN105404677B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9027Trees
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/90335Query processing
    • G06F16/90344Query processing by using string matching techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention proposes a tree structure based retrieval method. The method is used for processing Chinese web page data in Chinese searching engines. The method comprises: S100, pre-processing web page data; S200, establishing a web page data index file; and S300, receiving a query character string input by a user and carrying out retrieval according to a web page data index. According to the tree structure based retrieval method disclosed by the present invention, an intra-binary correlated subsequent tree model is adopted to establish the index for the web page data, and the advantages and disadvantages of a character index and a word index are simultaneously considered, so that when the index space is reduced, the retrieving efficiency is enhanced.

Description

A kind of search method based on tree structure
Technical field
The present invention relates to data processing field, be specifically related to a kind of search method based on tree structure.
Background technology
Along with the develop rapidly of internet, the exponential increase of information, the diversity of data mode, people are difficult in the information of Hai Liang, find the part meeting oneself demand rapidly.The appearance of full-text database, substantially improves this present situation.Full-text database, also referred to as text database, it is the system of management mass text.The work that full-text database will complete remains two large functions of traditional database: storing and retrieval, is exactly specifically the storage of text data and the retrieval of arbitrary string.Character string as search condition can be constant ocra font ocr string, also can be the string assemble that regular expression (or other modes, such as distance limit etc.) a group of representing has common trait.
More common and popular Full-Text Index Model has following several model at present: documents signed (DS) (SignatureFiles), bitmap (BitMap), inverted list (InvertedList), Σ 2matrix Pat tree and Pat array etc.These models are under the effort of brainstrust, and quite maturation is also used widely in practice.
The method extended out from title index is exactly present most widely used inverted list model.It has the advantages that to create index speed, widespread use in network search engines.But the storage space needed for it is comparatively large, and inquiry velocity is slower.Although documents signed (DS) realizes simple, be to locate the vector that a suitable hash function and width be applicable to very difficult, and different because of object.If do not chosen, then Query Result just there will be suitable uncertainty.Bitmap file index structure thinking is simple, and easy to use, time efficiency is high, especially efficient in Boolean retrieval, but its space efficiency is very low, even if employ bits compression algorithm, is still difficult to accept.The great advantage of Pat tree-model is that recall precision is very high, and especially special to model retrieval is as higher in the recall precision such as prefix search, range retrieval.But the same with bitmap models, space efficiency is extremely low, and in constructive process, space expense is larger, creates efficiency also very low.Pat array is the amendment to Pat tree, and the leaf node serialization that Pat sets just is obtained Pat array by it.Although Pat array have compressed the expense in constructive process to a great extent, because adopt the storage mode of array, it creates and merges needs mobile a large amount of data, and dynamic is difficult to satisfactory.
Summary of the invention
At least part of solution problems of the prior art, the present invention proposes a kind of search method based on tree structure, in Chinese search engine to the process of Chinese web page data, comprising:
Step S100, web data pre-service;
Step S200, sets up web data index file;
Step S300, receives the inquiry string of user's input, retrieves according to web data index.
The described search method based on tree structure, wherein, the described web data index file in step S200 be to process after web data set up web data index composition file.
The described search method based on tree structure, wherein, described web data index is word indexing.
The described search method based on tree structure, wherein, described web data index is glossarial index.
The described search method based on tree structure, wherein, described web data index is word indexing and glossarial index.
The described search method based on tree structure, wherein, described web data index is the index created based on relevant subsequent tree in binary.
The described search method based on tree structure, wherein, step S100 comprises further:
First the original web page captured is classified, and then extract the text message in webpage respectively according to classification, obtain sorted text message; Web page index file is set up in each classification that the process of generating web page index file comprises for original web page respectively.
The described search method based on tree structure, wherein, in step s 200, set up web data index file and comprise further:
First, judge the capacity of the text message of each classification, when the capacity of described classification is less than 1GB, for the text message of described classification sets up word indexing, when the capacity of described classification is more than or equal to 1GB, for the text message of described classification sets up glossarial index.
The described search method based on tree structure, comprises further:
Inquiry string is decomposed into word and word respectively, is the situation of word indexing for web data index, retrieves according to described word indexing by radical; For the situation that web data index is glossarial index, retrieve according to described glossarial index by participle.
The described search method based on tree structure, concrete retrieving is:
First stage, the situation being word indexing for web data index is retrieved;
First each word after inquiry string decomposition is sequentially read in, get first character A, for word indexing, finding with A in relevant subsequent tree in binary is the tree of root, then the next word B of branch ground matching inquiry character string one by one in the leaf of tree A, the next number of B is then added queue by the words matching B, until end is all mated in whole branches of A; Forwarding to B is the tree of root, the branch number of B tree is taken out from queue, search the next word C that corresponding leafy node comes in matched character string, circulation like this is not until once to match in matching process or inquiry string all mates end, if the match is successful, then mean the original text that have found and comprise inquiry string;
Subordinate phase, the situation being glossarial index for web data index is retrieved;
First each word after inquiry string decomposition is sequentially read in, get first word A, for glossarial index, finding with A in relevant subsequent tree in binary is the tree of root, then the next word B of branch ground matching inquiry character string one by one in the leaf of tree A, the next number of B is then added queue by the words matching B, until end is all mated in whole branches of A; Forwarding to B is the tree of root, the branch number of B tree is taken out from queue, search the next word C that corresponding leafy node comes in matched character string, circulation like this is not until once to match in matching process or inquiry string all mates end, if the match is successful, then mean the original text that have found and comprise inquiry string.
The present invention adopts relevant subsequent tree-model in binary to be that web data creates index, considers the relative merits of word indexing and glossarial index simultaneously, while minimizing index space, improves recall precision.
Accompanying drawing explanation
Fig. 1 is the process flow diagram of the search method that the present invention is based on tree structure;
Embodiment
Below in conjunction with accompanying drawing of the present invention, technical scheme of the present invention is clearly and completely described.Here will be described exemplary embodiment in detail, its sample table shows in the accompanying drawings.When description below relates to accompanying drawing, unless otherwise indicated, the same numbers in different accompanying drawing represents same or analogous key element.Embodiment described in following exemplary embodiment does not represent all embodiments consistent with the present invention.On the contrary, they only with as in appended claims describe in detail, the example of apparatus and method that aspects more of the present invention are consistent.
First, be described below using some terms in the present invention:
(1) source document storehouse: source document storehouse refers to the original web page file set captured from internet by web crawlers, being used for setting up index is that user search uses, this set is not static, according to the strategy captured or regular batch updating, or incremental update, main to ensure the timeliness n of system, capture timely by emerging webpage as far as possible and come to set up index, meet the Search Requirement of user;
(2) pre-service: pre-service refers to the process processed the web page files captured, comprise: set up index web page library, info web extracts, setting up index web page library is exactly the URL that will realize a given webpage, can find the webpage corresponding to this URL in index web page library; It is exactly from webpage, extract the text message set up required for index that info web extracts, and comprises title, text etc.;
(3) text: the set of the text message formed is extracted through the info web of pre-treatment step in source document storehouse, is the direct object setting up index;
(4) dictionary for word segmentation: dictionary for word segmentation is the basis of participle operation, is the set of Chinese vocabulary, is used for distinguishing when participle and being syncopated as the word in text;
(5) participle operation: participle operation is exactly process text dividing being become word combination, the inquiry string of pretreated text and user's input is all its operand, it correctly can match vocabulary according to dictionary for word segmentation in text-string, weed out stop words, export the set of index term or term;
(6) interior relevant subsequent tree index is set up: this operation is exactly the process of relevant subsequent tree in setting up the index term set formed after text participle;
(7) glossarial index file: glossarial index file is exactly the index file generated in previous action, in glossarial index file, each index term has a follow-up tree being root with it, contains numbering and its follow-up of index term place text in the node of tree;
(8) inquiry string: the query demand of user normally represents with the form of character string, the word that the simplest inquiry string is single often, for longer inquiry string, as phrase or sentence, carry out word segmentation processing, the segmentation methods adopted should be consistent with the segmentation methods used text, to ensure the accuracy of result for retrieval;
(9) search operaqtion: for the set of term according to the lookup algorithm of interior relevant subsequent tree-model, search in glossarial index file, the text matched number is outputted to result set;
(10) result set: result set is the text number generally referring to the inquiry string place obtained by search operaqtion, the text obtained may be a series of document, this just needs according to certain information retrieval model collection of document process, draw the collection of document that degree of correlation is maximum, present to user.
The related definition of interior relevant subsequent tree:
(1) follow-up: to the character string a in text T 1a 2, a 2be called a 1follow-up, the follow-up of last character of text T is called end mark, represents by " # ".Follow-up definition is the basis of interior relevant subsequent tree-model, always have identical character in text to occur, be exactly identical word or word specifically, if some index entry a have occurred k time, so a has just had k follow-up (a is not the ending of text), is denoted as a [s], s=1,2 ..., k.
(2) the follow-up expression formula of unitary and the follow-up tree of unitary: suppose that T is by character string a in full 1, a 2..., a n, # composition, if a wherein i1=a i2=...=a ikbe identical character, be designated as a, and a i1+1, a i2+1..., a ik+1be the follow-up of them respectively, then all a and the follow-up of it just constitute a follow-up expression formula a (a of unitary i1+1, a i2+1..., a ik+1), describe this expression formula with one tree, a is tree root, a i1+1, a i2+1..., a ik+1be its follow-up node, this tree just becomes the follow-up tree of unitary of a.
(3) the follow-up expression formula of binary and the follow-up tree of binary: the follow-up expression formula of unitary is expanded, if having identical character string a in original text T i1a i1+1=a i2a i2+1=...=a ika ik+1, be denoted as ab, then all ab and its follow-uply just constitute a follow-up expression formula of binary, be denoted as a (b (a i1+2, a i2+2..., a ik+2)).
Citing: (text in the present invention's citing is all for this text in text abcbacabacc#, called after T), ab character string has occurred 2 times, then its text library exists a follow-up expression formula of binary and is: a (b (c, a)).This expression formula is represented, by a with tree i1+2, a i2+2..., a ik+2these are follow-up represents by its sequence number of their place branches in the follow-up tree of unitary taking b as root, can change into: a (b (1,3)) for the follow-up expression formula of binary upper example.The follow-up tree representation of binary of a is: a is tree root, a i1+1, a i2+1..., a ik+1the follow-up of a, (a i1+1, tag 1), (a i2+1, tag 2) ..., (a ik+1, tag k) then as the follow-up node of a, wherein, tag 1, tag 2, tag kbe divided into is with a i1+1, a i2+1..., a ik+1for a in the follow-up tree of unitary of root i1+1, a i2+1..., a ik+1the sequence number of follow-up place branch.Be followed successively by b, c, b, c for above-mentioned the follow-up of text T, a, corresponding follow-up node is followed successively by (b, 1), (c, 2), (b, 3), (c, 3).
The definition of interior relevant subsequent tree: set the forest formed by the follow-up of all index entries of document whole in a source document storehouse, be called the interior relevant subsequent tree in this source document storehouse, when described follow-up tree is the follow-up tree of binary, this interior relevant subsequent tree is relevant subsequent tree in binary.
See Fig. 1, the present invention propose a kind of search method based on tree structure, in Chinese search engine to the process of Chinese web page data, comprising:
Step S100, web data pre-service;
(1) extract the text message in webpage, generate corresponding text and text is numbered.This process groundwork is extracted by the text message in web page files, generates the document being used for networking page data index.
(2) generating web page index file.
A web page index file is needed to carry out the relevant information of recording text numbering and its source page, as information such as web page title, URL and webpage sizes, its effect has two aspects: one is setting up the follow-up web data index stage, the numbering recording document is only needed in follow-up tree node, instead of tediously long filename; Two is the URL information can finding out source web page in retrieval phase according to text numbering from web page index file, thus outputs in a particular manner on result display page.
(3) punctuation mark in text is removed, make text become the set of short character strings.
The input of this step is the web page files grabbed, and output is the text of word to be cut.
Step S200, sets up web data index file
For the web data after step S100 process sets up web data index file.Web data index file be to process after web data set up index composition file.Word indexing can be set up for described web data, also can set up glossarial index for described web data.Described index is the index created based on relevant subsequent tree in binary.Described web data index file can provide service for the search of network search engines.
In Chinese search system, the linguistic unit of index can be divided into word and word two kinds, has occurred that two kinds different index storehouse strategy thus: building storehouse strategy and building storehouse strategy based on vocabulary based on word table.Storehouse strategy is built based on word table, refer to each Chinese character in source document all as the elementary cell of index, for a word table set up in each different word, have recorded the occur position of this word in different document in word table, when retrieving, the record that the individual character in inquiry string retrieves is carried out to logical multiply computing thus obtains final result set.The advantage of building storehouse based on word table mainly contains: word table small scale, recall ratio are high, and shortcoming is that the storage space that index file takies when source document quantity is large is very large.
Storehouse strategy of building based on vocabulary is exactly for index entry sets up index database with the word expressing certain language meaning.This storehouse thought of building generally needs to carry out participle operation to source document, namely the character set in source document is decomposed into the set of word, need searching character string participle equally when retrieval, then search record in each word indexed file, after asking friendship, finally obtain retrieval set.The advantage of building storehouse based on vocabulary mainly contains: retrieval rate is fast, precision ratio is high, and shortcoming is that recall ratio is not high.
One embodiment of the invention take into account above-mentioned two kinds of relative merits setting up index database.Have employed the method that two kinds are set up index database simultaneously.
On the basis of the above embodiment of the present invention, in the step s 100, first web data pre-service comprises further classifies to the original web page captured, and then extracts the text message in webpage respectively according to classification, obtains sorted text message; Web page index file is set up in each classification that the process of generating web page index file comprises for original web page respectively.In step s 200, set up web data index file to comprise further: first, judge the capacity of the text message of each classification, when the capacity of described classification is less than 1GB, for the text message of described classification sets up word indexing, when the capacity of described classification is more than or equal to 1GB, for the text message of described classification sets up glossarial index.
In above-described embodiment, in described binary, relevant subsequent tree is defined as follows:
(1) follow-up: to the character string a in text T 1a 2, a 2be called a 1follow-up, the follow-up of last character of text T is called end mark, represents by " # "; Always have identical character in text to occur, be exactly word identical specifically or word, if some index entry a have occurred k time, a has not been the ending of text, and so a has k individual follow-up, is denoted as a [s], s=1,2 ..., k;
(2) the follow-up expression formula of unitary and the follow-up tree of unitary: suppose that T is by character string a in full 1, a 2..., a n, # composition, if a wherein i1=a i2=...=a ikbe identical character, be designated as a, and a i1+1, a i2+1..., a ik+1be the follow-up of them respectively, then all a and the follow-up of it just constitute a follow-up expression formula a (a of unitary i1+1, a i2+1..., a ik+1), describe this expression formula with one tree, a is tree root, a i1+1, a i2+1..., a ik+1be its follow-up node, this tree is exactly the follow-up tree of unitary of a;
(3) the follow-up expression formula of binary and the follow-up tree of binary: the follow-up expression formula of unitary is expanded, if having identical character string a in original text T i1a i1+1=a i2a i2+1=...=a ika ik+1, be denoted as ab, then all ab and its follow-uply just constitute a follow-up expression formula of binary, be denoted as a (b (a i1+2, a i2+2..., a ik+2));
(4) the follow-up tree representation of the binary of a is: a is tree root, a i1+1, a i2+1..., a ik+1the follow-up of a, (a i1+1, tag 1), (a i2+1, tag 2) ..., (a ik+1, tag k) then as the follow-up node of a, wherein, tag 1, tag 2, tag kbe divided into is with a i1+1, a i2+1..., a ik+1for a in the follow-up tree of unitary of root i1+1, a i2+1..., a ik+the sequence number of the follow-up place branch of 1;
(5) definition of interior relevant subsequent tree: set the forest formed by the follow-up of all index entries of document whole in a source document storehouse, be called the interior relevant subsequent tree in this source document storehouse, when described follow-up tree is the follow-up tree of binary, this interior relevant subsequent tree is relevant subsequent tree in binary;
In above-described embodiment, the process creating index based on relevant subsequent tree in binary specifically comprises:
First stage is scan text, statistics double word frequency, namely add up the number that each character is follow-up, for when initialization is set, allocation space provides reference, concrete steps are: beginning two character A and B reading in text, if B is not the ending of text, then frequency Freq [A] [B] of double word AB is added 1, a word will be moved after word string pointer, frequency Freq [B] [C] of statistics BC, C is the follow-up of B, and circulation carries out until text ending, then in text, the frequency statistics of all double words completes, and generates the double word frequency table of the text;
The frequency being this double word of 2, ac for the frequency of this double word of text T, ab be the frequency of this double word of 2, bc for 1,
Subordinate phase is the process of establishing of relevant subsequent tree in binary, relevant subsequent tree in initialization binary, for each index entry sets up follow-up tree, scan text again, follow-up and the next number of characters all in text is inserted in its follow-up tree, until textual scan is complete, relevant subsequent tree-model in the binary built up is outputted to text and form index file, concrete steps are: first to relevant subsequent tree initialization in binary, for each character memory allocated space, next beginning two character A and B of text are read in, if B is not the ending of text, point number set by A adds one, then the positional information Pos of B and B (B) is write in A tree, backspace character string position, read next double word B and C, perform the step above a time again, until the ending of text, finally relevant subsequent tree in the binary built up is outputted in text.
Citing: for text T, the initial logic position of a, b, c is respectively: Pos (a)=1; Pos (b)=1; Pos (c)=1 (because at the beginning every tree in all only have root node node).Read in ab, the position of a is added one, namely Pos (a)=2 are (because a tree adds a branch, adding after one is the branch number obtaining next branch), then the positional information Pos (b) of b and b is write (Pos (a)-1 branch is exactly the current branch of the value to be written of a tree) in Pos (a)-1 branch of the follow-up tree of a, namely the 1st branch of (b, 1) write tree a; Next, read in character late c, then position Pos (the b)+1=2 of b, the positional information Pos (c) of c and c is write in Pos (b)-1 branch of the follow-up tree of b, namely first branch of (c, 1) write tree b; Such circulation execution is gone down, until read in the end of text ' # ' time, by (#, 0) write c follow-up tree in last branch, in the binary of the text relevant subsequent tree index set up complete.
Set up glossarial index for for text message, before setting up glossarial index, also comprise the step of text message being carried out to participle, possess and comprise: utilize dictionary for word segmentation to carry out participle operation to text and form index term set.
Dictionary for word segmentation in the present invention adopts the dictionary based on PATRICIA tree construction, and the cutting of concrete employing increment maximum matching method, namely for a word string, from the substring that length is 2, to A ia i+1, whether be word, if not then by A if searching in dictionary iindependent one-tenth word, then moves a word, continues to search A after substring i+1a i+2; If word, then need substring length to add one, judge A i+1a i+2a i+3whether be word, if not, then by A i+1a i+2cut out, if it is continue to add one, until search less than word.
The frequency occurred due to long word is lower, therefore this method in matching times comparatively forward subtract word maximum matching method and want much less, time performance is better.In the process of increment coupling, structure due to dictionary is PATRICIA tree construction, there is the function of prefix lookups, so when increasing one is mated, position can be compared to the binary code of added word from last of a upper word that the match is successful then to compare, without the need to again searching from tree root, so just can find most long word by the process that once travels through, the efficiency of searching has had obvious improvement again.
After participle completes, stop words is removed to the entry be syncopated as, then add up the frequency of adjacent word in entry set, generate a two word frequency meter, for the establishment of index is prepared.
There is the index of text message, just can carry out the retrieval of text based on index, the search method of text and corresponding index model tight association.
The present invention also comprises step S300 further, receives the inquiry string of user's input, retrieves according to web data index.For different index models, concrete retrieving can be different.
In one embodiment of the invention, inquiry string is decomposed into word and word respectively, is the situation of word indexing for web data index, retrieves according to described word indexing by radical; For the situation that web data index is glossarial index, retrieve according to described glossarial index by participle.
For the situation that index is relevant subsequent tree-model in binary, concrete retrieving is:
First stage, the situation being word indexing for web data index is retrieved;
First each word after inquiry string decomposition is sequentially read in, get first character A, for word indexing, finding with A in relevant subsequent tree in binary is the tree of root, then the next word B of branch ground matching inquiry character string one by one in the leaf of tree A, the next number of B is then added queue by the words matching B, until end is all mated in whole branches of A; Forwarding to B is the tree of root, the branch number of B tree is taken out from queue, search the next word C that corresponding leafy node comes in matched character string, circulation like this is not until once to match in matching process or inquiry string all mates end, if the match is successful, then mean the original text that have found and comprise inquiry string;
Subordinate phase, the situation being glossarial index for web data index is retrieved;
First each word after inquiry string decomposition is sequentially read in, get first word A, for glossarial index, finding with A in relevant subsequent tree in binary is the tree of root, then the next word B of branch ground matching inquiry character string one by one in the leaf of tree A, the next number of B is then added queue by the words matching B, until end is all mated in whole branches of A; Forwarding to B is the tree of root, the branch number of B tree is taken out from queue, search the next word C that corresponding leafy node comes in matched character string, circulation like this is not until once to match in matching process or inquiry string all mates end, if the match is successful, then mean the original text that have found and comprise inquiry string.
The present invention also comprises step S400 further, sorts, ranking results is showed user to the result retrieved.
The present invention does not limit sort method, and various existing ordering techniques can be used to sort to ranking results.
In one embodiment of the invention, in order to improve recall precision, further, for the situation that web data index is glossarial index, using the index terms in glossarial index file as text message, based on described interior relevant subsequent tree-model, again index is set up to described index terms, thus improve effectiveness of retrieval further.
The present invention adopts relevant subsequent tree-model in binary to be that web data creates index, considers the relative merits of word indexing and glossarial index simultaneously, while minimizing index space, improves recall precision.
Those skilled in the art, at consideration instructions and after putting into practice invention disclosed herein, will easily expect other embodiment of the present invention.The application is intended to contain any modification of the present invention, purposes or adaptations, and these modification, purposes or adaptations are followed general principle of the present invention and comprised the undocumented common practise in the art of the present invention or conventional techniques means.
Should be understood that, the present invention is not limited to precision architecture described above and illustrated in the accompanying drawings, and can carry out various amendment and change not departing from its scope.Scope of the present invention is only limited by appended claim.

Claims (10)

1. based on a search method for tree structure, in Chinese search engine to the process of Chinese web page data, comprising:
Step S100, web data pre-service;
Step S200, sets up web data index file;
Step S300, receives the inquiry string of user's input, retrieves according to web data index.
2. as claimed in claim 1 based on the search method of tree structure, wherein, the described web data index file in step S200 be to process after the file of web data index composition set up of web data.
3., as claimed in claim 2 based on the search method of tree structure, wherein, described web data index is word indexing.
4., as claimed in claim 2 based on the search method of tree structure, wherein, described web data index is glossarial index.
5., as claimed in claim 2 based on the search method of tree structure, wherein, described web data index is word indexing and glossarial index.
6. as claimed in claim 2 based on the search method of tree structure, wherein, described web data index is the index created based on relevant subsequent tree in binary.
7., as claimed in claim 6 based on the search method of tree structure, wherein, step S100 comprises further:
First the original web page captured is classified, and then extract the text message in webpage respectively according to classification, obtain sorted text message; Web page index file is set up in each classification that the process of generating web page index file comprises for original web page respectively.
8., as claimed in claim 7 based on the search method of tree structure, wherein, in step s 200, set up web data index file and comprise further:
First, judge the capacity of the text message of each classification, when the capacity of described classification is less than 1GB, for the text message of described classification sets up word indexing, when the capacity of described classification is more than or equal to 1GB, for the text message of described classification sets up glossarial index.
9., as claimed in claim 8 based on the search method of tree structure, comprise further:
Inquiry string is decomposed into word and word respectively, is the situation of word indexing for web data index, retrieves according to described word indexing by radical; For the situation that web data index is glossarial index, retrieve according to described glossarial index by participle.
10., as claimed in claim 9 based on the search method of tree structure, concrete retrieving is:
First stage, the situation being word indexing for web data index is retrieved;
First each word after inquiry string decomposition is sequentially read in, get first character A, for word indexing, finding with A in relevant subsequent tree in binary is the tree of root, then the next word B of branch ground matching inquiry character string one by one in the leaf of tree A, the next number of B is then added queue by the words matching B, until end is all mated in whole branches of A; Forwarding to B is the tree of root, the branch number of B tree is taken out from queue, search the next word C that corresponding leafy node comes in matched character string, circulation like this is not until once to match in matching process or inquiry string all mates end, if the match is successful, then mean the original text that have found and comprise inquiry string;
Subordinate phase, the situation being glossarial index for web data index is retrieved;
First each word after inquiry string decomposition is sequentially read in, get first word A, for glossarial index, finding with A in relevant subsequent tree in binary is the tree of root, then the next word B of branch ground matching inquiry character string one by one in the leaf of tree A, the next number of B is then added queue by the words matching B, until end is all mated in whole branches of A; Forwarding to B is the tree of root, the branch number of B tree is taken out from queue, search the next word C that corresponding leafy node comes in matched character string, circulation like this is not until once to match in matching process or inquiry string all mates end, if the match is successful, then mean the original text that have found and comprise inquiry string.
CN201510818902.0A 2015-11-20 2015-11-20 A kind of search method based on tree structure Active CN105404677B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510818902.0A CN105404677B (en) 2015-11-20 2015-11-20 A kind of search method based on tree structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510818902.0A CN105404677B (en) 2015-11-20 2015-11-20 A kind of search method based on tree structure

Publications (2)

Publication Number Publication Date
CN105404677A true CN105404677A (en) 2016-03-16
CN105404677B CN105404677B (en) 2018-12-18

Family

ID=55470166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510818902.0A Active CN105404677B (en) 2015-11-20 2015-11-20 A kind of search method based on tree structure

Country Status (1)

Country Link
CN (1) CN105404677B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930546A (en) * 2016-07-08 2016-09-07 北京北大英华科技有限公司 File association display method
CN107239454A (en) * 2016-03-28 2017-10-10 福建天晴数码有限公司 Search method and system based on text database
CN107807886A (en) * 2016-09-09 2018-03-16 质子世界国际公司 Index management in flash memories
CN108241695A (en) * 2016-12-26 2018-07-03 北京国双科技有限公司 Information processing method and device
CN108694229A (en) * 2017-04-10 2018-10-23 富士通株式会社 String data analytical equipment and string data analysis method
CN112612845A (en) * 2020-12-22 2021-04-06 中国建设银行股份有限公司 Method and device for realizing organization mechanism view, electronic equipment and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030167260A1 (en) * 2001-04-24 2003-09-04 Takahiro Nakamura Retrieval system of secondary data added documents in database, and program
CN1873642A (en) * 2006-04-29 2006-12-06 上海世纪互联信息系统有限公司 Searching engine with automating sorting function
CN101553810A (en) * 2006-08-10 2009-10-07 夏普株式会社 Character converting device and character converting device control method

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030167260A1 (en) * 2001-04-24 2003-09-04 Takahiro Nakamura Retrieval system of secondary data added documents in database, and program
CN1873642A (en) * 2006-04-29 2006-12-06 上海世纪互联信息系统有限公司 Searching engine with automating sorting function
CN101553810A (en) * 2006-08-10 2009-10-07 夏普株式会社 Character converting device and character converting device control method

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107239454A (en) * 2016-03-28 2017-10-10 福建天晴数码有限公司 Search method and system based on text database
CN105930546A (en) * 2016-07-08 2016-09-07 北京北大英华科技有限公司 File association display method
CN105930546B (en) * 2016-07-08 2020-04-03 北京北大英华科技有限公司 File association display method
CN107807886A (en) * 2016-09-09 2018-03-16 质子世界国际公司 Index management in flash memories
CN107807886B (en) * 2016-09-09 2021-07-02 质子世界国际公司 Index management in flash memory
CN108241695A (en) * 2016-12-26 2018-07-03 北京国双科技有限公司 Information processing method and device
CN108241695B (en) * 2016-12-26 2021-11-02 北京国双科技有限公司 Information processing method and device
CN108694229A (en) * 2017-04-10 2018-10-23 富士通株式会社 String data analytical equipment and string data analysis method
CN108694229B (en) * 2017-04-10 2022-06-03 富士通株式会社 String data analysis device and string data analysis method
CN112612845A (en) * 2020-12-22 2021-04-06 中国建设银行股份有限公司 Method and device for realizing organization mechanism view, electronic equipment and readable storage medium

Also Published As

Publication number Publication date
CN105404677B (en) 2018-12-18

Similar Documents

Publication Publication Date Title
CN108415902B (en) Named entity linking method based on search engine
CN105404677A (en) Tree structure based retrieval method
CN104199965B (en) Semantic information retrieval method
US20150178273A1 (en) Unsupervised Relation Detection Model Training
CN101794307A (en) Vehicle navigation POI (Point of Interest) search engine based on internetwork word segmentation idea
CN103678576A (en) Full-text retrieval system based on dynamic semantic analysis
CN103823838A (en) Method for inputting and comparing multi-format documents
CN107844493B (en) File association method and system
CN107943919B (en) A kind of enquiry expanding method of session-oriented formula entity search
CN110083683B (en) Entity semantic annotation method based on random walk
Zu et al. Resume information extraction with a novel text block segmentation algorithm
CN105912662A (en) Coreseek-based vertical search engine research and optimization method
CN110879834A (en) Viewpoint retrieval system based on cyclic convolution network and viewpoint retrieval method thereof
CN115563313A (en) Knowledge graph-based document book semantic retrieval system
US11741064B2 (en) Fuzzy search using field-level deletion neighborhoods
Jain et al. Context sensitive text summarization using k means clustering algorithm
CN105224624A (en) A kind of method and apparatus realizing down the quick merger of row chain
CN109783599A (en) Knowledge mapping search method and system based on multi storage
CN111143400B (en) Full stack type retrieval method, system, engine and electronic equipment
CN107229714B (en) Full-text search engine based on distributed database
CN105426490A (en) Tree structure based indexing method
CN117235199A (en) Information intelligent matching retrieval method based on document tree
CN101576877A (en) Fast word segmentation realization method
CN114036371A (en) Search term recommendation method, device, equipment and computer-readable storage medium
CN102508920A (en) Information retrieval method based on Boosting sorting algorithm

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230606

Address after: F13, Building 11, Zone D, New Economic Industrial Park, No. 99, West Section of Hupan Road, Xinglong Street, Tianfu New District, Chengdu, Sichuan, 610000

Patentee after: Sichuan Shenhu Technology Co.,Ltd.

Address before: No. 5, 1st Floor, Unit 1, Building 19, No. 177, Middle Section of Tianfu Avenue, High tech Zone, Chengdu, Sichuan, 610043

Patentee before: SICHUAN CINGHOO TECHNOLOGY Co.,Ltd.

TR01 Transfer of patent right