Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 1 shows a flowchart of a method for implementing related search according to an embodiment of the present invention, comprising:
Step S10: obtaining a search query string input by a user;
Step S20: segmenting the search query string into a plurality of query words;
Step S30: obtaining, from an inverted index, an ordered related-search list for each query word;
Step S40: returning the ordered related-search lists of the query words to the user.
Because this method uses an inverted index structure, it can provide related searches to the user efficiently.
Preferably, step S20 comprises: building a segmentation dictionary in advance; matching the search query string against the segmentation dictionary using the bidirectional maximum matching method; and resolving the parts where the two matching results disagree by means of a disambiguation dictionary. This embodiment is readily implemented by computer programming. Chinese word segmentation turns the search request into several concrete query words, which facilitates subsequent processing.
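The bidirectional maximum matching in step S20 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names, the plain Python set used as the segmentation dictionary, and the `disambiguate` callback standing in for the disambiguation dictionary are all assumptions. When no callback is given, the sketch falls back to preferring the result with fewer words, which is a common heuristic and is not stated in the text.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the scan can advance.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]


def segment(text, dictionary, disambiguate=None):
    """Run both scans; resolve disagreement with the (hypothetical) callback."""
    max_len = max(map(len, dictionary), default=1)
    fwd = forward_max_match(text, dictionary, max_len)
    bwd = backward_max_match(text, dictionary, max_len)
    if fwd == bwd:
        return fwd
    if disambiguate is not None:
        return disambiguate(fwd, bwd)
    # Assumed fallback heuristic: prefer the segmentation with fewer words.
    return fwd if len(fwd) <= len(bwd) else bwd
```

The sketch passes the two complete segmentations to the callback, whereas the text applies disambiguation only to the inconsistent part; narrowing the disagreement to a sub-span is left out for brevity.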
Preferably, the segmentation dictionary is built using a double-array algorithm, which specifically comprises:
1) create a Trie tree from the entries of a resource dictionary; for example, the dictionary creation module of the commercially available Founder Zhisi intelligent analysis system, version 4.1, can be used to create the Trie tree;
2) create the double array, which consists of two integer arrays, base[] and check[];
3) add the child nodes of the root node of the Trie tree to a queue, the elements of the queue being sorted in descending order of the number of child nodes each has;
4) take the first element out of the queue;
5) if this element has child nodes, then for the character values B1, B2, ..., Bn of all its child nodes, find a value H such that check[H+B1], check[H+B2], ..., check[H+Bn] are all 0; once such an H is found, set base[i] = H and check[H+B1] = check[H+B2] = ... = check[H+Bn] = i, where H+B1, H+B2, ..., H+Bn are the subscript positions of the child nodes in the double array and i is the subscript position of their parent node; if this element is the last character of an entry, set base[i] = -H instead;
6) if this element has no child nodes, set base[i] = -i;
7) repeat steps 4) to 6) until all elements have been taken out of the queue;
8) form the segmentation dictionary from the double array.
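The construction steps above can be sketched in Python as shown below. This is an illustrative sketch under several assumptions that are not in the text: the root is placed at slot 1 (so that a check value of 0 can unambiguously mean "free"), the root itself is pushed through the queue first so that its base value is set, a heap keyed on child count replaces the sorted queue, and `ord` is used as the character value. The names `TrieNode`, `build_trie`, `build_double_array` and `contains` are hypothetical.

```python
import heapq
from itertools import count


class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a dictionary entry ends at this node


def build_trie(entries):
    root = TrieNode()
    for entry in entries:
        node = root
        for ch in entry:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root


ROOT = 1  # assumption: root lives at slot 1; check[ROOT] = -1 marks it as used


def build_double_array(entries, code=ord):
    """Return (base, check); `code` must map characters to positive integers."""
    root = build_trie(entries)
    base, check = [0, 0], [0, -1]
    tick = count()  # tie-breaker so TrieNode objects are never compared
    heap = [(-len(root.children), next(tick), ROOT, root)]
    while heap:
        # Elements with the most children are taken first (steps 3 and 4).
        _, _, i, node = heapq.heappop(heap)
        if not node.children:
            base[i] = -i                      # step 6: no child nodes
            continue
        codes = [code(ch) for ch in node.children]
        h = 1
        while True:                           # step 5: search for a free H
            need = h + max(codes) + 1
            if need > len(check):
                base.extend([0] * (need - len(base)))
                check.extend([0] * (need - len(check)))
            if all(check[h + b] == 0 for b in codes):
                break
            h += 1
        base[i] = -h if node.is_end else h    # negative base marks an entry end
        for ch, child in node.children.items():
            slot = h + code(ch)
            check[slot] = i
            heapq.heappush(heap, (-len(child.children), next(tick), slot, child))
    return base, check


def contains(base, check, word, code=ord):
    """Walk the double array; an entry is present if the final base is negative."""
    s = ROOT
    for ch in word:
        t = abs(base[s]) + code(ch)
        if t >= len(check) or check[t] != s:
            return False
        s = t
    return base[s] < 0
```

With `ord` as the character value the arrays become sparse for large alphabets; a production dictionary would normally map characters to small consecutive codes first.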
A dictionary stored as a double array makes it possible to determine quickly whether an entry is in the dictionary and, if it is, to obtain its subscript position. For example, the entry "2345" is retrieved as follows: base[0]+2 = 2; check whether base[2] is negative; it is not, so base[2]+3 = 2+3 = 5; check whether base[5] is negative; it is not, so base[5]+4 = 5+4 = 9; check whether base[9] is negative; it is not, so base[9]+5 = 9+5 = 14; check whether base[14] is negative; it is negative and equal to -14, which shows that the word "2345" has been found and that the subscript at which the word ends is 14. The values of base[2], base[3], base[4] and base[5] are assumed in this example.
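The retrieval of "2345" can be replayed with a few lines of Python. The base values below are the ones implied by the arithmetic in the example (base[0] = 0, base[2] = 2, base[5] = 5, base[9] = 9, base[14] = -14); storing them in a sparse dict, using each digit's integer value as its character value, and the function name `find_entry` are assumptions made for this sketch. The check[] verification is omitted because the example does not use it.

```python
# Base values implied by the worked example; all other slots are left out.
base = {0: 0, 2: 2, 5: 5, 9: 9, 14: -14}


def find_entry(entry, base):
    """Return the subscript where `entry` ends, or None if it is not found."""
    state = 0
    for ch in entry:
        # Next subscript = base of the current state + character value.
        state = abs(base[state]) + int(ch)
        if state not in base:
            return None
    # A negative base value marks the end of an entry.
    return state if base[state] < 0 else None


print(find_entry("2345", base))  # subscript of the end of the entry
print(find_entry("234", base))   # a prefix only, so not an entry
```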
A Trie tree, also called a dictionary tree or word lookup tree, is a tree structure used to store a large number of character strings. Fig. 2 shows a schematic diagram of a Trie tree according to an embodiment of the present invention. Its basic properties are:
1. The root node contains no character; every node other than the root contains exactly one character.
2. Concatenating the characters along the path from the root node to a given node yields the character string corresponding to that node.
3. The characters contained in the child nodes of any one node are all different.
This embodiment is readily implemented by computer programming.
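The three properties can be seen in a small dictionary-based sketch. The nested-dict representation, the `END` marker key and the function names are assumptions chosen for brevity, not part of the embodiment.

```python
END = ""  # marker key; it cannot collide with a single-character key


def make_trie(entries):
    """Build a Trie as nested dicts: each key is one character (property 1)."""
    root = {}
    for entry in entries:
        node = root
        for ch in entry:
            # Dict keys are unique, so sibling characters differ (property 3).
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def strings_in(node, prefix=""):
    """Recover stored strings by joining characters along each path (property 2)."""
    found = []
    for key, child in node.items():
        if key == END:
            found.append(prefix)
        else:
            found.extend(strings_in(child, prefix + key))
    return found
```

For the entries "tea", "ten" and "to", the root has the single child "t", and walking every path returns the three original strings.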
Preferably, the method further comprises creating the inverted index, which comprises a plurality of entries, each entry comprising an attribute value and the addresses of the records that have that attribute value, wherein each attribute value is a query word and each record is a search query string obtained on one occasion. Fig. 3 shows a schematic diagram of the inverted index according to an embodiment of the present invention. This embodiment is readily implemented by computer programming.
An inverted index, also commonly called a reverse index, postings file or inverted file, is an indexing method used in full-text search to store a mapping from a word to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems. There are two different forms of inverted index:
A record-level inverted index (or inverted file index) contains, for each word, a list of the documents that reference it.
A word-level inverted index (or full inverted index) additionally contains the position of each word within a document. The latter form offers more functionality (such as phrase search) but needs more time and space to create.
Taking English as an example, the texts to be indexed are:
T0="it is what it is"
T1="what is it"
T2="it is a banana"
From these we obtain the following inverted file index:
"a":{2}
"banana":{2}
"is":{0,1,2}
"it":{0,1,2}
"what":{0,1}
A search for the terms "what", "is" and "it" then corresponds to the set {0,1} ∩ {0,1,2} ∩ {0,1,2} = {0,1}.
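The record-level index and the three-term search can be reproduced with a short Python sketch. Splitting on whitespace and the function name `build_inverted_file_index` are assumptions made for this illustration.

```python
from collections import defaultdict

texts = ["it is what it is", "what is it", "it is a banana"]


def build_inverted_file_index(docs):
    """Map each word to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.split():
            index[word].add(doc_id)
    return index


index = build_inverted_file_index(texts)
# Documents containing all three terms: intersect the three posting sets.
hits = index["what"] & index["is"] & index["it"]
print(sorted(hits))
```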
For the same texts, we obtain the full inverted index below, in which each pair consists of a document number and the position of the word within that document. Like the document numbers, the word positions start from zero. Thus "banana": {(2,3)} means that the word "banana" occurs in the third document (T2) and is the fourth word of that document (position 3).
"a":{(2,2)}
"banana":{(2,3)}
"is":{(0,1),(0,4),(1,1),(2,1)}
"it":{(0,0),(0,3),(1,2),(2,0)}
"what":{(0,2),(1,0)}
If the phrase search "what is it" is executed, all the words of the phrase are found in both document 0 and document 1, but they occur consecutively only in document 1.
Applying the inverted index data structure is an important part of a typical search engine retrieval algorithm. One goal of a search engine implementation is to optimize query speed, that is, to find the documents in which a given word occurs. A forward index, which stores the list of words for each document, is developed first and is then inverted to produce the inverted index. Querying the forward index would require iterating sequentially over every document and every word in it to verify a match.
In practice, given the limits on time, memory and processing resources, querying a forward index in this way is often not technically feasible. To replace the per-document word lists of the forward index, the inverted index data structure, which lists the documents containing each word, was developed. Once the inverted index has been created, a query can be resolved quickly by jumping directly to the word in the index (via random access). Random access is generally considered faster than sequential access.
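The full inverted index and the phrase search can be sketched as follows. The function names and the whitespace tokenization are assumptions; the sketch keeps a candidate start position only while every later word of the phrase is found at the expected offset in the same document.

```python
from collections import defaultdict

texts = ["it is what it is", "what is it", "it is a banana"]


def build_full_index(docs):
    """Map each word to a list of (document number, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.split()):
            index[word].append((doc_id, pos))
    return index


def phrase_search(index, phrase):
    """Return the documents in which the words of `phrase` occur consecutively."""
    words = phrase.split()
    if not words:
        return set()
    starts = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        postings = set(index.get(word, []))
        # Keep a start only if this word sits `offset` positions after it.
        starts = {(d, p) for (d, p) in starts if (d, p + offset) in postings}
    return {d for d, _ in starts}


full_index = build_full_index(texts)
print(phrase_search(full_index, "what is it"))
```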
Preferably, step S30 comprises: matching the current query word against the inverted index; obtaining the search query string of each record from the addresses corresponding to the matched attribute value, the obtained strings forming the related-search list; and sorting the related-search list according to the attributes of each search query string. This embodiment is readily implemented by computer programming.
Preferably, sorting the related-search list according to the attributes of each search query string comprises: scoring each search query string in the related-search list as Score = α*T + β*H + γ*D + δ*M, and sorting the list by score; wherein T is the total number of times the search query string has been input, H is the number of times it has been input within one hour, D is the number of times it has been input within one day, M is the number of times it has been input within one month, and α, β, γ and δ are the weights set for T, H, D and M respectively. This embodiment is readily implemented by computer programming.
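The scoring and ranking can be sketched as below. The `QueryStats` container, the default weights of 1 and the descending sort order are assumptions made for illustration; the text defines only the formula itself.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    total: int       # T: total number of inputs
    last_hour: int   # H: inputs within one hour
    last_day: int    # D: inputs within one day
    last_month: int  # M: inputs within one month


def score(s, alpha=1, beta=1, gamma=1, delta=1):
    """Score = alpha*T + beta*H + gamma*D + delta*M."""
    return (alpha * s.total + beta * s.last_hour
            + gamma * s.last_day + delta * s.last_month)


def rank(candidates, stats, **weights):
    """Sort related-search strings by score, highest first (assumed order)."""
    return sorted(candidates, key=lambda q: score(stats[q], **weights),
                  reverse=True)
```

Raising β relative to α favors search strings that are popular right now over those with a large historical total, which is the usual reason for weighting the time windows separately.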
Preferably, the method further comprises constructing a related-search feedback dictionary from a double array and an attribute array, and obtaining T, H, D and M from the related-search feedback dictionary, wherein creating the double array specifically comprises:
1) create a Trie tree from the search query strings;
2) create the initial double array, which consists of two integer arrays, base[] and check[];
3) add the child nodes of the root node of the Trie tree to a queue, the elements of the queue being sorted in descending order of the number of child nodes each has;
4) take the first element out of the queue;
5) if this element has child nodes, then for the character values B1, B2, ..., Bn of all its child nodes, find a value H such that check[H+B1], check[H+B2], ..., check[H+Bn] are all 0; once such an H is found, set base[i] = H and check[H+B1] = check[H+B2] = ... = check[H+Bn] = i, where H+B1, H+B2, ..., H+Bn are the subscript positions of the child nodes in the double array and i is the subscript position of their parent node; if this element is the last character of an entry, set base[i] = -H instead;
6) if this element has no child nodes, set base[i] = -i;
7) repeat steps 4) to 6) until all elements have been taken out of the queue.
Creating the attribute array comprises:
1) collecting the historical frequency attributes T, H, D and M of the query words;
2) creating the initial attribute array, whose length equals the length of the double array;
3) traversing all search query strings and, for each one, returning its array subscript i in the double array;
4) assigning to attribute array element [i] a pointer to the corresponding T, H, D and M.
This embodiment is readily implemented by computer programming. While rebuilding the dictionary with the double array, this embodiment adds a parallel attribute array; "parallel" means that the added array has the same length as the double array. When the double array is used to determine whether an entry exists, the related attributes are obtained from the added parallel array at the same subscript position.
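The parallel attribute array can be illustrated with a minimal sketch. Here the `subscripts` mapping is a stand-in for the double-array lookup that returns each search string's end subscript, and the tuple layout (T, H, D, M), the example values and the function name are assumptions.

```python
def build_attribute_array(subscripts, stats, length):
    """Create an array as long as the double array and fill in the attributes.

    subscripts: search string -> subscript i returned by the double array
    stats:      search string -> (T, H, D, M)
    length:     length of the double array
    """
    attrs = [None] * length
    for query, i in subscripts.items():
        attrs[i] = stats[query]  # reference to the attribute record
    return attrs


# Hypothetical values for illustration only.
subscripts = {"ab": 14, "cd": 7}
stats = {"ab": (9, 1, 3, 5), "cd": (2, 0, 0, 1)}
attrs = build_attribute_array(subscripts, stats, length=20)
print(attrs[subscripts["ab"]])
```

Because both structures share the same subscript, confirming that a string exists and reading its attributes costs a single traversal of the double array.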
Preferably, step S40 comprises: merging the ordered related-search lists of the query words into one related-search list; heap-sorting the merged related-search list; and submitting the heap-sorted related-search list to the user.
In an embodiment of the present invention, the heap sort of the ordered related-search lists of the query words proceeds as follows:
1) Several arrays are given, each consisting of the ordered related-search list of one query word.
2) A heap is built from the values at the current marker position of each array. Initially, the current marker of each array points to its first element, and the heap as built is unordered.
3) The heap is sorted with the heap algorithm, the value at the top of the heap is taken and put into the result set, and the current marker of the array from which the top value came is moved one position onward.
4) Step 2) is repeated until the current marker of every array has reached the end of its array.
This embodiment is readily implemented by computer programming.
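The flow above is a k-way merge driven by a heap, and can be sketched with Python's `heapq` module. The assumptions here are that each list holds (score, search string) pairs already sorted by score in descending order, and that scores are negated because `heapq` is a min-heap; instead of rebuilding the heap on every round, the sketch pushes the next element of the array whose value was just taken, which yields the same result.

```python
import heapq


def merge_ranked_lists(lists):
    """Merge lists of (score, query), each sorted by descending score."""
    heap = []
    for n, items in enumerate(lists):
        if items:
            score, query = items[0]
            # (negated score, array number, current marker, query)
            heap.append((-score, n, 0, query))
    heapq.heapify(heap)
    merged = []
    while heap:
        neg_score, n, pos, query = heapq.heappop(heap)
        merged.append((-neg_score, query))
        pos += 1  # move the current marker of that array onward
        if pos < len(lists[n]):
            score, nxt = lists[n][pos]
            heapq.heappush(heap, (-score, n, pos, nxt))
    return merged
```

With k lists and n elements in total, this costs O(n log k), since the heap never holds more than one element per list.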
Preferably, submitting the heap-sorted related-search list to the user comprises: de-duplicating the heap-sorted related-search list; and submitting the first N search query strings of the de-duplicated list to the user, N being a preset integer. For example, if the related searches obtained through query word A include the search query AB, and the related searches obtained through query word B also include AB, the list will contain duplicate search queries. Removing the duplicated related searches makes the result more accurate and better meets the user's individual needs.
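De-duplication followed by truncation to the first N items can be sketched as below; the function name and the choice to keep the first (highest-ranked) occurrence of each string are assumptions.

```python
def top_n_unique(ranked_queries, n):
    """Return the first n distinct search strings, preserving rank order."""
    if n <= 0:
        return []
    seen, result = set(), []
    for query in ranked_queries:
        if query in seen:
            continue  # duplicate reached through another query word
        seen.add(query)
        result.append(query)
        if len(result) == n:
            break
    return result
```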
Fig. 4 shows a schematic diagram of a device for implementing related search according to an embodiment of the present invention, comprising:
an acquisition module 10, for obtaining a search query string input by a user;
a word segmentation module 20, for segmenting the search query string into a plurality of query words;
a list module 30, for obtaining, from an inverted index, an ordered related-search list for each query word;
a submission module 40, for returning the ordered related-search lists of the query words to the user.
Because this device uses an inverted index structure, it can provide related searches to the user efficiently.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices; optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into individual integrated circuit modules, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.