CN102855252B - A need-based data retrieval method and device - Google Patents

A need-based data retrieval method and device

Info

Publication number
CN102855252B
CN102855252B CN201110181722.8A CN201110181722A CN102855252B CN 102855252 B CN102855252 B CN 102855252B CN 201110181722 A CN201110181722 A CN 201110181722A CN 102855252 B CN102855252 B CN 102855252B
Authority
CN
China
Prior art keywords
requirement description
user query
semantic vector
data resource
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110181722.8A
Other languages
Chinese (zh)
Other versions
CN102855252A (en)
Inventor
施少杰
刘建柱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a need-based data retrieval method and device. A semantic vector is established and stored in advance for the requirement description keyword corresponding to each data resource. For a user search request (query), the data resources are selected whose requirement description keywords have semantic vectors whose similarity to the query meets a preset similarity requirement, and the retrieval for the user query is then carried out in the selected data resources. Compared with the prior art, the invention can recall search results that reflect the user's needs more accurately, avoids repeated retrieval, and saves retrieval resources.

Description

A need-based data retrieval method and device
[Technical Field]
The present invention relates to the field of computer technology, and in particular to a need-based data retrieval method and device.
[Background Art]
When a user obtains information through a search engine, the user's need is in many cases quite specific, and the search engine should, based on the search term (query) entered by the user, search the data resource corresponding to that need for information matching the query and return it to the user. To determine the data resource corresponding to the need, the query is matched against the requirement description keyword (key) of each data resource. However, the requirement description key of a data resource is often a single expression, so the corresponding data resource can be found only when the user query is worded the same way as the requirement description key. Because users express the same need with many different queries, the data resource used for the search may be the wrong one, and the returned search results are then inaccurate.
For example, a mailbox login resource has the requirement description key "163 mailbox". Accurate search results are recalled only if the user enters a query identical to that key; if the user enters a query such as "free NetEase mailbox" or "163 mailbox login", accurate search results may not be recalled.
This problem is particularly pronounced in structured data search. Structured data resources are generally deep web ("dark net") resources that have to be supplied by external resource providers, and a provider supplies a single requirement description key along with a structured data resource. For example, if the requirement description key of a structured data resource providing weather information is "weather forecast" and the user enters the query "how is the weather lately", the query may not be mapped to that resource, so search results that accurately reflect the user's need cannot be recalled. The user can only try entering queries repeatedly, which wastes resources.
[Summary of the Invention]
In view of this, the invention provides a need-based data retrieval method and device, so as to recall search results that reflect the user's needs more accurately and to save resources.
The specific technical solution is as follows:
A need-based data retrieval method, in which a semantic vector of the requirement description keyword corresponding to each data resource is established and stored in advance, the method comprising:
A. selecting the data resources corresponding to those requirement description keywords whose semantic vectors have a similarity to the user search request (query) that meets a preset similarity requirement;
B. carrying out the retrieval for the user query in the data resources selected in step A.
Specifically, the semantic vector of the requirement description keyword corresponding to each data resource is established in at least one of the following ways:
extracting, from the description information of the data resource, the content corresponding to specified tags to form the semantic vector of the requirement description keyword of the data resource;
using the search result titles obtained for the requirement description keyword of the data resource to form the semantic vector of the requirement description keyword of the data resource; and
using synonyms of the requirement description keyword of the data resource to form the semantic vector of the requirement description keyword of the data resource.
Forming the semantic vector from the search result titles specifically comprises:
S1. searching with the requirement description keyword corresponding to the data resource;
S2. obtaining the titles of the top N1 search results, N1 being a preset positive integer;
S3. forming the semantic vector of the requirement description keyword of the data resource from the titles obtained in step S2; or performing word segmentation on the titles obtained in step S2 and extracting, from the resulting words, those whose term frequency-inverse document frequency (TF-IDF) meets a preset requirement to form the semantic vector of the requirement description keyword of the data resource.
Step A specifically comprises:
A11. after the user query is received, calculating the similarity between the user query and the semantic vector of each requirement description keyword;
A12. selecting the data resources corresponding to the requirement description keywords whose similarity meets a preset first similarity requirement.
Alternatively, step A specifically comprises:
A21. after the user query is received, looking up a pre-established mapping between user queries and requirement description keywords, the mapping having been established by calculating the similarity between each user query in the search log and the semantic vector of each requirement description keyword and mapping each user query to the requirement description keywords whose similarity meets a preset second similarity requirement;
A22. selecting the data resources corresponding to the requirement description keywords to which the user query is mapped.
Calculating the similarity between a user query and the semantic vector of a requirement description keyword specifically comprises:
C1. determining each item of content in the semantic vector that the user query hits, and calculating, for each hit item, the ratio of the hit length of the user query to the length of the user query; and/or calculating the similarity between the semantic vector of the user query and the semantic vector of each requirement description keyword;
C2. merging the calculation results of step C1 to obtain the similarity between the user query and the semantic vector of the requirement description keyword.
Specifically, the semantic vector of the user query is established in at least one of the following ways:
using the search result titles of the user query to form the semantic vector of the user query; and
using synonyms of the user query to form the semantic vector of the user query.
Preferably, before the semantic vector of the requirement description keyword corresponding to each data resource is established, the method further comprises: preprocessing the requirement description keyword corresponding to each data resource;
and before step A, the method further comprises: preprocessing the user query;
the preprocessing comprising at least one of: conversion to a preset upper-case or lower-case form, and conversion to a preset encoding.
A need-based data retrieval device, comprising:
a semantic vector maintenance unit, configured to establish and store the semantic vector of the requirement description keyword corresponding to each data resource;
a request receiving unit, configured to receive a user search request (query);
a demand recognition unit, configured to select the data resources corresponding to the requirement description keywords whose semantic vectors have a similarity to the user query that meets a preset similarity requirement;
a retrieval processing unit, configured to carry out the retrieval for the user query in the data resources selected by the demand recognition unit.
Specifically, the semantic vector maintenance unit comprises at least one of a first vector maintenance subunit, a second vector maintenance subunit and a third vector maintenance subunit, together with a vector storage subunit;
the first vector maintenance subunit is configured to extract, from the description information of the data resource, the content corresponding to specified tags to form the semantic vector of the requirement description keyword of the data resource, and to provide it to the vector storage subunit;
the second vector maintenance subunit is configured to use the search result titles obtained for the requirement description keyword of the data resource to form the semantic vector of the requirement description keyword of the data resource, and to provide it to the vector storage subunit;
the third vector maintenance subunit is configured to use synonyms of the requirement description keyword of the data resource to form the semantic vector of the requirement description keyword of the data resource, and to provide it to the vector storage subunit;
the vector storage subunit is configured to store the received semantic vectors of the requirement description keywords.
The second vector maintenance subunit obtains the titles of the top N1 search results among the search results for the requirement description keyword of the data resource, and either forms the semantic vector of the requirement description keyword of the data resource from the obtained titles, or performs word segmentation on the obtained titles and extracts, from the resulting words, those whose term frequency-inverse document frequency (TF-IDF) meets a preset requirement to form the semantic vector of the requirement description keyword of the data resource; N1 is a preset positive integer.
The demand recognition unit specifically comprises a similarity calculation subunit and a first resource selection subunit;
the request receiving unit provides the received user query to the similarity calculation subunit;
the similarity calculation subunit is configured to calculate the similarity between the received user query and the semantic vector of each requirement description keyword maintained by the semantic vector maintenance unit;
the first resource selection subunit is configured to select, according to the calculation results of the similarity calculation subunit, the data resources corresponding to the requirement description keywords whose similarity meets a preset first similarity requirement.
Alternatively, the demand recognition unit specifically comprises a log acquisition subunit, a similarity calculation subunit, a mapping maintenance subunit and a second resource selection subunit;
the log acquisition subunit is configured to obtain the user queries in the search log and provide them to the similarity calculation subunit;
the similarity calculation subunit is configured to calculate the similarity between each received user query and the semantic vector of each requirement description keyword maintained by the semantic vector maintenance unit;
the mapping maintenance subunit is configured to establish, according to the calculation results of the similarity calculation subunit, a mapping between each user query and the requirement description keywords whose similarity meets a preset second similarity requirement;
the second resource selection subunit is configured to select the data resources corresponding to the requirement description keywords to which the user query received by the request receiving unit is mapped.
The similarity calculation subunit specifically comprises:
a similarity calculation module, configured to determine each item of content in the semantic vector that the user query hits and to calculate, for each hit item, the ratio of the hit length of the user query to the length of the user query; and/or to calculate the similarity between the semantic vector of the user query and the semantic vector of each requirement description keyword;
a result merging module, configured to merge the calculation results of the similarity calculation module to obtain the similarity between the user query and the semantic vector of the requirement description keyword.
Further, the similarity calculation subunit also comprises a query vector establishing module, configured to use the search result titles of the user query to form the semantic vector of the user query, and/or to use synonyms of the user query to form the semantic vector of the user query.
Preferably, the device further comprises:
a preprocessing unit, configured to preprocess the requirement description keyword corresponding to each data resource and provide the result to the semantic vector maintenance unit, and to preprocess the user query received by the request receiving unit and provide the result to the demand recognition unit;
the preprocessing comprising at least one of: conversion to a preset upper-case or lower-case form, and conversion to a preset encoding.
As can be seen from the above technical solution, the invention establishes in advance the semantic vector of the requirement description key corresponding to each data resource and then calculates the similarity between the user query and each semantic vector, so that the data resource corresponding to the need behind the user query can be determined and the retrieval for the query carried out in it. Compared with the prior art, search results that reflect the user's needs more accurately can be recalled, and the user does not have to keep re-entering queries in an attempt to match the requirement description key of the data resource. Repeated retrieval is avoided and retrieval resources are saved.
[Brief Description of the Drawings]
Fig. 1 is a flowchart of the main method provided by an embodiment of the invention;
Fig. 2 is a flowchart of the method for selecting data resources provided by embodiment two of the invention;
Fig. 3 is a flowchart of the method for selecting data resources provided by embodiment three of the invention;
Fig. 4 is a structural diagram of a device provided by embodiment four of the invention;
Fig. 5 is a structural diagram of another device provided by embodiment four of the invention.
[Detailed Description]
To make the objects, technical solutions and advantages of the invention clearer, the invention is described below with reference to the drawings and specific embodiments.
The main method provided by the invention, as shown in Fig. 1, comprises the following steps:
Step 101: establish and store in advance the semantic vector of the requirement description key corresponding to each data resource.
Step 102: select the data resources corresponding to the requirement description keys whose semantic vectors have a similarity to the user query that meets a preset similarity requirement.
Step 103: carry out the retrieval for the user query in the data resources selected in step 102.
First, the process of establishing the semantic vector of each requirement description key is described in detail through embodiment one.
Embodiment one
When the semantic vector of the requirement description key corresponding to each data resource is established, the requirement description key is first preprocessed. The preprocessing may include at least one of the following: conversion to a preset upper-case or lower-case form, and conversion to a preset encoding.
The requirement description key is preprocessed so that the user query and the semantic vector of the requirement description key are consistent in form, which facilitates the subsequent calculation of the similarity between them. It can be agreed in advance that user queries and requirement description keys uniformly use upper case or uniformly use lower case; it can also be agreed in advance that they use a unified encoding, for example the Chinese Internal Code Extension Specification (GBK).
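The preprocessing described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `preprocess`, its parameters, and the choice to trim whitespace are hypothetical and not taken from the patent; GBK is used because it is the example encoding named in the text.

```python
def preprocess(text: str, lowercase: bool = True, encoding: str = "gbk") -> bytes:
    """Bring a query or key into the agreed case and encoding (hypothetical helper)."""
    # Trim surrounding whitespace, then apply the agreed case convention.
    text = text.strip()
    text = text.lower() if lowercase else text.upper()
    # Encode with the agreed encoding; GBK is the example named in the text.
    return text.encode(encoding)


# A query and a key that differ only in case compare equal after preprocessing.
print(preprocess("163 Mailbox") == preprocess("163 MAILBOX"))  # True
```

Applying the same function to both the requirement description keys and the incoming queries is what keeps the two sides comparable.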
The semantic vector of a requirement description key can be built in, but is not limited to, the following ways:
First way: extract, from the description information of the data resource, the content corresponding to specified tags to form the semantic vector of the requirement description key of that data resource.
Each data resource may have corresponding description information, usually in the form of Extensible Markup Language (XML) data, which contains all the keywords of the data resource and the description information corresponding to each keyword. This description information is semantically related to the requirement description key of the data resource and can therefore be used to expand the requirement description key into a semantic vector. The description information carries tags; certain tags can be specified in advance, and the content corresponding to the specified tags is extracted to form the semantic vector of the requirement description key of the data resource.
The specified tags include, but are not limited to: content (content), title (title), subtitle (smalltitle), button text (buttontext), form title (formtitle), description (description), mailbox suffix (emailtail) and link text (linktext).
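As a rough illustration of the first way, the following Python sketch pulls the text of the specified tags out of an XML description. The tag names come from the list above; the sample document, its `resource` and `key` elements, and the function name are assumptions made for the example, since the patent does not give the actual XML layout.

```python
import xml.etree.ElementTree as ET

# Tag names listed in the text; any other tag is ignored.
SPECIFIED_TAGS = {"content", "title", "smalltitle", "buttontext",
                  "formtitle", "description", "emailtail", "linktext"}


def extract_tag_contents(description_xml: str, tags=SPECIFIED_TAGS) -> list[str]:
    """Collect the non-empty text of every element whose tag is specified."""
    root = ET.fromstring(description_xml)
    contents = []
    for element in root.iter():  # document order, including the root itself
        if element.tag in tags and element.text and element.text.strip():
            contents.append(element.text.strip())
    return contents


# Hypothetical description information for a mailbox login resource.
SAMPLE = """<resource>
  <key>163 mailbox</key>
  <title>NetEase free mailbox</title>
  <buttontext>register mailbox</buttontext>
  <emailtail>163.com</emailtail>
</resource>"""

print(extract_tag_contents(SAMPLE))
# ['NetEase free mailbox', 'register mailbox', '163.com']
```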
Second way: use the search result titles obtained for the requirement description key of the data resource to form the semantic vector of that key.
A search can be run with the requirement description key. Search results are usually ranked by relevance to the requirement description key, so the titles of the top N1 search results can be obtained, where N1 is a preset positive integer. These titles are semantically related to the requirement description key, and content extracted from them can be used to expand the requirement description key into its semantic vector.
The obtained titles can be used directly to form the semantic vector of the requirement description key. For example, the titles of the top 20 search results for the requirement description key are taken to expand the key into its semantic vector.
Alternatively, word segmentation can be performed on each obtained title, and the words whose term frequency-inverse document frequency (TF-IDF) meets a preset requirement are extracted from the segmented words to form the semantic vector of the requirement description key. TF-IDF assesses how important a word is to one document in a corpus; here it expresses how important a segmented word is to the obtained titles relative to a large-scale corpus. The importance increases with the frequency with which the word occurs in the obtained titles and decreases with the frequency with which it occurs in the large-scale corpus. The computation of TF-IDF is prior art and is not repeated here.
When words are extracted from the segmented words to form the semantic vector of the requirement description key, the words whose TF-IDF reaches a set threshold can be extracted, or the words ranked in the top N2 by TF-IDF, where N2 is a preset positive integer.
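The TF-IDF variant of the second way might look like the sketch below. It makes several assumptions that the patent does not specify: whitespace splitting stands in for word segmentation, the IDF uses the common smoothed form log(N / (1 + df)), and the document frequencies and corpus size are invented numbers for illustration only.

```python
import math
from collections import Counter


def tfidf_scores(titles, doc_freq, corpus_size):
    """Score each word of the obtained titles by TF (in titles) x IDF (in corpus)."""
    words = [w for title in titles for w in title.lower().split()]
    counts = Counter(words)
    total = len(words)
    scores = {}
    for word, count in counts.items():
        tf = count / total
        idf = math.log(corpus_size / (1 + doc_freq.get(word, 0)))
        scores[word] = tf * idf
    return scores


def select_words(scores, threshold=None, top_n=None):
    """Keep words reaching a threshold and/or ranked in the top N by score."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    if threshold is not None:
        ranked = [(w, s) for w, s in ranked if s >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [w for w, _ in ranked]


titles = ["NetEase free mailbox", "NetEase 163", "NetEase VIP mailbox"]
# Hypothetical document frequencies in a corpus of 10000 documents.
doc_freq = {"free": 5000, "mailbox": 200, "netease": 50, "163": 80, "vip": 900}
scores = tfidf_scores(titles, doc_freq, corpus_size=10000)

print(select_words(scores, top_n=2))        # ['netease', 'mailbox']
print(select_words(scores, threshold=0.5))  # ['netease', 'mailbox', '163']
```

A very common word such as "free" scores low even though it appears in the titles, which is the effect the text attributes to the inverse document frequency.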
Third way: use synonyms of the requirement description key of the data resource to form the semantic vector of that key.
In this way, the synonyms of the requirement description key can be obtained by looking them up in a synonym dictionary, and the obtained synonyms are used to expand the requirement description key into its semantic vector.
If all three ways are used together, the semantic vector of a requirement description key may contain: the requirement description key itself, the content corresponding to the specified tags in the description information, the search result titles for the requirement description key or the words extracted from those titles, and the synonyms of the requirement description key.
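One possible way to assemble the combined semantic vector is sketched below. The function name and the decision to drop duplicates while keeping first-seen order are assumptions for illustration; the patent only states which kinds of content the vector may contain.

```python
def build_semantic_vector(key, tag_contents=(), titles=(), synonyms=()):
    """Combine the key with content from the three ways, skipping duplicates."""
    vector = []
    seen = set()
    for item in (key, *tag_contents, *titles, *synonyms):
        if item not in seen:  # keep the first occurrence only
            seen.add(item)
            vector.append(item)
    return vector


vector = build_semantic_vector(
    "163 mailbox",
    tag_contents=["NetEase free mailbox", "163.com"],
    titles=["NetEase 163"],
    synonyms=["NetEase mailbox", "163 mailbox"],
)
print(vector)
# ['163 mailbox', 'NetEase free mailbox', '163.com', 'NetEase 163', 'NetEase mailbox']
```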
Two approaches can be used to select the data resources in which the retrieval for a user query is carried out; they are described in embodiment two and embodiment three below.
Embodiment two
Fig. 2 is a flowchart of the method for selecting data resources provided by embodiment two of the invention. As shown in Fig. 2, the method may comprise the following steps:
Step 201: receive the user query.
The user query is the query entered by the user and received by the search engine.
Step 202: calculate the similarity between the user query and the semantic vector of the requirement description key of each data resource.
To calculate the similarity in this step, the following steps S1 and S2 can be performed for the semantic vector of each requirement description key:
S1: match the user query against the semantic vector of the requirement description key, determine each item of content in the semantic vector that the user query hits, and then calculate, for each hit item, the ratio of the hit length of the user query to the length of the user query; and/or calculate the similarity between the semantic vector of the user query and the semantic vector of the requirement description key.
If the semantic vector of the requirement description key includes the requirement description key itself, the ratio of the length of the part of the user query that hits the requirement description key to the total length of the user query can be calculated.
If the semantic vector of the requirement description key was built in the first way of embodiment one, the ratio for each hit item is the ratio of the length of the words of the user query that are hit in the content corresponding to the specified tags in the description information of the data resource to the total length of the user query. In this embodiment, a hit means being identical to a word in the user query.
Preferably, different weight values can further be assigned to the content of the specified tags according to how important the content of each specified tag is in the semantic vector, and the ratios calculated above are then linearly weighted with these weight values to obtain the similarity for this part.
If the semantic vector of the requirement description key was built in the second way of embodiment one, the ratio for each hit item is the ratio of the length of the words of the user query that are hit in the search result titles, or in the words extracted from the search result titles, contained in the semantic vector to the total length of the user query.
If the semantic vector of the requirement description key was built in the third way of embodiment one, the ratio for each hit item is the ratio of the length of the words of the user query that are hit in the synonyms of the requirement description key contained in the semantic vector to the total length of the user query.
Before these ratios are calculated, word segmentation is first performed on the user query; existing word segmentation techniques can be used for this and are not described further.
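The hit-length ratio and the tag weighting can be sketched as follows. This is a minimal reading of the text under stated assumptions: length is measured in characters of the segmented words, the query and tag content are already segmented into word lists, and the tag weights are invented values. None of these details is fixed by the patent.

```python
def hit_ratio(query_words, content_words):
    """Length of the query words found in the content, over the total query length."""
    query_length = sum(len(word) for word in query_words)
    if query_length == 0:
        return 0.0
    content_set = set(content_words)
    # A hit means the content contains a word identical to a query word.
    hit_length = sum(len(word) for word in query_words if word in content_set)
    return hit_length / query_length


def weighted_tag_similarity(query_words, tag_contents, tag_weights):
    """Linearly weight the per-tag hit ratios with preset tag weights."""
    return sum(tag_weights.get(tag, 0.0) * hit_ratio(query_words, words)
               for tag, words in tag_contents.items())


query = ["free", "163", "mailbox"]                       # segmented user query
tag_contents = {"title": ["netease", "free", "mailbox"],
                "emailtail": ["163.com"]}
tag_weights = {"title": 0.7, "emailtail": 0.3}            # hypothetical weights

print(round(weighted_tag_similarity(query, tag_contents, tag_weights), 2))  # 0.55
```

Here the title hits "free" and "mailbox" (11 of 14 characters), while "163.com" is not identical to the query word "163" and so contributes nothing.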
In this step, the similarity between the semantic vector of the user query and the semantic vector of each requirement description key can also be calculated; cosine similarity can be used for this, although the calculation is not limited to it. The semantic vector of the user query can be established in, but is not limited to, at least one of the following ways:
First way: use the search result titles of the user query to form the semantic vector of the user query.
A search (that is, an ordinary web page search) can be run with the user query. Search results are usually ranked by relevance to the user query, so the titles of the top N3 search results can be obtained, where N3 is a preset positive integer. These titles are semantically related to the user query, and content extracted from them can be used to expand the user query into its semantic vector.
The obtained titles can be used directly to form the semantic vector of the user query. For example, the titles of the top 20 search results for the user query are taken to form the semantic vector of the user query.
Alternatively, word segmentation can be performed on each obtained title, and the words whose TF-IDF meets a preset requirement are extracted from the segmented words to form the semantic vector of the user query. The preset requirement can be that the TF-IDF reaches a set threshold or that the word ranks in the top N4 by TF-IDF, where N4 is a preset positive integer.
Second way: use synonyms of the user query to form the semantic vector of the user query.
In this way, the synonyms of the user query can be obtained by looking them up in a synonym dictionary, and the obtained synonyms are used to expand the user query into its semantic vector.
In addition, before the semantic vector of the user query is established, the user query can also be preprocessed in the same way as the requirement description key, that is, converted to a preset upper-case or lower-case form and/or converted to a preset encoding.
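For the vector-to-vector comparison, cosine similarity over word counts is one straightforward option. The sketch below assumes both semantic vectors have been reduced to lists of words (a bag-of-words view); the patent names cosine similarity but does not say how the vectors are represented numerically.

```python
import math
from collections import Counter


def cosine_similarity(words_a, words_b):
    """Cosine of the angle between two bag-of-words count vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


query_vector = ["netease", "mailbox"]   # words of the query's semantic vector
key_vector = ["mailbox", "login"]       # words of a key's semantic vector
print(round(cosine_similarity(query_vector, key_vector), 3))  # 0.5
```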
S2: merge the calculation results of step S1 to obtain the similarity between the user query and the semantic vector of the requirement description key.
Linear weighting can be used here: a weight is preset for each calculation result, and the linearly weighted sum of the calculation results is taken as the similarity between the user query and the semantic vector of the requirement description key.
Step 203: select the data resources corresponding to the requirement description keys whose similarity meets the preset similarity requirement.
In this step, the data resources corresponding to the requirement description keys ranked in the top N by similarity can be selected, where N is a preset positive integer; alternatively, the data resources corresponding to the requirement description keys whose similarity reaches a preset similarity threshold can be selected.
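Step S2 and step 203 can be sketched together as a merge followed by a selection. The weights, resource names and scores below are hypothetical values chosen for illustration, and merging exactly two partial results (a hit ratio and a cosine score) is an assumption; the text allows any number of weighted results.

```python
def combined_similarity(hit_score, cosine_score, w_hit=0.5, w_cos=0.5):
    """Linearly weight the two partial results of step S1."""
    return w_hit * hit_score + w_cos * cosine_score


def select_resources(similarities, top_n=None, threshold=None):
    """Pick resources ranked in the top N and/or reaching the threshold."""
    ranked = sorted(similarities.items(), key=lambda kv: (-kv[1], kv[0]))
    if threshold is not None:
        ranked = [(r, s) for r, s in ranked if s >= threshold]
    if top_n is not None:
        ranked = ranked[:top_n]
    return [resource for resource, _ in ranked]


# Hypothetical (hit ratio, cosine) pairs per data resource for one query.
similarities = {
    "mailbox_login": combined_similarity(0.8, 0.6),
    "news": combined_similarity(0.2, 0.4),
    "weather": combined_similarity(0.0, 0.1),
}

print(select_resources(similarities, top_n=2))        # ['mailbox_login', 'news']
print(select_resources(similarities, threshold=0.5))  # ['mailbox_login']
```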
Step 204: carry out the retrieval for the user query in the data resources selected in step 203.
Embodiment three
Fig. 3 is a flowchart of the method for selecting data resources provided by embodiment three of the invention. As shown in Fig. 3, the method may comprise the following steps:
Step 301: receive the user query.
The user query is the query entered by the user and received by the search engine.
Step 302: look up the pre-established mapping between user queries and requirement description keys. The mapping is established by calculating the similarity between each user query in the search log and the semantic vector of each requirement description key, and then mapping each user query to the requirement description keys whose similarity meets a preset similarity requirement.
In this embodiment, the mapping between user queries and requirement description keys is established in advance. After a user query is received, the mapping is looked up directly to determine the requirement description key to which the user query is mapped.
To establish the mapping, the user queries are obtained in advance from the search log, the similarity between each user query and the semantic vector of each requirement description key is calculated, and a mapping is established between each user query and the requirement description keys whose similarity meets the preset similarity requirement. For example, a mapping is established between a user query and a requirement description key when their similarity reaches a preset similarity threshold.
For the calculation of the similarity between a user query and the semantic vector of a requirement description key, see the description of step 202 in embodiment two.
Step 303: select the data resources corresponding to the requirement description keys to which the received user query is mapped.
Step 304: carry out the retrieval for the user query in the data resources selected in step 303.
In embodiment two, the similarity between the user query and the semantic vector of each requirement description key is calculated in real time after the user query is received, in order to determine the data resources used for the search. In embodiment three, the similarity between each user query in the search log and the semantic vector of each requirement description key is calculated offline in advance, and a mapping between user queries and requirement description keys is established from it; after a user query is received, the mapping is used to determine the data resources used for the search.
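The offline approach of embodiment three can be sketched as building a lookup table from the search log and consulting it at query time. Everything concrete here is an assumption for illustration: the word-overlap `toy_similarity` merely stands in for the similarity of step 202, and the log queries, keys, resource names and threshold are invented.

```python
def toy_similarity(query, vector):
    """Stand-in similarity: share of query words found anywhere in the vector."""
    words = query.split()
    if not words:
        return 0.0
    vocabulary = {w for item in vector for w in item.split()}
    return sum(1 for w in words if w in vocabulary) / len(words)


def build_mapping(log_queries, key_vectors, similarity, threshold):
    """Offline: map each logged query to the keys whose similarity is high enough."""
    mapping = {}
    for query in set(log_queries):
        keys = [key for key, vector in key_vectors.items()
                if similarity(query, vector) >= threshold]
        if keys:
            mapping[query] = keys
    return mapping


def lookup_resources(query, mapping, key_to_resource):
    """Online: a plain table lookup, with no similarity computed at query time."""
    return [key_to_resource[key] for key in mapping.get(query, [])]


key_vectors = {
    "163 mailbox": ["163 mailbox", "netease free mailbox", "register mailbox"],
    "weather forecast": ["weather forecast", "local weather"],
}
key_to_resource = {"163 mailbox": "mailbox_login_resource",
                   "weather forecast": "weather_resource"}
log_queries = ["free 163 mailbox", "163 mailbox login", "recent weather"]

mapping = build_mapping(log_queries, key_vectors, toy_similarity, threshold=0.6)
print(lookup_resources("163 mailbox login", mapping, key_to_resource))
# ['mailbox_login_resource']
print(lookup_resources("recent weather", mapping, key_to_resource))
# []
```

The trade-off is the usual one: the lookup is cheap at query time, but a query that never appeared in the log has no entry, which is where the real-time calculation of embodiment two still applies.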
As a concrete example, suppose the requirement description keyword of a certain mailbox login resource is "163 mailbox".
The following are extracted from the description information of this mailbox login resource: "163 NetEase free mail", "NetEase free mailbox", "register mailbox", "163.com".
After searching with the requirement description keyword, the following are extracted from the search result titles: "NetEase free mailbox - China's first e-mail service provider", "NetEase 163", "NetEase VIP mailbox - the most secure and stable paid mailbox".
The synonyms extracted for the requirement description keyword are: "NetEase", "mailbox", "email", "free mailbox", "NetEase mailbox".
The content extracted above forms the semantic vector of the requirement description keyword. This semantic vector comprises: "163 mailbox", "163 NetEase free mail", "NetEase free mailbox", "register mailbox", "163.com", "NetEase free mailbox - China's first e-mail service provider", "NetEase 163", "NetEase VIP mailbox - the most secure and stable paid mailbox", "NetEase", "mailbox", "email", "free mailbox", "NetEase mailbox".
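Assembling such a semantic vector from the three sources can be sketched as below. The function name and the order-preserving de-duplication are assumptions made for illustration, and the sample uses an abbreviated, lowercased English rendering of the entries above.

```python
from typing import Iterable, List


def build_semantic_vector(
    keyword: str,
    description_contents: Iterable[str],
    result_titles: Iterable[str],
    synonyms: Iterable[str],
) -> List[str]:
    """Concatenate the keyword and its three expansion sources, dropping repeats."""
    seen = set()
    vector: List[str] = []
    for entry in [keyword, *description_contents, *result_titles, *synonyms]:
        if entry not in seen:  # keep the first occurrence only
            seen.add(entry)
            vector.append(entry)
    return vector


vector = build_semantic_vector(
    "163 mailbox",
    description_contents=["netease free mailbox", "163.com"],
    result_titles=["netease 163"],
    synonyms=["netease mailbox", "163 mailbox"],  # last item repeats the keyword
)
print(vector)
```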
In the approach of embodiment three, suppose the user queries obtained from the search log include "free 163 mailbox", "163 mailbox login", and so on.
The similarity between each user query and the above semantic vector is calculated separately. Taking the user query "free 163 mailbox" as an example, for every entry of the semantic vector that the query hits, the ratio of the hit length of the user query to the length of the user query is calculated. For instance, for the entry "NetEase free mailbox", the hit length is the combined length of the two words "free" and "mailbox", and the ratio of this hit length to the length of the user query "free 163 mailbox" is determined. The ratio is calculated in this way for each entry of the semantic vector in turn, and the results are finally combined by linear weighting with preset weights to obtain the similarity between the user query and the semantic vector.
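The hit-length ratio and the linear weighting can be illustrated with the short sketch below. It rests on several assumptions that the text does not specify: the query is already segmented into terms, a term "hits" an entry when it occurs in it as a substring, lengths are counted in characters, and one weight is supplied per entry. All names are hypothetical.

```python
from typing import Sequence


def hit_ratio(query_terms: Sequence[str], entry: str) -> float:
    """Ratio of the length of the query terms found in `entry` to the query length."""
    query_len = sum(len(term) for term in query_terms)
    if query_len == 0:
        return 0.0
    # Substring matching is an assumption; a real system may match whole words.
    hit_len = sum(len(term) for term in query_terms if term in entry)
    return hit_len / query_len


def weighted_similarity(
    query_terms: Sequence[str], vector: Sequence[str], weights: Sequence[float]
) -> float:
    """Linear weighting of the per-entry hit ratios with preset weights."""
    if len(vector) != len(weights):
        raise ValueError("one weight is required per semantic vector entry")
    return sum(w * hit_ratio(query_terms, e) for e, w in zip(vector, weights))


query_terms = ["free", "163", "mailbox"]  # segmented form of "free 163 mailbox"
vector = ["163 mailbox", "netease free mailbox"]
score = weighted_similarity(query_terms, vector, weights=[0.5, 0.5])
print(round(score, 4))
```

For the entry "netease free mailbox" the hit length is 4 + 7 = 11 characters out of a query length of 14, matching the "free" and "mailbox" case described above. Entries the query does not hit contribute a ratio of zero.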
Suppose the calculated similarities between "free 163 mailbox" and the above semantic vector, and between "163 mailbox login" and the above semantic vector, both meet the preset similarity requirement. A mapping is then established between "free 163 mailbox" and the requirement description keyword "163 mailbox", and another between "163 mailbox login" and the requirement description keyword "163 mailbox".
If the user query "free 163 mailbox" is received, the pre-established mapping is looked up and the query is found to map to the requirement description keyword "163 mailbox". The retrieval for the user query "free 163 mailbox" is then performed in the mailbox login resource corresponding to "163 mailbox".
The above describes the method provided by the present invention. The device provided by the present invention is described in detail below through embodiment four.
Embodiment four,
Fig. 4 is a structural diagram of the device provided by embodiment four of the present invention. The device can be deployed on the server side where the search engine resides. As shown in Fig. 4, the device may include: a semantic vector maintenance unit 400, a request receiving unit 410, a demand recognition unit 420, and a retrieval processing unit 430.
The semantic vector maintenance unit 400 establishes and stores the semantic vector of the requirement description keyword corresponding to each data resource.
The request receiving unit 410 receives the user query.
The demand recognition unit 420 selects the data resources corresponding to the requirement description keywords for which the similarity between the user query and the keyword's semantic vector meets the preset similarity requirement.
The retrieval processing unit 430 performs the retrieval for the user query in the data resources selected by the demand recognition unit 420.
The semantic vector maintenance unit 400 may specifically include a vector storage subunit 404 and at least one of a first vector maintenance subunit 401, a second vector maintenance subunit 402, and a third vector maintenance subunit 403 (Fig. 4 and Fig. 5 show the case in which all three are included).
The first vector maintenance subunit 401 extracts the content corresponding to specified tags from the description information of a data resource to form the semantic vector of that data resource's requirement description keyword, and provides it to the vector storage subunit 404.
Each data resource may have corresponding description information, usually in the form of XML data, which contains all keywords of the data resource and the description information corresponding to each keyword. This description information is semantically associated with the requirement description keyword of the data resource, so it can be used to expand the requirement description keyword into a semantic vector. The description information is marked up with tags; some of these tags can be specified in advance, and the content corresponding to the specified tags is extracted to form the semantic vector of the data resource's requirement description keyword.
The specified tags include, but are not limited to: content, title, smalltitle, buttontext, formtitle, description, emailtail, and linktext.
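Extraction of the specified tags can be sketched with Python's standard `xml.etree.ElementTree` parser. The layout of the sample XML (an `item` element with `key`, `title`, `buttontext`, and `url` children) is invented for illustration, since the text does not give the actual schema of the description information.

```python
import xml.etree.ElementTree as ET
from typing import Collection, List

# Tag names listed in the text; the set can be narrowed or extended.
SPECIFIED_TAGS = frozenset(
    {"content", "title", "smalltitle", "buttontext",
     "formtitle", "description", "emailtail", "linktext"}
)


def extract_tag_contents(
    xml_text: str, tags: Collection[str] = SPECIFIED_TAGS
) -> List[str]:
    """Return the text of every element whose tag is among the specified tags."""
    root = ET.fromstring(xml_text)
    contents: List[str] = []
    for element in root.iter():  # document order, root included
        text = (element.text or "").strip()
        if element.tag in tags and text:
            contents.append(text)
    return contents


# Hypothetical description information for the mailbox login resource.
sample = """
<item>
  <key>163 mailbox</key>
  <title>NetEase free mailbox</title>
  <buttontext>register mailbox</buttontext>
  <url>163.com</url>
</item>
"""
print(extract_tag_contents(sample))
```

Only the `title` and `buttontext` contents are returned here, because `key` and `url` are not among the specified tags in this sketch.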
The second vector maintenance subunit 402 uses the search result titles corresponding to the data resource's requirement description keyword to form the semantic vector of that requirement description keyword, and provides it to the vector storage subunit 404.
The requirement description keyword can be used to run a search. Search results are usually sorted by their relevance to the requirement description keyword, so the titles of the top N1 search results can be obtained, where N1 is a preset positive integer. These titles are semantically associated with the requirement description keyword, and content extracted from them can be used to expand the requirement description keyword into its semantic vector.
The second vector maintenance subunit 402 can obtain the titles of the top N1 search results for the requirement description keyword corresponding to the data resource and form the semantic vector of that requirement description keyword from the obtained titles. Alternatively, it can segment the obtained titles into words and form the semantic vector from those words whose term frequency-inverse document frequency (TF-IDF) meets a preset requirement. N1 is a preset positive integer.
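The TF-IDF variant can be sketched as follows. The text only says that words whose TF-IDF "meets a preset requirement" are kept, so the details here are assumptions: whitespace segmentation, term frequency computed over all obtained titles, a smoothed IDF taken from a hypothetical background document-frequency table, and a minimum-score threshold.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def select_title_words(
    titles: Sequence[str],
    doc_freq: Dict[str, int],
    num_docs: int,
    min_score: float,
) -> List[str]:
    """Keep the title words whose TF-IDF score reaches `min_score`."""
    words = [w for title in titles for w in title.lower().split()]
    if not words:
        return []
    counts = Counter(words)
    selected: List[str] = []
    for word, count in counts.items():
        tf = count / len(words)
        # Add-one smoothing avoids division by zero for unseen words.
        idf = math.log(num_docs / (1 + doc_freq.get(word, 0)))
        if tf * idf >= min_score:
            selected.append(word)
    return selected


# Hypothetical background statistics over 1000 documents.
doc_freq = {"netease": 9, "free": 499, "mailbox": 49, "163": 9}
words = select_title_words(
    ["NetEase free mailbox", "NetEase 163"], doc_freq, num_docs=1000, min_score=0.5
)
print(words)
```

With these made-up statistics the common word "free" falls below the threshold, while the more distinctive words are kept.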
The third vector maintenance subunit 403 uses the synonyms of the data resource's requirement description keyword to form the semantic vector of that requirement description keyword, and provides it to the vector storage subunit 404.
Specifically, the synonyms of the requirement description keyword can be obtained by querying a synonym dictionary, and the obtained synonyms are used to expand the requirement description keyword into its semantic vector.
The vector storage subunit 404 stores the semantic vectors of the requirement description keywords it receives. If the semantic vector maintenance unit 400 includes the first vector maintenance subunit 401, the second vector maintenance subunit 402, and the third vector maintenance subunit 403 at the same time, the semantic vector of a requirement description keyword stored by the vector storage subunit 404 may include: the requirement description keyword itself, the content corresponding to the specified tags in the description information, the search result titles corresponding to the requirement description keyword (or words from those titles), and the synonyms of the requirement description keyword.
Corresponding to the approaches of embodiment two and embodiment three above, the demand recognition unit 420 can adopt either of two structures:
In the first structure, shown in Fig. 4, the demand recognition unit 420 may specifically include: a similarity calculation subunit 421 and a first resource selection subunit 422.
Under this structure, the request receiving unit 410 provides the received user query to the similarity calculation subunit 421.
The similarity calculation subunit 421 calculates the similarity between the received user query and the semantic vector of each requirement description keyword maintained by the semantic vector maintenance unit 400.
The first resource selection subunit 422 selects, according to the calculation results of the similarity calculation subunit 421, the data resources corresponding to the requirement description keywords whose similarity meets the preset first similarity requirement.
In the second structure, shown in Fig. 5, the demand recognition unit 420 specifically includes: a log selection subunit 521, a similarity calculation subunit 522, a mapping maintenance subunit 523, and a second resource selection subunit 524.
The log selection subunit 521 obtains the user queries in the search log and provides them to the similarity calculation subunit 522.
The similarity calculation subunit 522 calculates the similarity between each user query it receives and the semantic vector of each requirement description keyword maintained by the semantic vector maintenance unit 400.
The mapping maintenance subunit 523 establishes, according to the calculation results of the similarity calculation subunit 522, a mapping between each user query and the requirement description keywords whose similarity meets the preset second similarity requirement.
The second resource selection subunit 524 selects the data resource corresponding to the requirement description keyword to which the user query received by the request receiving unit 410 is mapped.
The similarity calculation subunit 421 in Fig. 4 and the similarity calculation subunit 522 in Fig. 5 may each specifically include a similarity calculation module and a result merging module, which are not shown in Fig. 4 and Fig. 5.
The similarity calculation module determines the entries of the semantic vector that the user query hits and, for each hit entry, calculates the ratio of the hit length of the user query to the length of the user query; and/or it calculates the similarity between the semantic vector of the user query and the semantic vector of each requirement description keyword.
The result merging module merges the calculation results of the similarity calculation module to obtain the similarity between the user query and the semantic vector of the requirement description keyword.
In addition, the similarity calculation subunit may further include a query vector establishing module, which forms the semantic vector of the user query from the search result titles of the user query and/or from the synonyms of the user query.
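The vector-to-vector comparison and the result merging are not spelled out in this passage. One possible reading, offered purely as an assumption, is a bag-of-words cosine similarity between the two semantic vectors, merged with the hit-length score by a weighted sum. The function names and the default weight below are hypothetical.

```python
import math
from collections import Counter
from typing import Sequence


def bag_of_words(vector: Sequence[str]) -> Counter:
    """Count the words of all entries of a semantic vector (whitespace segmentation)."""
    return Counter(w for entry in vector for w in entry.lower().split())


def cosine_similarity(query_vector: Sequence[str], keyword_vector: Sequence[str]) -> float:
    """Cosine similarity between the word counts of two semantic vectors."""
    a, b = bag_of_words(query_vector), bag_of_words(keyword_vector)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def merge_scores(hit_score: float, vector_score: float, hit_weight: float = 0.5) -> float:
    """Result merging as a weighted sum (an assumed merging rule)."""
    return hit_weight * hit_score + (1.0 - hit_weight) * vector_score


# Query vector built from, for example, the query's result titles and synonyms.
query_vector = ["free 163 mailbox", "163 mail"]
keyword_vector = ["163 mailbox", "netease free mailbox", "163.com"]
combined = merge_scores(0.75, cosine_similarity(query_vector, keyword_vector))
print(round(combined, 4))
```

Other similarity measures or merging rules would fit the description equally well; cosine similarity is used here only because it is a common choice for comparing term vectors.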
As shown in Fig. 4 and Fig. 5, to facilitate calculating the similarity between the user query and the semantic vector of a requirement description keyword, the user query and the requirement description keyword are preferably converted to a unified form. For this purpose, the device may further include a pretreatment unit 440.
The pretreatment unit 440 preprocesses the requirement description keyword corresponding to each data resource and provides the result to the semantic vector maintenance unit 400, and preprocesses the user query received by the request receiving unit 410 and provides the result to the demand recognition unit 420.
The preprocessing includes at least one of the following: converting to a preset uppercase or lowercase form, and converting to a preset encoding.
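A minimal sketch of such preprocessing is shown below, assuming the preset case is lowercase and that raw input arrives as bytes in a known source encoding. The function name and the choice of GBK and UTF-8 as example encodings are assumptions for illustration.

```python
def to_unified_form(raw: bytes, source_encoding: str) -> str:
    """Decode raw bytes and lowercase them so queries and keywords are comparable."""
    text = raw.decode(source_encoding)
    return text.strip().lower()


# The same text arriving in two different encodings ends up in one unified form.
from_gbk = to_unified_form("NetEase 163邮箱 ".encode("gbk"), "gbk")
from_utf8 = to_unified_form("NETEASE 163邮箱".encode("utf-8"), "utf-8")
print(from_gbk == from_utf8)
```

The resulting string can be re-encoded with `str.encode` if the rest of the pipeline expects bytes in the preset encoding.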
The above method or device can be used in structured data search, where it selects, for a user query, the structured data resources that meet the user's needs. After the retrieval results are obtained, the search results from the structured data resources can be returned to the user on their own, or they can be merged with the search results of ordinary web page search. Any presentation style can be used when displaying the search results; preferably, the search results from the structured data resources are ranked first.
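The preferred presentation order can be sketched as a simple merge that places structured results ahead of ordinary web page results. Representing each result as a URL string and dropping repeats are assumptions made for this illustration, and the URLs are placeholders.

```python
from typing import List, Sequence


def present_results(structured: Sequence[str], web: Sequence[str]) -> List[str]:
    """Rank structured-resource results first, then web results not already shown."""
    seen = set()
    merged: List[str] = []
    for result in [*structured, *web]:
        if result not in seen:
            seen.add(result)
            merged.append(result)
    return merged


page = present_results(
    structured=["https://example.com/mailbox-login"],
    web=["https://example.com/news", "https://example.com/mailbox-login"],
)
print(page)
```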
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (16)

1. A method for identifying search needs, applied to structured data search, wherein each structured data resource provided by an external resource has a corresponding requirement description keyword; characterized in that a semantic vector of the requirement description keyword corresponding to each structured data resource is established in advance; the method comprising:
A. selecting the structured data resource corresponding to each requirement description keyword for which the similarity between a user search request (query) and the semantic vector of the requirement description keyword meets a preset similarity requirement;
B. performing the retrieval for the user query in the structured data resource selected in step A, and returning the search results obtained by the retrieval to the user.
2. The method according to claim 1, characterized in that establishing the semantic vector of the requirement description keyword corresponding to each structured data resource comprises at least one of the following:
extracting the content corresponding to specified tags from the description information of the structured data resource to form the semantic vector of the requirement description keyword of the structured data resource;
using the search result titles corresponding to the requirement description keyword of the structured data resource to form the semantic vector of the requirement description keyword of the structured data resource; and
using the synonyms of the requirement description keyword of the structured data resource to form the semantic vector of the requirement description keyword of the structured data resource.
3. The method according to claim 2, characterized in that using the search result titles corresponding to the requirement description keyword of the structured data resource to form the semantic vector of the requirement description keyword of the structured data resource specifically comprises:
S1. searching with the requirement description keyword corresponding to the structured data resource;
S2. obtaining the titles of the top N1 search results, N1 being a preset positive integer;
S3. forming the semantic vector of the requirement description keyword of the structured data resource from the titles obtained in step S2; or segmenting the titles obtained in step S2 into words, and forming the semantic vector of the requirement description keyword of the structured data resource from those words whose term frequency-inverse document frequency (TF-IDF) meets a preset requirement.
4. The method according to claim 1, characterized in that step A specifically comprises:
A11. after receiving the user query, calculating the similarity between the user query and the semantic vector of each requirement description keyword;
A12. selecting the structured data resource corresponding to each requirement description keyword whose similarity meets the preset similarity requirement.
5. The method according to claim 1, characterized in that step A specifically comprises:
A21. after receiving the user query, looking up a pre-established mapping between user queries and requirement description keywords, wherein the mapping is established by calculating the similarity between each user query in a search log and the semantic vector of each requirement description keyword, and associating each user query with the requirement description keywords whose similarity meets the preset similarity requirement;
A22. selecting the structured data resource corresponding to the requirement description keyword to which the user query is mapped.
6. The method according to claim 4 or 5, characterized in that calculating the similarity between the user query and the semantic vector of a requirement description keyword specifically comprises:
C1. determining the entries of the semantic vector that the user query hits and, for each hit entry, calculating the ratio of the hit length of the user query to the length of the user query; and/or calculating the similarity between the semantic vector of the user query and the semantic vector of each requirement description keyword;
C2. merging the calculation results of step C1 to obtain the similarity between the user query and the semantic vector of the requirement description keyword.
7. The method according to claim 6, characterized in that the semantic vector of the user query is established in at least one of the following ways:
using the search result titles of the user query to form the semantic vector of the user query; and
using the synonyms of the user query to form the semantic vector of the user query.
8. The method according to claim 1, characterized in that, before the semantic vector of the requirement description keyword corresponding to each structured data resource is established, the method further comprises: preprocessing the requirement description keyword corresponding to each structured data resource;
before step A, the method further comprises: preprocessing the user query;
the preprocessing comprises at least one of the following: converting to a preset uppercase or lowercase form, and converting to a preset encoding.
9. A device for identifying search needs, applied to structured data search, wherein each structured data resource provided by an external resource has a corresponding requirement description keyword; characterized in that the device comprises:
a semantic vector maintenance unit, configured to establish and store the semantic vector of the requirement description keyword corresponding to each structured data resource;
a request receiving unit, configured to receive a user search request (query);
a demand recognition unit, configured to select the structured data resource corresponding to each requirement description keyword for which the similarity between the user query and the semantic vector of the requirement description keyword meets a preset similarity requirement;
a retrieval processing unit, configured to perform the retrieval for the user query in the structured data resource selected by the demand recognition unit, and to return the search results obtained by the retrieval to the user.
10. The device according to claim 9, characterized in that the semantic vector maintenance unit comprises a vector storage subunit and at least one of a first vector maintenance subunit, a second vector maintenance subunit, and a third vector maintenance subunit;
the first vector maintenance subunit is configured to extract the content corresponding to specified tags from the description information of the structured data resource to form the semantic vector of the requirement description keyword of the structured data resource, and to provide it to the vector storage subunit;
the second vector maintenance subunit is configured to use the search result titles corresponding to the requirement description keyword of the structured data resource to form the semantic vector of the requirement description keyword of the structured data resource, and to provide it to the vector storage subunit;
the third vector maintenance subunit is configured to use the synonyms of the requirement description keyword of the structured data resource to form the semantic vector of the requirement description keyword of the structured data resource, and to provide it to the vector storage subunit;
the vector storage subunit is configured to store the semantic vectors of the requirement description keywords it receives.
11. The device according to claim 10, characterized in that the second vector maintenance subunit obtains the titles of the top N1 search results for the requirement description keyword corresponding to the structured data resource and forms the semantic vector of the requirement description keyword of the structured data resource from the obtained titles; or segments the obtained titles into words and forms the semantic vector of the requirement description keyword of the structured data resource from those words whose term frequency-inverse document frequency (TF-IDF) meets a preset requirement; N1 being a preset positive integer.
12. The device according to claim 9, characterized in that the demand recognition unit specifically comprises: a similarity calculation subunit and a first resource selection subunit;
the request receiving unit provides the received user query to the similarity calculation subunit;
the similarity calculation subunit is configured to calculate the similarity between the received user query and the semantic vector of each requirement description keyword maintained by the semantic vector maintenance unit;
the first resource selection subunit is configured to select, according to the calculation results of the similarity calculation subunit, the structured data resource corresponding to each requirement description keyword whose similarity meets the preset similarity requirement.
13. The device according to claim 9, characterized in that the demand recognition unit specifically comprises: a log selection subunit, a similarity calculation subunit, a mapping maintenance subunit, and a second resource selection subunit;
the log selection subunit is configured to obtain the user queries in a search log and provide them to the similarity calculation subunit;
the similarity calculation subunit is configured to calculate the similarity between each user query it receives and the semantic vector of each requirement description keyword maintained by the semantic vector maintenance unit;
the mapping maintenance subunit is configured to establish, according to the calculation results of the similarity calculation subunit, a mapping between each user query and the requirement description keywords whose similarity meets the preset similarity requirement;
the second resource selection subunit is configured to select the structured data resource corresponding to the requirement description keyword to which the user query received by the request receiving unit is mapped.
14. The device according to claim 12 or 13, characterized in that the similarity calculation subunit specifically comprises:
a similarity calculation module, configured to determine the entries of the semantic vector that the user query hits and, for each hit entry, to calculate the ratio of the hit length of the user query to the length of the user query; and/or to calculate the similarity between the semantic vector of the user query and the semantic vector of each requirement description keyword;
a result merging module, configured to merge the calculation results of the similarity calculation module to obtain the similarity between the user query and the semantic vector of the requirement description keyword.
15. The device according to claim 14, characterized in that the similarity calculation subunit further comprises a query vector establishing module, configured to form the semantic vector of the user query from the search result titles of the user query and/or from the synonyms of the user query.
16. The device according to claim 9, characterized in that the device further comprises:
a pretreatment unit, configured to preprocess the requirement description keyword corresponding to each structured data resource and provide the result to the semantic vector maintenance unit, and to preprocess the user query received by the request receiving unit and provide the result to the demand recognition unit;
wherein the preprocessing comprises at least one of the following: converting to a preset uppercase or lowercase form, and converting to a preset encoding.
CN201110181722.8A 2011-06-30 2011-06-30 A kind of need-based data retrieval method and device Active CN102855252B (en)


Publications (2)

Publication Number Publication Date
CN102855252A CN102855252A (en) 2013-01-02
CN102855252B true CN102855252B (en) 2015-09-09







Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant