CN110019646A - A method and apparatus for establishing an index

Publication number
CN110019646A
Authority
CN
China
Prior art keywords
candidate word
word
semantic item
candidate
mapping relations
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710946410.9A
Other languages
Chinese (zh)
Other versions
CN110019646B (en)
Inventor
焦晨晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201710946410.9A
Publication of CN110019646A
Application granted
Publication of CN110019646B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis

Abstract

The invention discloses a method and apparatus for establishing an index, and relates to the field of computer technology. One specific embodiment of the method includes: dividing an input text sequence into multiple words, and obtaining mapping relations between the words and their corresponding semantic items according to the words and the substrings of the words; establishing a first index according to these mapping relations; determining mapping relations between candidate words and their corresponding semantic items according to users' historical search data, and judging whether each such mapping relation is correct; and correcting the first index according to the result of the judgment to obtain a second index. In the commodity index recall stage, this embodiment can both maintain the recall rate, so that as many commodities relevant to the user's search term as possible are recalled, and improve accuracy, so that as many irrelevant commodities as possible are filtered out.

Description

A method and apparatus for establishing an index
Technical field
The present invention relates to the field of computer technology, and in particular to a method and apparatus for establishing an index.
Background art
With the development of network technology and logistics technology, e-commerce has risen rapidly in recent years. The number and variety of commodities on e-commerce platforms are growing exponentially, and large platforms carry billions of commodities, so meeting users' search needs efficiently and intelligently is a great challenge. Users' search terms vary widely: the same thing may go by colloquial names, short forms, abbreviations and many other names, whereas commodity names are usually standardized and lean toward written language. Direct text matching between the two therefore often fails.
In the early days of e-commerce search, commodities were relatively few, and an index could be built by splitting text into single characters, which maximized the chance of recalling the commodities a user needed. As the number and variety of commodities grew, this approach, which ignores word order, caused many problems. For example, when a user searched for "red wine", both the red wine one drinks and wine-red clothing were recalled. This led to the method of segmenting text into words before indexing, so that a fixed collocation such as "dragon fruit" is indexed as one whole-word index chain instead of being indexed separately as "fire", "dragon" and "fruit", which greatly improved the accuracy of search recall. However, word segmentation precision is limited, and users' search terms cannot always be matched directly against commodity names. For example, "one-piece dress for lady" in a commodity name is normally segmented as "lady | one-piece dress", but users often search for "skirt" or "summer skirt female", and text matching cannot recall the commodity. The index side therefore adds finer-grained words to the normal segmentation result. In the example above, "female" and "skirt" are added to the index field, so that the commodity titled "one-piece dress for lady" can also be recalled when a user searches for "skirt", "summer skirt female" and so on. These added finer-grained words are called "semantic items", and semantic items are the bridge between indexing by Chinese character and indexing by word.
Although existing indexing schemes add some fine-grained words (semantic items) to the segmentation result on the index side, defects remain in actual search applications. For example, a user who searches for "Le Gao" may get "Gao Le Gao" among the search results, and recalling commodities related to "Gao Le Gao" when the user searches for "Le Gao" clearly does not meet the user's expectation.
In the course of realizing the present invention, the inventor found that the prior art has at least the following problem:
The commodities recalled by existing schemes are inaccurate.
Summary of the invention
In view of this, embodiments of the present invention provide a method and apparatus for establishing an index which, in the commodity index recall stage, can both maintain the recall rate, so as to recall as many commodities relevant to the user's search term as possible, and improve accuracy, so as to filter out as many irrelevant commodities as possible.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for establishing an index is provided.
A method for establishing an index, comprising: dividing an input text sequence into multiple words, and obtaining mapping relations between the words and their corresponding semantic items according to the words and the substrings of the words; establishing a first index according to the mapping relations between the words and their corresponding semantic items; and determining, from historical search data, the mapping relation between a candidate word and the semantic item corresponding to the candidate word, so as to correct the first index according to that mapping relation and obtain a second index.
Optionally, the historical search data includes search terms and the text sequences clicked by the corresponding users, and the step of determining the mapping relation between a candidate word and its corresponding semantic item from the historical search data comprises: segmenting the search terms and the text sequences clicked by the users to obtain search-term segments and corresponding text-sequence segments; and determining the mapping relation between the candidate word and its corresponding semantic item from those search-term segments and text-sequence segments that stand in an inclusion relation.
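As a rough illustration of the inclusion-relation step, the sketch below pairs search-term segments with clicked-title segments when one strictly contains the other. The function name `candidate_pairs`, the choice to treat the shorter segment as the candidate semantic item, and the sample tokens (assumed original-language forms of the "skirt" and "one-piece dress" example from the background) are assumptions made for illustration, not details given in the text.

```python
def candidate_pairs(query_segments, title_segments):
    """Return (candidate_semantic_item, candidate_word) pairs.

    Assumption: when one segment strictly contains the other, the shorter
    segment is the candidate semantic item and the longer one the candidate word.
    """
    pairs = set()
    for q in query_segments:
        for t in title_segments:
            if q == t or not q or not t:
                continue  # identical or empty segments carry no inclusion relation
            if q in t:
                pairs.add((q, t))
            elif t in q:
                pairs.add((t, q))
    return pairs


# Hypothetical click record: the query "裙" (skirt) led to a click on a title
# segmented as "女士" (lady) | "连衣裙" (one-piece dress).
print(candidate_pairs(["裙"], ["女士", "连衣裙"]))
```

Pairs produced this way are only candidates; whether each one is correct is decided by the judgment criteria described in the later optional steps.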
Optionally, the historical search data includes search terms, the text sequences clicked by the corresponding users, and the acquisition tracks formed by the search terms, and the step of determining the mapping relation between a candidate word and its corresponding semantic item from the historical search data comprises: obtaining new words added to the first index in advance; looking up, in the acquisition tracks that contain a new word, the search terms that stand in an inclusion relation with the new word; and determining the mapping relation between the candidate word and its corresponding semantic item from the new word and the search terms that stand in an inclusion relation with it.
Optionally, the historical search data includes search terms and the text sequences clicked by the corresponding users, and the step of determining the mapping relation between a candidate word and its corresponding semantic item from the historical search data comprises: obtaining new words added to the first index in advance; and determining the mapping relation between the candidate word and its corresponding semantic item from a new word and those search terms that correspond to a particular text sequence and are substrings of the new word, where the particular text sequence is a user-clicked text sequence that contains the new word.
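The new-word variant can be sketched as a filter over a click log: keep the search terms whose clicked title contains the new word and which are themselves substrings of the new word. The name `pairs_from_new_word`, the `(search_term, clicked_title)` log layout and the sample strings (assumed original-language forms of the "A Di" and "Adidas" example) are hypothetical.

```python
def pairs_from_new_word(new_word, click_log):
    """Return (candidate_semantic_item, candidate_word) pairs for one new word.

    click_log is an iterable of (search_term, clicked_title) tuples; this
    layout is an assumption made for the sketch.
    """
    pairs = set()
    for term, title in click_log:
        if not term or term == new_word:
            continue  # skip empty terms and the new word itself
        # The clicked title must contain the new word, and the search term
        # must be a substring of the new word.
        if new_word in title and term in new_word:
            pairs.add((term, new_word))
    return pairs


log = [
    ("阿迪", "阿迪达斯运动鞋"),    # "A Di" -> title containing "Adidas"
    ("运动鞋", "阿迪达斯运动鞋"),  # not a substring of the new word
    ("阿迪", "耐克跑鞋"),          # title does not contain the new word
]
print(pairs_from_new_word("阿迪达斯", log))
```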
Optionally, after the step of determining the mapping relation between a candidate word and its corresponding semantic item from the historical search data, the method comprises: judging whether the mapping relation between the candidate word and its corresponding semantic item is correct, including: judging whether the overall usage ratio of the candidate word's corresponding semantic item in the mapping relation exceeds a first threshold; if so, the mapping relation is correct; otherwise, the mapping relation is wrong. The overall usage ratio of the candidate word's corresponding semantic item is the ratio of N1 to N2. N2 is the total click volume of the text sequences corresponding to all search terms that include the candidate word's corresponding semantic item. N1 is, within the total click volume of the text sequences that correspond to those search terms and are related to the mapping relation, the sum of the clicks recalled through the candidate word's corresponding semantic item.
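The N1/N2 test can be sketched as follows. The record layout (`ClickRecord` with a boolean `recalled_via_pair` flag), the function names and the sample numbers are assumptions for illustration; the text does not say how clicks are attributed to a semantic item in practice.

```python
from dataclasses import dataclass


@dataclass
class ClickRecord:
    search_term: str
    clicks: int
    recalled_via_pair: bool  # assumed flag: the clicked text was recalled through the candidate pair


def overall_usage_ratio(records, semantic_item):
    """N1 / N2 over all search terms that include the semantic item."""
    relevant = [r for r in records if semantic_item in r.search_term]
    n2 = sum(r.clicks for r in relevant)                         # all clicks
    n1 = sum(r.clicks for r in relevant if r.recalled_via_pair)  # clicks recalled via the pair
    return n1 / n2 if n2 else 0.0


def passes_overall_ratio(records, semantic_item, first_threshold):
    """The pair is judged correct when the ratio exceeds the first threshold."""
    return overall_usage_ratio(records, semantic_item) > first_threshold


records = [
    ClickRecord("tv", 80, True),
    ClickRecord("tv 55 inch", 20, False),
    ClickRecord("fridge", 50, True),  # ignored: does not include the semantic item
]
print(overall_usage_ratio(records, "tv"))
```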
Optionally, after the step of determining the mapping relation between a candidate word and its corresponding semantic item from the historical search data, the method comprises: judging whether the mapping relation between the candidate word and its corresponding semantic item is correct, including: among all search terms that include the candidate word's corresponding semantic item in the mapping relation, calculating the usage ratio of the candidate word's corresponding semantic item for each of a preset number of search terms with the highest usage rate; and judging whether, among this preset number of search terms, at least one search term satisfies all of the following conditions: the usage ratio of the candidate word's corresponding semantic item for the search term is greater than a second threshold, the search volume of the search term is greater than a third threshold, the click volume of the text sequences corresponding to the search term is greater than a fourth threshold, and the click-through rate of the text sequences corresponding to the search term is greater than a fifth threshold. If at least one such search term exists, the mapping relation between the candidate word and its corresponding semantic item is correct; otherwise, the mapping relation is wrong. For each search term, the usage ratio of the candidate word's corresponding semantic item is the ratio of N3 to N4. N4 is the total click volume of the text sequences corresponding to the search term. N3 is, within the total click volume of the text sequences that correspond to the search term and are related to the mapping relation, the sum of the clicks recalled through the candidate word's corresponding semantic item.
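A minimal sketch of this per-search-term check is given below. It assumes that the "usage rate" used for ranking is the same N3/N4 ratio, and that click-through rate means clicks divided by searches; both are interpretations, and `TermStats`, `passes_top_terms` and the threshold parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    term: str
    searches: int     # search volume of the term
    clicks: int       # N4: clicks on all text sequences for the term
    pair_clicks: int  # N3: clicks recalled through the candidate semantic item

    @property
    def usage_ratio(self):
        return self.pair_clicks / self.clicks if self.clicks else 0.0

    @property
    def click_rate(self):
        # Assumed definition of click-through rate.
        return self.clicks / self.searches if self.searches else 0.0


def passes_top_terms(stats, top_k, t2, t3, t4, t5):
    """True if any of the top_k terms by usage ratio meets all four conditions."""
    top = sorted(stats, key=lambda s: s.usage_ratio, reverse=True)[:top_k]
    return any(
        s.usage_ratio > t2 and s.searches > t3 and s.clicks > t4 and s.click_rate > t5
        for s in top
    )


stats = [TermStats("a", 1000, 400, 300), TermStats("b", 10, 5, 5)]
print(passes_top_terms(stats, top_k=2, t2=0.5, t3=100, t4=100, t5=0.2))
```

Requiring volume and click-through thresholds alongside the ratio keeps a rarely searched term with a handful of clicks from validating a pair on its own.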
Optionally, after the step of determining the mapping relation between a candidate word and its corresponding semantic item from the historical search data, the method comprises: judging whether the mapping relation between the candidate word and its corresponding semantic item is correct, including: judging whether the candidate word's corresponding semantic item in the mapping relation is similar to the candidate word; if the candidate word's corresponding semantic item is similar to the candidate word, the mapping relation is correct; otherwise, the mapping relation is wrong.
Optionally, the step of judging whether the candidate word's corresponding semantic item in the mapping relation is similar to the candidate word comprises: obtaining user search data within a preset historical time period, the user search data including search terms and the users' click volume or search volume, for the retrieved text sequences, over multiple categories at each preset level; generating a first corresponding-semantic-item vector for each preset level from the click volume or search volume over the categories at that level when the search term includes the candidate word's corresponding semantic item, and generating a first candidate word vector for each preset level from the click volume or search volume over the categories at that level when the search term includes the candidate word; calculating, for each preset level, the vector similarity between the first corresponding-semantic-item vector and the first candidate word vector, to obtain multiple first vector similarities, one per preset level; and judging whether each first vector similarity is greater than a preset sixth threshold corresponding to its level; if so, the candidate word's corresponding semantic item is similar to the candidate word; otherwise, they are dissimilar.
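The category-distribution comparison can be sketched with cosine similarity per level. The nested-dictionary layout, the function names, the category labels and the click counts below are invented for illustration; only the idea of comparing per-level category vectors against per-level thresholds comes from the text.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def category_vector(counts, categories):
    """Order click (or search) counts by a fixed category list, filling gaps with 0."""
    return [counts.get(c, 0) for c in categories]


def similar_at_all_levels(item_counts, word_counts, categories_by_level, thresholds):
    """True only if the similarity exceeds the per-level threshold at every level."""
    for level, categories in categories_by_level.items():
        sim = cosine(
            category_vector(item_counts[level], categories),
            category_vector(word_counts[level], categories),
        )
        if sim <= thresholds[level]:
            return False
    return True


categories_by_level = {1: ["food", "toys"], 2: ["drinks", "bricks"]}
item_counts = {1: {"toys": 90, "food": 10}, 2: {"bricks": 95, "drinks": 5}}  # semantic item
word_counts = {1: {"food": 100}, 2: {"drinks": 100}}                          # candidate word
print(similar_at_all_levels(item_counts, word_counts, categories_by_level, {1: 0.8, 2: 0.8}))
```

In this made-up data the semantic item is clicked mostly under toys while the candidate word is clicked under food, so the pair fails the check at the first level.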
Optionally, the step of judging whether the candidate word's corresponding semantic item in the mapping relation is similar to the candidate word comprises: generating, by a preset model, a second corresponding-semantic-item vector and a second candidate word vector; calculating the vector similarity between the second corresponding-semantic-item vector and the second candidate word vector to obtain a second vector similarity; and judging whether the second vector similarity is greater than a seventh threshold; if so, the candidate word's corresponding semantic item is similar to the candidate word; otherwise, they are dissimilar.
Optionally, the vector similarity is cosine similarity or Jaccard similarity.
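The text does not say which form of Jaccard similarity is meant. One form that applies directly to non-negative count vectors is the weighted (generalized) Jaccard similarity, sketched below as an assumption; for 0/1 vectors it reduces to the usual set-based Jaccard index. The function name `weighted_jaccard` is hypothetical.

```python
def weighted_jaccard(u, v):
    """Weighted Jaccard similarity of two non-negative, equal-length vectors.

    Computed as sum(min(u_i, v_i)) / sum(max(u_i, v_i)); returns 0.0 when
    both vectors are all zeros.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(a, b) for a, b in zip(u, v))
    denominator = sum(max(a, b) for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


# Binary vectors behave like sets: one shared element out of three in the union.
print(weighted_jaccard([1, 1, 0], [1, 0, 1]))
```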
Optionally, the step of correcting the first index according to the mapping relation between the candidate word and its corresponding semantic item comprises: when the mapping relation is correct, adding the mapping relation to the first index if the first index does not already contain it; and when the mapping relation is wrong, deleting the mapping relation from the first index if the first index contains it.
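The correction step amounts to an add-or-delete pass over judged pairs. The sketch below assumes, purely for illustration, that the first index is held as a dictionary from semantic item to a set of words; the function name `correct_index` and this data layout are not taken from the text.

```python
def correct_index(first_index, judged_pairs):
    """Return a second index built from first_index and judged candidate pairs.

    first_index: dict mapping semantic item -> set of words (assumed layout).
    judged_pairs: iterable of ((semantic_item, word), is_correct).
    """
    second_index = {item: set(words) for item, words in first_index.items()}
    for (item, word), is_correct in judged_pairs:
        if is_correct:
            # Add the mapping if it is not already present.
            second_index.setdefault(item, set()).add(word)
        else:
            # Delete the mapping if it is present.
            words = second_index.get(item)
            if words is not None:
                words.discard(word)
                if not words:
                    del second_index[item]
    return second_index


first = {"Le Gao": {"Gao Le Gao"}, "TV": {"television set"}}
judged = [(("Le Gao", "Gao Le Gao"), False), (("A Di", "Adidas"), True)]
print(correct_index(first, judged))
```

The function copies the first index rather than mutating it, so the two versions can be compared after a periodic correction run.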
According to another aspect of the embodiments of the present invention, an apparatus for establishing an index is provided.
An apparatus for establishing an index, comprising: a word pair generation module, configured to divide an input text sequence into multiple words, and to obtain mapping relations between the words and their corresponding semantic items according to the words and the substrings of the words; an index establishing module, configured to establish a first index according to the mapping relations between the words and their corresponding semantic items; and a correction module, configured to determine, from historical search data, the mapping relation between a candidate word and the semantic item corresponding to the candidate word, and to correct the first index according to that mapping relation to obtain a second index.
Optionally, the historical search data includes search terms and the text sequences clicked by the corresponding users, and the correction module includes a first determining module, configured to: segment the search terms and the text sequences clicked by the users to obtain search-term segments and corresponding text-sequence segments; and determine the mapping relation between the candidate word and its corresponding semantic item from those search-term segments and text-sequence segments that stand in an inclusion relation.
Optionally, the historical search data includes search terms, the text sequences clicked by the corresponding users, and the acquisition tracks formed by the search terms, and the correction module includes a second determining module, configured to: obtain new words added to the first index in advance; look up, in the acquisition tracks that contain a new word, the search terms that stand in an inclusion relation with the new word; and determine the mapping relation between the candidate word and its corresponding semantic item from the new word and the search terms that stand in an inclusion relation with it.
Optionally, the historical search data includes search terms and the text sequences clicked by the corresponding users, and the correction module includes a third determining module, configured to: obtain new words added to the first index in advance; and determine the mapping relation between the candidate word and its corresponding semantic item from a new word and those search terms that correspond to a particular text sequence and are substrings of the new word, where the particular text sequence is a user-clicked text sequence that contains the new word.
Optionally, the correction module includes a first judgment module, configured to: judge whether the mapping relation between the candidate word and its corresponding semantic item is correct, including: judging whether the overall usage ratio of the candidate word's corresponding semantic item in the mapping relation exceeds a first threshold; if so, the mapping relation is correct; otherwise, the mapping relation is wrong. The overall usage ratio of the candidate word's corresponding semantic item is the ratio of N1 to N2. N2 is the total click volume of the text sequences corresponding to all search terms that include the candidate word's corresponding semantic item. N1 is, within the total click volume of the text sequences that correspond to those search terms and are related to the mapping relation, the sum of the clicks recalled through the candidate word's corresponding semantic item.
Optionally, the word pair judgment module includes a second judgment module, configured to: judge whether the mapping relation between the candidate word and its corresponding semantic item is correct, including: among all search terms that include the candidate word's corresponding semantic item in the mapping relation, calculating the usage ratio of the candidate word's corresponding semantic item for each of a preset number of search terms with the highest usage rate; and judging whether, among this preset number of search terms, at least one search term satisfies all of the following conditions: the usage ratio of the candidate word's corresponding semantic item for the search term is greater than a second threshold, the search volume of the search term is greater than a third threshold, the click volume of the text sequences corresponding to the search term is greater than a fourth threshold, and the click-through rate of the text sequences corresponding to the search term is greater than a fifth threshold. If at least one such search term exists, the mapping relation is correct; otherwise, the mapping relation is wrong. For each search term, the usage ratio of the candidate word's corresponding semantic item is the ratio of N3 to N4. N4 is the total click volume of the text sequences corresponding to the search term. N3 is, within the total click volume of the text sequences that correspond to the search term and are related to the mapping relation, the sum of the clicks recalled through the candidate word's corresponding semantic item.
Optionally, the word pair judgment module includes a third judgment module, configured to: judge whether the mapping relation between the candidate word and its corresponding semantic item is correct, including: judging whether the candidate word's corresponding semantic item in the mapping relation is similar to the candidate word; if the candidate word's corresponding semantic item is similar to the candidate word, the mapping relation is correct; otherwise, the mapping relation is wrong.
Optionally, the third judgment module includes a first similarity judgment module, configured to: obtain user search data within a preset historical time period, the user search data including search terms and the users' click volume or search volume, for the retrieved text sequences, over multiple categories at each preset level; generate a first corresponding-semantic-item vector for each preset level from the click volume or search volume over the categories at that level when the search term includes the candidate word's corresponding semantic item, and generate a first candidate word vector for each preset level from the click volume or search volume over the categories at that level when the search term includes the candidate word; calculate, for each preset level, the vector similarity between the first corresponding-semantic-item vector and the first candidate word vector, to obtain multiple first vector similarities, one per preset level; and judge whether each first vector similarity is greater than a preset sixth threshold corresponding to its level; if so, the candidate word's corresponding semantic item is similar to the candidate word; otherwise, they are dissimilar.
Optionally, the third judgment module includes a second similarity judgment module, configured to: generate, by a preset model, a second corresponding-semantic-item vector and a second candidate word vector; calculate the vector similarity between the second corresponding-semantic-item vector and the second candidate word vector to obtain a second vector similarity; and judge whether the second vector similarity is greater than a seventh threshold; if so, the candidate word's corresponding semantic item is similar to the candidate word; otherwise, they are dissimilar.
Optionally, the vector similarity is cosine similarity or Jaccard similarity.
Optionally, the correction module is further configured to: when the mapping relation between the candidate word and its corresponding semantic item is correct, add the mapping relation to the first index if the first index does not already contain it; and when the mapping relation is wrong, delete the mapping relation from the first index if the first index contains it.
According to another aspect of the embodiments of the present invention, a server is provided.
A server, comprising: one or more processors; and a memory for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for establishing an index.
According to another aspect of the embodiments of the present invention, a computer-readable medium is provided.
A computer-readable medium storing a computer program, characterized in that the program, when executed by a processor, implements the method for establishing an index.
One embodiment of the above invention has the following advantages or beneficial effects: an input text sequence is divided into multiple words; mapping relations between the words and their corresponding semantic items are obtained according to the words and their substrings; a first index is established according to these mapping relations; and the mapping relation between a candidate word and its corresponding semantic item is determined from historical search data, so that the first index is corrected according to that mapping relation to obtain a second index. In the commodity index recall stage, this can both maintain the recall rate, so as to recall as many commodities relevant to the user's search term as possible, and improve accuracy, so as to filter out as many irrelevant commodities as possible.
Further effects of the above non-conventional optional modes are described below in connection with specific embodiments.
Brief description of the drawings
The drawings are provided for a better understanding of the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a schematic diagram of the main steps of the method for establishing an index according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the main modules of the apparatus for establishing an index according to an embodiment of the present invention;
Fig. 3 is a diagram of an exemplary system architecture to which embodiments of the present invention can be applied;
Fig. 4 is a schematic structural diagram of a computer system suitable for implementing the server of an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described below with reference to the drawings, including various details of the embodiments to aid understanding; these should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described here without departing from the scope and spirit of the present invention. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
Fig. 1 is a schematic diagram of the main steps of the method for establishing an index according to an embodiment of the present invention.
For ease of expression below, the mapping relation between a word and its corresponding semantic item is written as a "semantic item-word pair", and the mapping relation between a candidate word and its corresponding semantic item is written as a "candidate semantic item-word pair". The semantic item in a "semantic item-word pair" is the corresponding semantic item of the word in that pair. The word in a "candidate semantic item-word pair" may also be called the candidate word, and the candidate semantic item in the pair is the corresponding semantic item of that candidate word (that is, the candidate word's corresponding semantic item). The candidate word and its corresponding semantic item are, respectively, a word and a semantic item determined from users' historical search data. Accordingly, a "candidate semantic item-word pair" is a mapping relation between a word and a semantic item determined from users' historical search data. Whether such a mapping relation is correct must be decided by certain judgment criteria (the specific criteria for judging whether a candidate semantic item-word pair is correct are described in detail later). If a candidate semantic item-word pair is correct, that is, the mapping relation between the candidate word and its corresponding semantic item is correct, then the candidate semantic item is a semantic item of the candidate word. If a candidate semantic item-word pair is wrong, that is, the mapping relation is wrong, then the candidate semantic item is not a semantic item of the candidate word. Embodiments of the present invention periodically delete wrong candidate semantic item-word pairs from the index (the first index) and periodically add new, correct candidate semantic item-word pairs to it, thereby continually correcting the first index to obtain the second index.
As shown in Fig. 1, the method for establishing an index according to the embodiment of the present invention mainly includes the following steps S101 to S104.
Step S101: divide the input text sequence into multiple words, and obtain multiple semantic item-word pairs from these words and their substrings.
The input text sequence may be commodity title information.
The input text sequence is segmented according to the existing word segmentation vocabulary, i.e. it is cut into multiple individual words. All substrings of each of these words are then found, where a substring of a word consists of one character of the word or of several (two or more) adjacent characters of the word. For example, if a word is "Gao Legao" (three characters: "Gao", "Le", "Gao"), its substrings are: "Gao", "Le", "Gao", "Gao Le", "Le Gao". All substrings of a word are regarded as semantic items of the word, which yields multiple semantic item-word pairs. The word in a semantic item-word pair may be called the original word of the semantic item.
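The substring enumeration described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, each character is treated as one unit, and the word itself is excluded from its own substrings, which is inferred from the "Gao Legao" example rather than stated outright.

```python
def proper_substrings(word):
    """Return all contiguous substrings of `word` shorter than the word itself."""
    n = len(word)
    subs = []
    for length in range(1, n):                # lengths 1 .. n-1 (assumption: word excluded)
        for start in range(n - length + 1):
            subs.append(word[start:start + length])
    return subs


def semantic_item_word_pairs(words):
    """Build the set of (semantic item, original word) pairs for segmented words."""
    return {(item, word) for word in words for item in proper_substrings(word)}


if __name__ == "__main__":
    print(proper_substrings("abc"))           # single characters first, then pairs
    print(sorted(semantic_item_word_pairs(["abc"])))
```

Because the pairs are collected in a set, a repeated character (such as the two "Gao" characters in the example) yields only one pair.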
Step S102: establish the first index according to the multiple semantic item-word pairs.
Because all substrings of a word are regarded as its semantic items, the semantic items include suitable ones (where the mapping relation between semantic item and word is correct), such as TV-television set and A Di-Adidas, where "TV" is a semantic item of "television set" and "A Di" is a semantic item of "Adidas". When users search with such semantic items as search terms, the recalled commodities usually meet their expectations. The semantic items also include unsuitable ones (where the mapping relation between semantic item and word is wrong), such as pineapple-jackfruit and Le Gao-Gao Legao. Clearly, the "jackfruit" and "Gao Legao" commodities recalled when a user searches for "pineapple" or "Le Gao" are not the results the user expects, so commodities recalled with such semantic items as search terms usually do not meet the user's expectations.
The first index of the embodiment of the present invention may be an inverted index. A "word-document" matrix is built to store the mapping relations between search terms and semantic items (which play the role of "words") and text sequences (which play the role of "documents"). With the inverted index, the set of documents containing a given word can be found quickly.
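As a rough illustration of such an inverted index, the sketch below maps every word and every one of its semantic items to the set of documents (commodity titles) it can recall. The function name, the dictionary layout, and the sample data are assumptions made for this example only.

```python
from collections import defaultdict


def build_first_index(titles, semantic_items_of):
    """titles: {doc_id: [segmented words]}; semantic_items_of: {word: [semantic items]}.

    Returns a mapping from each word or semantic item to the set of doc ids
    whose title contains the word (or the original word of the semantic item).
    """
    index = defaultdict(set)
    for doc_id, words in titles.items():
        for word in words:
            index[word].add(doc_id)
            for item in semantic_items_of.get(word, ()):
                index[item].add(doc_id)
    return index


if __name__ == "__main__":
    titles = {1: ["television set", "stand"], 2: ["adidas", "shoes"]}
    items = {"television set": ["tv"], "adidas": ["adi"]}
    index = build_first_index(titles, items)
    print(sorted(index["tv"]), sorted(index["adi"]))
```

Deleting a wrong pair later then amounts to removing the affected doc ids from the entry of that semantic item, which is why a per-term set is a convenient structure here.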
After the first index is established, candidate semantic item-word pairs are determined from historical search data, and the first index is corrected according to these pairs to obtain the second index. This may specifically include the following steps S103 and S104.
Step S103: determine candidate semantic item-word pairs according to the user's historical search data, and judge whether each candidate semantic item-word pair is correct.
The user's historical search data may be obtained from the user search click logs of the App (application) side and the PC (personal computer) side. It may consist of the queries (search terms) that users searched with in the last three months and the text sequences they clicked, where a text sequence may be a commodity title written and entered by a merchant. The association between a query and the clicked commodity title may be written as: query-commodity title. The word (i.e. candidate word) in a candidate semantic item-word pair may also be called the original word of the candidate semantic item in that pair.
Specifically, candidate semantic item-word pairs may be determined from the user's historical search data in the following three modes (mode A1 to mode A3).
Mode A1: segment the search term and the text sequence clicked by the user to obtain search term tokens and the corresponding text sequence tokens; determine candidate semantic item-word pairs from those search term tokens and text sequence tokens between which an inclusion relation exists. Here the historical search data may include search terms and the text sequences clicked by the corresponding users.
Before the search term and the clicked text sequence are segmented, they may also be normalized. Specifically, the query (search term) and the commodity title are normalized, which may include converting uppercase to lowercase, traditional Chinese to simplified Chinese, full-width characters to half-width characters, and similar processing.
To determine candidate semantic item-word pairs from the search term tokens and text sequence tokens between which an inclusion relation exists, the commodity title may be scanned and every character string in it that contains a word of the query selected, forming candidate semantic item-word pairs. Only character strings of which the word is a proper substring are selected; a character string identical to the word is not considered. For example:
Query: lumbar traction;
Commodity title: Yonghui b04 human body stretcher cervical vertebra traction table traction frame cervical spine traction device traction household;
Scanning the commodity title, the character strings that contain the query word "traction" are "traction table", "traction frame" and "traction device", so the candidate semantic item-word pairs formed are: traction-traction table, traction-traction frame, traction-traction device. The "traction" in the commodity title is identical to the "traction" in the query and is therefore not selected.
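Mode A1, including the normalization step, can be sketched as below. This is an illustrative sketch under stated assumptions: segmentation is assumed to have been done already (tokens are passed in), traditional-to-simplified conversion is omitted because it needs an external mapping table, and the function names are hypothetical. `unicodedata.normalize("NFKC", ...)` is used for the full-width to half-width conversion.

```python
import unicodedata


def normalize(text):
    """Lowercase and fold full-width characters to half-width (NFKC)."""
    return unicodedata.normalize("NFKC", text).lower()


def candidate_pairs_a1(query_tokens, title_tokens):
    """Pair each query token with every title token that strictly contains it."""
    pairs = set()
    for q in map(normalize, query_tokens):
        for t in map(normalize, title_tokens):
            if q != t and q in t:          # identical strings are skipped
                pairs.add((q, t))          # (candidate semantic item, candidate word)
    return pairs


if __name__ == "__main__":
    query = ["lumbar", "traction"]
    title = ["stretcher", "traction table", "traction frame",
             "traction device", "traction"]
    for pair in sorted(candidate_pairs_a1(query, title)):
        print(pair)
```

The example reproduces the "lumbar traction" case: the bare "traction" token in the title is ignored because it is identical to the query token.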
Mode A1 uses the user's historical search data: from the query-commodity title associations in the historical search feedback data it determines mapping relations between some semantic items and words and establishes the corresponding candidate semantic item-word pairs. By judging whether these candidate semantic item-word pairs are correct, it can be determined whether the mapping relations between the corresponding semantic items and words in the first index are suitable.
As new things appear or existing things acquire new names, new words keep emerging. Whenever a new word is added to the existing word segmentation vocabulary, its semantic items must be added accordingly. For example, when a new product "octopus foot" appears, a user who searches for "octopus" may be looking for commodities related to "octopus foot", so the semantic item "octopus" needs to be added for "octopus foot", i.e. the semantic item-word pair "octopus-octopus foot" is determined. New words can be identified by any existing new word identification algorithm, for example the following: over the commodity titles of all commodities in the commodity library, join every run of adjacent characters of up to 4 characters into a candidate new word; compute the left and right information entropy and the cohesion of the candidate new word; when the left and right information entropy and the cohesion all exceed certain thresholds, judge the candidate new word to be an independent word; then take the difference between the independent words and the existing word segmentation vocabulary, and if an independent word is not in the vocabulary, treat it as a new word. Modes A2 and A3 of the embodiment of the present invention describe how semantic items are added for the identified new words.
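The left and right information entropy used in this new word identification step can be sketched as follows. The sketch uses the standard Shannon entropy of the characters that appear immediately to the left and right of a candidate; the cohesion score is not defined in the text and is deliberately left out. Function names and the toy titles are hypothetical.

```python
import math
from collections import Counter


def branch_entropy(neighbors):
    """Shannon entropy (natural log) of a list of neighboring characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in counts.values())


def left_right_entropy(candidate, titles):
    """Collect the characters adjacent to each occurrence of `candidate`."""
    left, right = [], []
    for title in titles:
        start = title.find(candidate)
        while start != -1:
            if start > 0:
                left.append(title[start - 1])
            end = start + len(candidate)
            if end < len(title):
                right.append(title[end])
            start = title.find(candidate, start + 1)
    return branch_entropy(left), branch_entropy(right)


if __name__ == "__main__":
    print(left_right_entropy("ab", ["xaby", "zabw", "xabw"]))
```

A candidate whose neighbors vary widely on both sides (high entropy) behaves like a free-standing word, which is the intuition behind comparing both values against a threshold.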
Mode A2: obtain the new words that have been added to the first index in advance; in the acquisition tracks that contain a new word, look up the search terms that have an inclusion relation with the new word, and determine candidate semantic item-word pairs from the new word and those search terms. Here the historical search data may include search terms, the text sequences clicked by the corresponding users, and the acquisition tracks formed by the search terms.
After a new word is identified, it can be added to the existing word segmentation vocabulary. When queries and commodity titles are then segmented, the new word in the vocabulary is split off as an independent word and added to the first index.
An acquisition track may be the information of a user search session. Within one search session, after searching for one word, the user replaces it with another search term because the desired search results were not obtained (for example, too few or too many results were returned); a search session thus represents a series of related search behaviors of the user. For example, after a user searches with "shrimp" as the search term, the commodity titles found include all kinds of shrimp; if the result the user expects is "crayfish", the user may change the search term to "crayfish". As another example, after a user searches with "snacks Crispy Durian Cake", few commodities are recalled (i.e. found), or some commodity titles contain both Crispy Durian Cake and durian cake, and the user, wanting to find more related commodities, may change the search term to "snacks durian". From the information of these search sessions, the two acquisition tracks of the user are obtained: "shrimp → crayfish" and "snacks Crispy Durian Cake → snacks durian". Search terms that have an inclusion relation with a new word are looked up in the acquisition tracks containing that new word, and the new word together with each such search term is determined as a candidate semantic item-word pair. In the two examples above, suppose "crayfish" and "durian" are new words; in the acquisition tracks, "shrimp" and "Crispy Durian Cake" are the search terms that have an inclusion relation with the new words "crayfish" and "durian" respectively, so the candidate semantic item-word pairs are: shrimp-crayfish; durian-Crispy Durian Cake.
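A minimal sketch of mode A2 follows, assuming each acquisition track is a list of already segmented queries and that the shorter string of a containing pair is the candidate semantic item. The function name is hypothetical, and "fish"/"crayfish" is an English stand-in for the "shrimp"/"crayfish" example, chosen so that the inclusion relation is visible in English.

```python
def candidate_pairs_a2(tracks, new_words):
    """tracks: list of sessions, each a list of queries, each a list of terms.

    For every new word that occurs in a track, pair it with every other term
    of that track that contains it or is contained in it.
    """
    pairs = set()
    for track in tracks:
        terms = {term for query in track for term in query}
        for new_word in new_words & terms:
            for term in terms:
                if term == new_word:
                    continue
                if term in new_word:           # term is a substring of the new word
                    pairs.add((term, new_word))
                elif new_word in term:         # new word is a substring of the term
                    pairs.add((new_word, term))
    return pairs


if __name__ == "__main__":
    tracks = [
        [["snacks", "durian cake"], ["snacks", "durian"]],
        [["fish"], ["crayfish"]],
    ]
    print(sorted(candidate_pairs_a2(tracks, {"durian", "crayfish"})))
```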
Mode A3: obtain the new words that have been added to the first index in advance; determine candidate semantic item-word pairs from a new word and those search terms which correspond to a particular text sequence and are substrings of the new word. Here a particular text sequence is a text sequence clicked by the user that contains the new word, and the historical search data may include search terms and the text sequences clicked by the corresponding users.
A new word in a commodity title can be added to the existing word segmentation vocabulary, so that when queries and commodity titles are segmented, the new word in the vocabulary is split off as an independent word and added to the first index. The candidate semantic item-word pairs corresponding to the new word can be determined from the user's historical search data. Specifically, the historical search data of the last three months may be used, which include search terms and the text sequences clicked by the corresponding users. Some of the clicked text sequences are particular text sequences that contain the new word, and candidate semantic item-word pairs are determined from the new word and those search terms which correspond to the particular text sequences and are substrings of the new word. For example, suppose the commodity titles contain a new word "Adidas". According to mode A3, among the queries users used and the commodity titles they clicked in the last three months, take the clicked commodity titles that contain the new word "Adidas", and from the queries corresponding to those titles select the ones that are substrings of "Adidas". Each such substring and "Adidas" are determined as a candidate semantic item-word pair. If one query that is a substring of the new word "Adidas" is "A Di", the candidate semantic item-word pair is "A Di-Adidas".
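Mode A3 can be sketched in the same style. The click log layout (segmented query terms plus the clicked title) and the function name are assumptions for illustration, and "adi"/"adidas" stands in for the "A Di-Adidas" example.

```python
def candidate_pairs_a3(click_log, new_word):
    """click_log: list of (query_terms, clicked_title) records.

    Only titles containing the new word count as particular text sequences;
    query terms that are proper substrings of the new word become semantic items.
    """
    pairs = set()
    for query_terms, title in click_log:
        if new_word not in title:
            continue                           # not a particular text sequence
        for term in query_terms:
            if term != new_word and term in new_word:
                pairs.add((term, new_word))
    return pairs


if __name__ == "__main__":
    log = [
        (["adi"], "adidas running shoes"),
        (["shoes"], "adidas running shoes"),
        (["adi"], "other brand shoes"),
    ]
    print(candidate_pairs_a3(log, "adidas"))
```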
It should be noted that the method of determining candidate semantic item-word pairs from the user's historical search data in the embodiment of the present invention may use one of the above three modes (mode A1 to mode A3) or a combination of several of them.
Whether the candidate semantic item-word pairs determined by modes A1 to A3 are correct can be judged by one of the following three modes (mode B1 to mode B3).
The specific methods by which modes B1, B2 and B3 each determine whether a candidate semantic item-word pair is correct are illustrated below with an example. Suppose a candidate semantic item-word pair is "fan-electric fan". The following are the search and click data of the 20 search terms (queries) that contain the candidate semantic item "fan" and have the largest corresponding commodity title click volume.
1, fan: 416734/494199 (0.84) 1690445 | |
2, fan landing: 73795/80872 (0.91) 196629 | |
3, beautiful fan: 67226/68015 (0.99) 147385 | |
4, bladeless fan: 23185/47864 (0.48) 133882 | |
5, fan is desk-top: 23055/28270 (0.82) 94839 | |
6, Amy spy fan: 18767/21304 (0.88) 43743 | |
7, Gree fan: 17340/21056 (0.82) 45162 | |
8, fan is small: 14204/26035 (0.55) 132645 | |
9, flooring fan: 13445/14779 (0.91) 36242 | |
10, bracker fan: 11801/14479 (0.82) 43809 | |
11, refrigerating fan: 9494/13310 (0.71) 25880 | |
12, mini fan: 6645/31710 (0.21) 98000 | |
13, usb fan: 6382/44415 (0.14) 113902 | |
14, quiet fan: 6114/8217 (0.74) 24376 | |
15, fan is mute: 5648/7432 (0.76) 25114 | |
16, oaks fan: 3934/4628 (0.85) 12116 | |
17, fan is mini: 3488/10803 (0.32) 34133 | |
18, remote control fan: 3280/3534 (0.93) 7300 | |
19, Industrial fan: 3246/4319 (0.75) 13773 | |
20, beautiful fan stand fan: 3211/3212 (1) 6540 | |
For each of the 20 queries, the corresponding commodity title click volume is the sum of the users' clicks on all commodity titles found and clicked with that query. Each line of the search and click data above has the format "query1: value of N3/value of N4 (result of N3/N4) search volume of query1 | |", where query1 stands for any one of the 20 queries and N3/N4 is the ratio of N3 to N4. Taking query1 = "fan landing" as an example, the corresponding line is "fan landing: 73795/80872 (0.91) 196629 | |", where 73795 is the value of N3, 80872 is the value of N4, the result of N3/N4 is 0.91, and the search volume of query1 is 196629. The specific meanings of N3 and N4 are explained below.
N3/N4 is the candidate semantic item usage ratio of a search term, computed for the preset number of search terms with the highest usage rate among all search terms that include the candidate semantic item of the candidate semantic item-word pair. N3 is, within the total click volume of the text sequences that correspond to the search term and are related to the candidate semantic item-word pair, the sum of the clicks recalled through the candidate semantic item. The text sequences that correspond to the search term and are related to the pair are those, among the text sequences found and clicked with the search term, that are related to the pair. A text sequence is related to the pair if, among the text sequences found and clicked with a search term (query) that includes the candidate semantic item, it contains the candidate semantic item or the candidate word of the pair. The sum of the clicks recalled through the candidate semantic item is the sum of the users' clicks on those related text sequences that contain the candidate word of the pair. In this example the candidate semantic item-word pair is "fan-electric fan". Suppose the query "fan landing" finds 1000 commodity titles, of which 800 contain "fan" or "electric fan" (for example, the commodity titles "beautiful landing electric fan" and "beautiful landing fan") and 200 do not (one such commodity title being "beautiful landing cold fan"). The 800 commodity titles that contain "fan" or "electric fan" are the text sequences related to the pair "fan-electric fan", while the 200 commodity titles that contain neither "fan" nor "electric fan" are not related to it (they may be related to other candidate semantic item-word pairs). Suppose further that of the 800 commodity titles, 300 contain "electric fan" and 500 contain "fan". By the definition of N3, the sum of the users' clicks on the 300 commodity titles containing "electric fan" is N3. In the search and click data of this example, "fan landing: 73795/80872 (0.91) 196629 | |", the sum of the users' clicks on the 300 commodity titles containing "electric fan" is 73795, i.e. N3 = 73795.
N4 is the total text sequence click volume corresponding to the search term. Specifically, among the text sequences found and clicked with the search term, at least one of which contains the candidate word of the candidate semantic item-word pair, N4 is the sum of the users' clicks on all of these text sequences. In this example the pair is "fan-electric fan". Taking the query "fan landing", suppose users find 1000 commodity titles with it; at least one of the 1000 commodity titles should contain "electric fan", and the total click volume of the 1000 commodity titles is then N4. In the search and click data of this example, "fan landing: 73795/80872 (0.91) 196629 | |", the total click volume of the 1000 commodity titles is 80872, i.e. N4 = 80872. The ratio of N3 to N4 is 73795/80872 = 0.91. If all the commodity titles found with the query "fan landing" were related only to other candidate semantic item-word pairs, the data obtained would be meaningless for judging whether the pair "fan-electric fan" is correct. It is therefore required that at least one of the 1000 commodity titles contain the candidate word "electric fan" of the pair, which guarantees that at least one of the commodity titles found and clicked is related to the pair "fan-electric fan".
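The definitions of N3 and N4 for a single search term can be illustrated as below. This is a sketch under simplifying assumptions: the input is a list of (title, clicks) records for one query, containment is tested by plain substring matching, and the function name and sample titles are hypothetical.

```python
def usage_ratio(clicked_titles, candidate_word):
    """Return (N3, N4, N3/N4) for one search term, or None if not applicable.

    N3: clicks on titles that contain the candidate word.
    N4: clicks on all titles found and clicked with the search term.
    """
    if not any(candidate_word in title for title, _ in clicked_titles):
        return None                 # no title is related through the candidate word
    n3 = sum(clicks for title, clicks in clicked_titles if candidate_word in title)
    n4 = sum(clicks for _, clicks in clicked_titles)
    if n4 == 0:
        return None                 # nothing was clicked, so the ratio is undefined
    return n3, n4, n3 / n4


if __name__ == "__main__":
    records = [
        ("floor electric fan", 300),   # contains the candidate word "electric fan"
        ("floor fan", 100),            # contains only the semantic item "fan"
        ("floor air cooler", 100),     # unrelated to the pair
    ]
    print(usage_ratio(records, "electric fan"))
```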
The search volume of query1 is the total number of text sequences (for example commodity titles) found when users search with query1 as the search term. In the search and click data of this example, for query1 = "fan landing" the corresponding line is "fan landing: 73795/80872 (0.91) 196629 | |", so the search volume of the search term "fan landing" is 196629.
Mode B1: judge whether the overall usage ratio of the candidate semantic item in the candidate semantic item-word pair exceeds the first threshold (denoted threshold_a). If so, the candidate semantic item-word pair is correct; otherwise it is wrong. The overall usage ratio of the candidate semantic item is the ratio of N1 to N2, where N1 is, within the total click volume of all text sequences that correspond to search terms including the candidate semantic item and are related to the pair, the sum of the clicks recalled through the candidate semantic item, and N2 is the total click volume of all text sequences corresponding to search terms including the candidate semantic item.
In this example, suppose there are W search terms that include the candidate semantic item "fan" (the example shows the 20 of them with the largest corresponding commodity title click volume), and users find and click 5000 commodity titles with these W queries. Suppose 3000 of them are related to the candidate semantic item-word pair (i.e. contain "fan" or "electric fan"), and 2000 of those 3000 contain "electric fan". Then N1 is the sum of the users' clicks on the 2000 commodity titles containing "electric fan". N2 is the total click volume of all text sequences corresponding to search terms including the candidate semantic item: among the text sequences found with all such search terms, at least one of which contains the candidate word of the pair, N2 is the sum of the users' clicks on all of these text sequences. In this example, at least one of the 5000 commodity titles found and clicked with the W queries should contain "electric fan". Suppose N1 is 916978 and N2 is 1339534; then the ratio of N1 to N2 is 916978/1339534 = 0.68.
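The mode B1 decision reduces to a single ratio test, sketched below with the N1 and N2 values from the example. The function names are hypothetical and the threshold value 0.5 is an arbitrary placeholder, since the text does not give a value for threshold_a.

```python
def overall_usage_ratio(n1, n2):
    """Ratio of clicks recalled through the semantic item (N1) to all clicks (N2)."""
    if n2 <= 0:
        raise ValueError("N2 must be positive")
    return n1 / n2


def pair_is_correct_b1(n1, n2, threshold_a):
    """Mode B1: the pair is correct when N1/N2 exceeds the first threshold."""
    return overall_usage_ratio(n1, n2) > threshold_a


if __name__ == "__main__":
    ratio = overall_usage_ratio(916978, 1339534)
    print(round(ratio, 2))
    print(pair_is_correct_b1(916978, 1339534, threshold_a=0.5))  # 0.5 is a placeholder
```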
Mode B2: among all search terms that include the candidate semantic item of the candidate semantic item-word pair, compute the candidate semantic item usage ratio (denoted query_sem_useRato) of each of the preset number of search terms with the highest usage rate. Then judge whether, among these search terms, at least one satisfies all of the following conditions: its candidate semantic item usage ratio is greater than the second threshold (denoted threshold_b_useRato), its search volume (denoted query_search) is greater than the third threshold (denoted threshold_b_pv), its corresponding text sequence click volume (denoted query_click) is greater than the fourth threshold (denoted threshold_b_click), and its corresponding text sequence click-through rate (denoted ctr) is greater than the fifth threshold (denoted threshold_ctr). If at least one such search term exists, the candidate semantic item-word pair is correct; otherwise it is wrong.
The candidate semantic item usage ratio of each search term is the ratio of N3 to N4, where N3 is, within the total click volume of the text sequences that correspond to the search term and are related to the candidate semantic item-word pair, the sum of the clicks recalled through the candidate semantic item, and N4 is the total text sequence click volume corresponding to the search term.
The specific meanings of N3 and N4 have been explained in detail above and are not repeated here. The preset number in mode B2 may be set to 10. In this example, suppose there are W search terms that include the candidate semantic item "fan". The candidate semantic item usage ratio N3/N4 (i.e. the ratio of N3 to N4) is computed for each of the W search terms, the W ratios are sorted, and the 10 largest are chosen; these are the usage ratios of the 10 search terms with the highest usage rate. When judging whether the candidate semantic item-word pair is correct, if at least one of the 10 search terms satisfies query_search > threshold_b_pv, query_click > threshold_b_click, query_sem_useRato > threshold_b_useRato and ctr > threshold_ctr, the pair "fan-electric fan" is correct; otherwise it is wrong.
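The mode B2 check can be sketched as follows. The record layout and function name are hypothetical; the click volume and click-through rate are taken as caller-supplied fields because the text does not say how they are derived, and all threshold values and ctr figures in the demo are placeholders rather than values from the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueryStats:
    query: str
    n3: int             # clicks recalled through the candidate semantic item
    n4: int             # total clicks for this search term (assumed > 0)
    search_volume: int  # query_search
    clicks: int         # query_click
    ctr: float          # click-through rate, supplied by the caller

    @property
    def usage_ratio(self) -> float:   # query_sem_useRato = N3 / N4
        return self.n3 / self.n4


def pair_is_correct_b2(stats: List[QueryStats], top_n: int, th_use: float,
                       th_pv: int, th_click: int, th_ctr: float) -> bool:
    """True if any of the top_n queries by usage ratio passes all four thresholds."""
    top = sorted(stats, key=lambda s: s.usage_ratio, reverse=True)[:top_n]
    return any(
        s.usage_ratio > th_use and s.search_volume > th_pv
        and s.clicks > th_click and s.ctr > th_ctr
        for s in top
    )


if __name__ == "__main__":
    stats = [
        QueryStats("fan", 416734, 494199, 1690445, 494199, 0.30),
        QueryStats("usb fan", 6382, 44415, 113902, 44415, 0.40),
    ]
    print(pair_is_correct_b2(stats, 10, 0.8, 100000, 10000, 0.1))
```

Requiring all four conditions on the same search term keeps a pair from being accepted on the strength of a rarely searched or rarely clicked query alone.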
Mode B3: judge whether the candidate semantic item in the candidate semantic item-word pair is similar to the candidate word. If the candidate semantic item is similar to the candidate word, the pair is correct; otherwise it is wrong.
In mode B3, whether the candidate semantic item in the candidate semantic item-word pair is similar to the candidate word can be judged by either of two modes (mode B3-1 and mode B3-2).
Mode B3-1: obtain the user search data within a preset historical period, which include the search terms and the users' click volume or search volume over the multiple categories of each preset level for the text sequences found. For each preset level, generate the first semantic item vector corresponding to the candidate word from the click volume or search volume of the categories of that level when the search term includes the candidate semantic item (in the embodiment of the present invention the semantic item vector corresponding to a candidate word may also simply be called the candidate semantic item vector, so this vector is also called the first candidate semantic item vector), and generate the first candidate word vector from the click volume or search volume of the categories of that level when the search term includes the candidate word. Compute the vector similarity between the first candidate semantic item vector and the first candidate word vector of each preset level, obtaining multiple first vector similarities, one for each preset level. Judge whether each first vector similarity is greater than the preset sixth threshold corresponding to its level. If so, the candidate semantic item is similar to the candidate word; otherwise they are dissimilar.
Mode B3-1 is described in detail below, taking the candidate semantic item-word pair "fan-electric fan" (where the candidate semantic item is "fan" and the original word of "fan" is "electric fan") and three preset levels as an example. The search click data of a search side (such as the App side) within a preset historical period (such as six months) can be used to judge whether the candidate semantic item in the pair is similar to the candidate word.
When the click volumes of the original word and of the semantic item are both greater than a preset value (which can be set as needed, for example 100), the first candidate semantic item vectors of the three preset levels (the first-level, second-level and third-level first candidate semantic item vectors) are generated from the click volume of the categories of each of the three levels when the search term includes the candidate semantic item, and the first candidate word vectors of the three preset levels (the first-level, second-level and third-level first candidate word vectors) are generated from the click volume of the categories of each of the three levels when the search term includes the candidate word. Suppose the first level has 5 categories in total, i.e. there are 5 first-level categories; denote the first-level first candidate semantic item vector by v_semantic_cid1 and the first-level first candidate word vector by v_word_cid1. When the search term includes "fan", 3 first-level categories are clicked, namely categories (1, 2, 3) with click volumes (100, 50, 50), which forms the 5-dimensional vector (100, 50, 50, 0, 0); after normalization, the first-level first candidate semantic item vector v_semantic_cid1 is (0.5, 0.25, 0.25, 0, 0). When the search term includes "electric fan", 2 first-level categories are clicked, namely categories (3, 4) with click volumes (200, 400), which forms the 5-dimensional vector (0, 0, 200, 400, 0); after normalization, the first-level first candidate word vector v_word_cid1 is (0, 0, 0.333, 0.667, 0). The first candidate semantic item vectors and first candidate word vectors of the other two levels (second level and third level) are generated in the same way: the second-level first candidate semantic item vector is denoted v_semantic_cid2, the third-level first candidate semantic item vector v_semantic_cid3, the second-level first candidate word vector v_word_cid2, and the third-level first candidate word vector v_word_cid3. The cosine similarity or Jaccard similarity is then computed between the first-level first candidate semantic item vector and the first-level first candidate word vector, between the second-level first candidate semantic item vector and the second-level first candidate word vector, and between the third-level first candidate semantic item vector and the third-level first candidate word vector.
Taking the cosine similarity between the first candidate semantic item vector and the first candidate word vector of each preset level as an example:
The cosine similarity between the first-level first candidate semantic item vector and the first-level first candidate word vector (denoted Sim_cid1_cos) is:
Sim_cid1_cos = cosine(v_word_cid1, v_semantic_cid1);
The cosine similarity between the second-level first candidate semantic item vector and the second-level first candidate word vector (denoted Sim_cid2_cos) is:
Sim_cid2_cos = cosine(v_word_cid2, v_semantic_cid2);
The cosine similarity between the third-level first candidate semantic item vector and the third-level first candidate word vector (denoted Sim_cid3_cos) is:
Sim_cid3_cos = cosine(v_word_cid3, v_semantic_cid3).
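The first-level part of this calculation can be reproduced with the example click volumes for "fan" and "electric fan". The sketch below assumes that normalization means dividing by the sum of the click volumes, which matches the example values, and that categories are numbered from 1; the function names are hypothetical.

```python
import math


def category_vector(clicks_by_category, num_categories):
    """Build a sum-normalized vector from {category id (1-based): click volume}."""
    raw = [float(clicks_by_category.get(c, 0)) for c in range(1, num_categories + 1)]
    total = sum(raw)
    return [x / total for x in raw] if total else raw


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_at_all_levels(similarities, thresholds):
    """Mode B3-1 decision: every level must exceed its own (sixth) threshold."""
    return all(s > t for s, t in zip(similarities, thresholds))


if __name__ == "__main__":
    v_semantic_cid1 = category_vector({1: 100, 2: 50, 3: 50}, 5)  # search term includes "fan"
    v_word_cid1 = category_vector({3: 200, 4: 400}, 5)            # includes "electric fan"
    print(v_semantic_cid1, v_word_cid1)
    print(cosine(v_word_cid1, v_semantic_cid1))
```

For these two vectors only category 3 overlaps, so the similarity is low; whether that counts as similar depends on the per-level thresholds, which the text leaves unspecified.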
Taking the calculation of the Jaccard similarity between the first candidate semantic item vector and the first candidate word vector at each preset level as an example, the Jaccard similarity between the level-3 first candidate semantic item vector and the level-3 first candidate word vector (denoted Sim_cid3_jacc) is:
Sim_cid3_jacc = |v_word3 ∩ v_se3| / |v_word3 ∪ v_se3|
where v_word3 is the level-3 first candidate word vector, v_se3 is the level-3 first candidate semantic item vector, |v_word3 ∩ v_se3| is the number of elements in the intersection of the level-3 first candidate word vector and the level-3 first candidate semantic item vector, and |v_word3 ∪ v_se3| is the number of elements in their union. In this example, when the search term includes "electric fan", 3 level-3 categories are clicked and the level-3 first candidate word vector is (1, 2, 3); when the search term includes "fan", 4 level-3 categories are clicked and the level-3 first candidate semantic item vector is (2, 3, 4, 5). The intersection (2, 3) has 2 elements and the union (1, 2, 3, 4, 5) has 5 elements, so according to the above formula Sim_cid3_jacc = 2/5. Similarly, the Jaccard similarity between the level-1 first candidate semantic item vector and the level-1 first candidate word vector (denoted Sim_cid1_jacc) and the Jaccard similarity between the level-2 first candidate semantic item vector and the level-2 first candidate word vector (denoted Sim_cid2_jacc) can be calculated.
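The Jaccard example above, in which the vectors are treated as sets of clicked category numbers, can be checked with a short sketch. The function name `jaccard` and the choice to return 0.0 for two empty sets are assumptions made for this illustration.

```python
def jaccard(categories_a, categories_b):
    """Jaccard similarity of two collections of category numbers."""
    a, b = set(categories_a), set(categories_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


v_word3 = (1, 2, 3)      # level-3 categories clicked for "electric fan"
v_se3 = (2, 3, 4, 5)     # level-3 categories clicked for "fan"

sim_cid3_jacc = jaccard(v_word3, v_se3)
print(sim_cid3_jacc)  # 2 shared categories out of 5 distinct ones -> 0.4
```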
When the click volume of the original word or of the semantic item is less than or equal to the above preset value (the preset value can be set as needed, for example to 100), the first candidate semantic item vector at each preset level can instead be generated from the search volumes of the categories at each preset level when the search term includes the candidate semantic item, and the first candidate word vector at each preset level can be generated from the search volumes of the categories at each preset level when the search term includes the candidate word. The cosine similarity or Jaccard similarity is then calculated, using the formulas described above, between the level-1 first candidate semantic item vector and the level-1 first candidate word vector, between the level-2 first candidate semantic item vector and the level-2 first candidate word vector, and between the level-3 first candidate semantic item vector and the level-3 first candidate word vector. The search volumes of the categories at each preset level when the search term includes the candidate semantic item can be taken from the search results of the search term (query) captured online; specifically, only a set number of those search results may be used, for example 300 of them. For these 300 commodity titles, the search volumes of the categories at each preset level of the corresponding commodities are counted for search terms that include the candidate semantic item and for search terms that include the candidate word; the first candidate semantic item vectors and first candidate word vectors at each preset level are then generated, and the cosine similarity or Jaccard similarity between the first candidate semantic item vector and the first candidate word vector at each level is calculated.
Mode B3-2: a semantic item vector (which may also be called the second candidate semantic item vector) and a second candidate word vector are generated by a preset model; the vector similarity (denoted Sim_em(word, semantic)) between the second candidate semantic item vector (denoted v_word2vec_semantic) and the second candidate word vector (denoted v_word2vec_word) is calculated to obtain a second vector similarity; it is then judged whether the second vector similarity is greater than a seventh threshold: if so, the candidate semantic item is similar to the candidate word; otherwise, the candidate semantic item and the candidate word are not similar.
The preset model can be a word2vec model. Word2vec is a toolkit released as open source by Google in 2013 for obtaining low-dimensional vector representations of words. The word2vec model can be used after it has been trained on the commodity titles stored in the commodity library.
Taking the case where the second vector similarity is a cosine similarity as an example, the second vector similarity is:
Sim_em(word, semantic) = cosine(v_word2vec_word, v_word2vec_semantic)
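Taking the case where the second vector similarity is a Jaccard similarity as an example, the second vector similarity is:
Sim_em(word, semantic) = |v_word2vec_word ∩ v_word2vec_semantic| / |v_word2vec_word ∪ v_word2vec_semantic|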
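The decision rule of mode B3-2 with cosine similarity can be sketched as below. The embedding values are made-up toy vectors rather than output of a trained word2vec model, and the value 0.6 for the seventh threshold is purely hypothetical; the text leaves the threshold to be chosen per use case.

```python
import math

SEVENTH_THRESHOLD = 0.6  # hypothetical value, not taken from the source


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def is_similar_by_embedding(v_word2vec_word, v_word2vec_semantic,
                            threshold=SEVENTH_THRESHOLD):
    """Mode B3-2 check: similar only if Sim_em exceeds the seventh threshold."""
    return cosine(v_word2vec_word, v_word2vec_semantic) > threshold


# Toy embeddings standing in for model output (assumed values).
v_word = [0.9, 0.1, 0.3]
v_semantic = [0.8, 0.2, 0.4]
print(is_similar_by_embedding(v_word, v_semantic))
```

In practice the two vectors would come from the model trained on commodity titles, as described for the preset model.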
It should be noted that, in the embodiment of the present invention, any one of mode B1, mode B2 and mode B3 can be used to judge whether a candidate semantic item-word pair determined by one or a combination of modes A1 to A3 is correct; within mode B3, either of the two modes B3-1 and B3-2 can be used to judge whether the candidate semantic item of the candidate semantic item-word pair is similar to the candidate word.
Each threshold used in modes B1 to B3 of the embodiment of the present invention (i.e. the above first, second, third, fourth, fifth, sixth and seventh thresholds) can be set to a different specific value according to whether the candidate semantic item and the candidate word of the candidate semantic item-word pair are each a product word or a non-product word. Depending on whether the candidate semantic item and the candidate word (i.e. the original word of the candidate semantic item) are product words or non-product words, there are the following four cases:
The candidate semantic item is a product word and the original word is also a product word (denoted p-p);
The candidate semantic item is a product word and the original word is a non-product word (denoted p-np);
The candidate semantic item is a non-product word and the original word is a product word (denoted np-p);
The candidate semantic item is a non-product word and the original word is a non-product word (denoted np-np).
Because the semantic relations between product words (for example, "durian" and "Crispy Durian Cake") differ from the semantic relations between non-product words (for example, "winter" and "in winter"), and because users search for them differently (a product word can usually be used as a search term on its own, whereas a modifier is usually not searched alone but combined with a product word to form a search term), any one of the above three modes (modes B1 to B3) can be applied separately to each of the four cases to judge whether a candidate semantic item-word pair is correct. The thresholds are set to different values in each of the four cases, so as to improve the filtering of commodities.
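One way to organize the case-dependent thresholds is a lookup keyed by the four cases. The sketch below is only an illustration: the helper `pair_case`, the table name and every numeric value are hypothetical, and a real system would hold one such table per threshold (first through seventh).

```python
def pair_case(semantic_item, original_word, product_words):
    """Classify a candidate pair as 'p-p', 'p-np', 'np-p' or 'np-np'."""
    left = "p" if semantic_item in product_words else "np"
    right = "p" if original_word in product_words else "np"
    return f"{left}-{right}"


# Hypothetical values of the first threshold for each case.
FIRST_THRESHOLD_BY_CASE = {
    "p-p": 0.30,
    "p-np": 0.40,
    "np-p": 0.50,
    "np-np": 0.60,
}

product_words = {"fan", "electric fan"}  # stand-in for the stored product vocabulary
case = pair_case("fan", "electric fan", product_words)
print(case, FIRST_THRESHOLD_BY_CASE[case])
```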
In addition, product words and non-product words can be identified by looking them up in a stored product vocabulary. Filtering out irrelevant commodities on the basis of product words is usually quite effective, so in order to improve the product vocabulary and identify product words accurately, some product words can be mined from the non-product words (the np-np case). The specific procedure is as follows. Collect every single character that appears as a candidate semantic item, together with the original words that end with that character, for example: {bean: {kidney bean, fiber crops beans, grind beans, Jin Dou, kidney bean, cowpea, red bean, she beans, foolish beans, decoct beans, kidney bean, green soy bean, cruel beans, magic beans, dry beans, sauce beans, Mi Dou, color beans, lima bean, Hua Dou, hyacinth bean, sweet beans, good fortune beans, bamboo shoot beans, molten beans, choose beans, milk beans, tonka bean, waist beans, baked beans}}; {machine: {replicating machine, cigarette machine, seal carving machine, walking machine, rotary tillage machine, guillotine, adhesive tape machine, deposit machine, beveler, hatching egg machine, pit-picker, ruddy complexion machine, closer, fragrant scent machine, belt sander, umbrella bag machine, die head threading machine, courtyard machine, cold spraying machine, trimmer, broken rouge machine, doll machine, football machine, sander, seeder, shield machine, the chief minister's seal machine, internal combustion engine, wax therapy machine, blank drawing machine}}. Then count the number and proportion of product words among the original words ending with a given character. The original words are divided into two cases: words of 2 characters, and words of 3 or more characters. For words of 2 characters, the original words that are not yet recognized as product words can be looked up and verified manually, and the character is added to a whitelist once they are verified to be product words. For words of 3 or more characters, a condition can be set, for example number of product words >= 10 and proportion of product words >= 25%, and the characters that meet the condition are added to the whitelist. The words selected from the np-np case that end with a character in the whitelist are the newly identified product words.
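The whitelist condition for original words of 3 or more characters can be sketched as follows. The function name, the data layout (a mapping from ending character to the original words that end with it) and the placeholder word identifiers are assumptions for illustration; only the example condition (count >= 10 and proportion >= 25%) comes from the text.

```python
def whitelist_endings(words_by_ending, product_words, min_count=10, min_ratio=0.25):
    """Return the ending characters whose words contain enough known product words."""
    whitelist = set()
    for ending, words in words_by_ending.items():
        if not words:
            continue
        count = sum(1 for w in words if w in product_words)
        ratio = count / len(words)
        if count >= min_count and ratio >= min_ratio:
            whitelist.add(ending)
    return whitelist


# Placeholder data: 40 words per ending, identified only by made-up ids.
words_by_ending = {
    "machine": [f"m{i}" for i in range(40)],
    "bean": [f"b{i}" for i in range(40)],
}
# 12 of the "machine" words and 5 of the "bean" words are known product words.
product_words = {f"m{i}" for i in range(12)} | {f"b{i}" for i in range(5)}

whitelist = whitelist_endings(words_by_ending, product_words)
print(whitelist)
```

Words from the np-np case that end with a whitelisted character would then be treated as newly identified product words.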
Step S104: correct the first index according to the result of the judgment to obtain a second index.
Specifically, the step of correcting the first index according to the result of the judgment may include: when the candidate semantic item-word pair is correct, if the mapping relation between the candidate semantic item and the candidate word of the pair does not exist in the first index, adding the mapping relation to the first index; when the candidate semantic item-word pair is wrong, if the mapping relation between the candidate semantic item and the candidate word of the pair exists in the first index, deleting the mapping relation from the first index. In this way, the unsuitable mapping relations between some semantic items and words in the first index, which was established from the semantic item-word pairs obtained in step S101, are deleted.
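The add-or-delete rule of step S104 can be sketched with a simple in-memory structure. Representing the index as a dictionary from word to a set of semantic items is an assumption made for this illustration, as are the function name and the sample entries; the source does not specify the index layout.

```python
def correct_index(first_index, judged_pairs):
    """Return a second index built from the first index and judged pairs.

    first_index: dict mapping word -> set of semantic items.
    judged_pairs: iterable of (semantic_item, word, is_correct).
    """
    second_index = {word: set(items) for word, items in first_index.items()}
    for semantic_item, word, is_correct in judged_pairs:
        if is_correct:
            # Add the mapping if it is not already present.
            second_index.setdefault(word, set()).add(semantic_item)
        elif word in second_index:
            # Remove a mapping judged to be wrong, if it exists.
            second_index[word].discard(semantic_item)
    return second_index


first_index = {"electric fan": {"fan", "electric"}}
judged = [
    ("electric", "electric fan", False),   # existing mapping judged wrong
    ("cabinet", "bathroom cabinet", True), # new word with a correct mapping
]
second_index = correct_index(first_index, judged)
print(second_index)
```

The first index is copied rather than modified in place, so both versions remain available for comparison.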
According to the embodiment of the present invention, mode A1 uses the query-commodity title data in users' historical search feedback data to determine the mapping relations of some semantic items and words and to establish the corresponding candidate semantic item-word pairs. The mapping relation of a candidate semantic item-word pair determined by mode A1 is a mapping relation of a semantic item-word pair that already exists in the first index. Any one of mode B1, mode B2 and mode B3 (mode B3-1 or mode B3-2) can be used to judge whether the mapping relation of a candidate semantic item-word pair determined by mode A1 is correct: if the mapping relation is wrong, the mapping relation of the corresponding semantic item-word pair is deleted from the first index; if the mapping relation is correct, it is retained in the first index. Mode A2 looks up, in the acquisition tracks that contain a new word, the search terms that have an inclusion relation with the new word, and determines candidate semantic item-word pairs from the new word and the search terms that have an inclusion relation with it. The word of a candidate semantic item-word pair determined by mode A2 is a new word, and no mapping relation between the new word and the corresponding semantic item exists in the first index. Any one of mode B1, mode B2 and mode B3 (mode B3-1 or mode B3-2) can be used to judge whether the mapping relation of a candidate semantic item-word pair determined by mode A2 is correct: if the mapping relation is correct, the mapping relation of the candidate semantic item-word pair is added to the first index; if the mapping relation is wrong, it is not added to the first index. Mode A3 determines candidate semantic item-word pairs from a new word and those search terms, among the search terms corresponding to particular text sequences, that are substrings of the new word, where a particular text sequence is a user-clicked text sequence that includes the new word. The word of a candidate semantic item-word pair determined by mode A3 is likewise a new word, and no mapping relation between the new word and the corresponding semantic item exists in the first index. Any one of mode B1, mode B2 and mode B3 (mode B3-1 or mode B3-2) can be used to judge whether the mapping relation of a candidate semantic item-word pair determined by mode A3 is correct: if the mapping relation is correct, the mapping relation of the candidate semantic item-word pair is added to the first index; if the mapping relation is wrong, it is not added to the first index. When any one of mode B1, mode B2 and mode B3 (mode B3-1 or mode B3-2) is used to judge whether the mapping relations of candidate semantic item-word pairs determined by mode A2 and/or mode A3 are correct, the specific values of the second, third, fourth, fifth, sixth and seventh thresholds can be set according to the actual confidence requirement for the semantic items of new words.
On the one hand, the embodiment of the present invention keeps commonly collocated words intact, so that a word can be given its own index entry directly without considering how users search, for example "one-piece dress", "television set" and "non-slip mat". By establishing the semantic item-word pairs (one-piece/skirt - one-piece dress), (TV - television set) and (anti-slip/mat - non-slip mat), when a user searches for "long one-piece skirt", "Xiaomi TV" or "bathroom anti-slip floor mat", commodities whose names are written as "one-piece dress surplus", "Xiaomi smart cloud television set", "non-slip mat bathroom supplies" and so on can still be recalled. On the other hand, adding the mapping relations between new words and their corresponding semantic items to the first index can improve the accuracy of the recall stage. For example, for new words such as "chicken-wing wood", "bathroom cabinet" and "mobile phone protective case", adding the corresponding semantic items avoids having to split the words very finely at the later word segmentation stage (i.e. cutting the words into very small pieces) in order to satisfy users' search needs, and thus avoids recalling irrelevant commodities. For example, if the semantic item "wood" is added for "chicken-wing wood", a user who searches for "wood chopsticks" can recall commodities whose titles contain "chicken-wing wood". If "chicken-wing wood" did not have the semantic item "wood", it would have to be segmented into "chicken wings" and "wood" at the word segmentation stage to ensure that commodities whose titles contain "chicken-wing wood" are recalled when the user searches for "wood chopsticks"; but this would cause a search for "chicken wings" to recall irrelevant commodities such as chopsticks, desks and tea trays made of "chicken-wing wood". Likewise, if "denim blue" did not have the corresponding semantic item "blue", it would have to be segmented into "denim" and "blue" at the word segmentation stage, and this segmentation would cause a search for "denim jacket" to recall jackets that are denim blue in color but not made of denim. Semantic items are a compromise in the index between segmentation by character and segmentation by word; they play an important role as the number of commodities in the index grows rapidly, and they act as a bridge between users' diverse search terms and merchants' standardized product names. By periodically correcting the mapping relations between semantic items and words in the first index, and by adding mapping relations between semantic items and the new words that continue to emerge, the embodiment of the present invention improves the accuracy of the recalled commodities while maintaining the commodity recall rate.
It should be noted that the embodiment of the present invention can periodically (with a first time period) delete wrong semantic item-word mapping relations from the first index, and periodically (with a second time period) add mapping relations between new words and semantic items to the first index, where the first time period and the second time period may be the same or different. For example, every three months candidate semantic item-word pairs can be determined by mode A1 and, when a determined candidate semantic item-word pair is judged to be wrong, the wrong semantic item-word mapping relation is deleted; and every six months the correct mapping relations among the mapping relations between new words and candidate semantic items are added to the first index. Alternatively, every three months the wrong semantic item-word mapping relations are deleted from the first index and, at the same time, the correct semantic item-word mapping relations of new words are added to the first index. The embodiment of the present invention uses users' historical search data and, through users' continuous feedback, finds the unsuitable semantic item-word mapping relations in the first index, periodically deletes them, and periodically adds mapping relations between new words and their semantic items to the first index. Through such continuous iterative updating, the first index is regularly corrected to obtain the second index, which adapts well to users' search habits and ensures the correctness and timeliness of the index.
Fig. 2 is a schematic diagram of the main modules of the device for establishing an index according to an embodiment of the present invention.
As shown in Fig. 2, the device 200 for establishing an index of the embodiment of the present invention mainly includes: a word pair generation module 201, an index establishing module 202, a word pair judging module 203 and an index correction module 204.
The word pair generation module 201 is used to divide an input text sequence into multiple words and to obtain multiple semantic item-word pairs from the multiple words and the substrings of the multiple words.
The index establishing module 202 is used to establish a first index from the multiple semantic item-word pairs.
The word pair judging module 203 is used to determine candidate semantic item-word pairs from users' historical search data and to judge whether each candidate semantic item-word pair is correct.
The index correction module 204 is used to correct the first index according to the result of the judgment to obtain a second index.
The word pair judging module 203 and the index correction module 204 can be implemented as one independent module, namely a correction module. The correction module is mainly used to determine, from the historical search data, the mapping relation between a candidate word and the semantic item corresponding to the candidate word, and to correct the first index according to that mapping relation to obtain the second index. Here, a "semantic item-word pair" denotes the mapping relation between a word and the semantic item corresponding to the word, and a "candidate semantic item-word pair" denotes the mapping relation between a candidate word and the semantic item corresponding to the candidate word. The correction module may include the word pair judging module 203 and the index correction module 204.
According to one embodiment of the present invention, the historical search data may include search terms and the corresponding text sequences clicked by users, and the word pair judging module 203 may include a first determining module for: segmenting the search terms and the text sequences clicked by users to obtain search term segments and corresponding text sequence segments; and determining candidate semantic item-word pairs from those search term segments and text sequence segments, among the search term segments and corresponding text sequence segments, that have an inclusion relation.
According to another embodiment of the present invention, the historical search data may include search terms and the corresponding text sequences clicked by users, as well as acquisition tracks formed by the search terms, and the word pair judging module 203 may include a second determining module for: obtaining the new words to be added to the first index; looking up, in the acquisition tracks that contain a new word, the search terms that have an inclusion relation with the new word; and determining candidate semantic item-word pairs from the new word and the search terms that have an inclusion relation with it.
According to yet another embodiment of the present invention, the historical search data may include search terms and the corresponding text sequences clicked by users, and the word pair judging module 203 may include a third determining module for: obtaining the new words to be added to the first index; and determining candidate semantic item-word pairs from a new word and those search terms, among the search terms corresponding to particular text sequences, that are substrings of the new word, where a particular text sequence is a user-clicked text sequence that includes the new word.
According to yet another embodiment of the present invention, the word pair judging module 203 may include any one or more of the above first determining module, second determining module and third determining module.
According to one embodiment of the present invention, the word pair judging module 203 may include a first judging module for: judging whether the overall usage ratio of the candidate semantic item of the candidate semantic item-word pair exceeds a first threshold; if so, the candidate semantic item-word pair is correct; otherwise, the candidate semantic item-word pair is wrong. The overall usage ratio of the candidate semantic item is the ratio of N1 to N2, where N1 is the sum of the click volumes recalled through the candidate semantic item within the total click volume of all text sequences that correspond to search terms including the candidate semantic item and are relevant to the candidate semantic item-word pair, and N2 is the total click volume of all text sequences corresponding to search terms that include the candidate semantic item.
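The first judging module's ratio test can be sketched as below. The record layout (a click count plus two boolean flags per clicked text sequence), the function names and the sample numbers are assumptions made for this illustration; how relevance and "recalled through the semantic item" are determined is outside the sketch.

```python
def overall_usage_ratio(records):
    """N1 / N2 over click records of search terms that include the semantic item.

    Each record is a dict with keys:
      clicks                     click volume of one text sequence
      relevant_to_pair           text sequence is relevant to the candidate pair
      recalled_by_semantic_item  the click was recalled through the semantic item
    """
    n1 = sum(r["clicks"] for r in records
             if r["relevant_to_pair"] and r["recalled_by_semantic_item"])
    n2 = sum(r["clicks"] for r in records)
    return n1 / n2 if n2 else 0.0


def pair_is_correct(records, first_threshold):
    """The pair is judged correct only if the ratio exceeds the first threshold."""
    return overall_usage_ratio(records) > first_threshold


records = [
    {"clicks": 60, "relevant_to_pair": True, "recalled_by_semantic_item": True},
    {"clicks": 30, "relevant_to_pair": True, "recalled_by_semantic_item": False},
    {"clicks": 10, "relevant_to_pair": False, "recalled_by_semantic_item": False},
]
print(overall_usage_ratio(records))  # N1 = 60, N2 = 100 -> 0.6
```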
According to another embodiment of the present invention, the word pair judging module 203 may include a second judging module for: calculating the candidate semantic item usage ratio corresponding to each of the preset number of search terms with the highest utilization rate among all search terms that include the candidate semantic item of the candidate semantic item-word pair; and judging whether, among the preset number of search terms with the highest utilization rate, there is at least one search term that satisfies the following conditions: the candidate semantic item usage ratio corresponding to the search term is greater than a second threshold, the search volume of the search term is greater than a third threshold, the text sequence click volume corresponding to the search term is greater than a fourth threshold, and the text sequence click rate corresponding to the search term is greater than a fifth threshold. If there is at least one such search term, the candidate semantic item-word pair is correct; otherwise, the candidate semantic item-word pair is wrong. The candidate semantic item usage ratio corresponding to each search term is the ratio of N3 to N4, where N3 is the sum of the click volumes recalled through the candidate semantic item within the total click volume of the text sequences that correspond to the search term and are relevant to the candidate semantic item-word pair, and N4 is the total click volume of the text sequences corresponding to the search term.
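The four-condition test of the second judging module can be sketched as follows, assuming the preset number of top search terms has already been selected. The class `TermStats`, its field names and all numeric values are hypothetical; the thresholds are passed in because the text leaves their values open.

```python
from dataclasses import dataclass


@dataclass
class TermStats:
    """Statistics for one search term that includes the candidate semantic item."""
    usage_ratio: float    # N3 / N4 for this search term
    search_volume: int
    click_volume: int
    click_rate: float


def pair_is_correct(top_terms, second, third, fourth, fifth):
    """Correct if at least one top term exceeds all four thresholds at once."""
    return any(
        t.usage_ratio > second
        and t.search_volume > third
        and t.click_volume > fourth
        and t.click_rate > fifth
        for t in top_terms
    )


top_terms = [
    TermStats(usage_ratio=0.2, search_volume=5000, click_volume=900, click_rate=0.18),
    TermStats(usage_ratio=0.7, search_volume=1200, click_volume=300, click_rate=0.25),
]
print(pair_is_correct(top_terms, second=0.5, third=1000, fourth=200, fifth=0.1))
```

Note that the conditions must hold for the same search term; exceeding each threshold on different terms is not sufficient.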
According to yet another embodiment of the present invention, the word pair judging module 203 may include a third judging module for: judging whether the candidate semantic item of the candidate semantic item-word pair is similar to the candidate word; if the candidate semantic item is similar to the candidate word, the candidate semantic item-word pair is correct; otherwise, the candidate semantic item-word pair is wrong.
The third judging module may include a first similarity judging module for: obtaining user search data within a preset historical period, the user search data including search terms and the users' click volumes or search volumes, for the categories at each preset level, of the text sequences found by the search; generating the first candidate semantic item vector at each preset level from the click volumes or search volumes of the categories at each preset level when the search term includes the candidate semantic item, and generating the first candidate word vector at each preset level from the click volumes or search volumes of the categories at each preset level when the search term includes the candidate word; calculating, for each preset level, the vector similarity between the first candidate semantic item vector and the first candidate word vector, to obtain multiple first vector similarities corresponding to the respective preset levels; and judging whether each first vector similarity is greater than the preset sixth threshold corresponding to its level; if so, the candidate semantic item is similar to the candidate word; otherwise, the candidate semantic item and the candidate word are not similar.
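The final step of the first similarity judging module, comparing each level's similarity with that level's sixth threshold, can be sketched as below. The dictionary layout keyed by level number and all numeric values are assumptions for illustration.

```python
def similar_at_all_levels(first_similarities, sixth_thresholds):
    """True only if every level's similarity exceeds that level's sixth threshold."""
    return all(
        first_similarities[level] > sixth_thresholds[level]
        for level in sixth_thresholds
    )


# Hypothetical per-level similarities and thresholds (levels 1, 2, 3).
first_similarities = {1: 0.82, 2: 0.64, 3: 0.41}
sixth_thresholds = {1: 0.60, 2: 0.50, 3: 0.40}
print(similar_at_all_levels(first_similarities, sixth_thresholds))
```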
Alternatively, the third judging module may include a second similarity judging module for: generating a second candidate semantic item vector and a second candidate word vector by a preset model; calculating the vector similarity between the second candidate semantic item vector and the second candidate word vector to obtain a second vector similarity; and judging whether the second vector similarity is greater than a seventh threshold; if so, the candidate semantic item is similar to the candidate word; otherwise, the candidate semantic item and the candidate word are not similar.
The above vector similarity can be a cosine similarity or a Jaccard similarity.
The index correction module 204 can, when the candidate semantic item-word pair is correct, add the mapping relation between the candidate semantic item and the candidate word of the pair to the first index if that mapping relation does not exist in the first index; and, when the candidate semantic item-word pair is wrong, delete the mapping relation between the candidate semantic item and the candidate word of the pair from the first index if that mapping relation exists in the first index.
In addition, the specific implementation of the device for establishing an index in the embodiment of the present invention has already been described in detail in the method for establishing an index described above, so the repeated content is not described again here.
Fig. 3 shows an exemplary system architecture 300 to which the method for establishing an index or the device for establishing an index of the embodiment of the present invention can be applied.
As shown in Fig. 3, the system architecture 300 may include terminal devices 301, 302 and 303, a network 304 and a server 305. The network 304 is the medium that provides communication links between the terminal devices 301, 302, 303 and the server 305. The network 304 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
A user can use the terminal devices 301, 302, 303 to interact with the server 305 through the network 304, to receive or send messages and so on. Various communication client applications can be installed on the terminal devices 301, 302, 303, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients and social platform software.
The terminal devices 301, 302, 303 can be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, laptop computers and desktop computers.
The server 305 can be a server that provides various services, for example a back-end management server that supports the shopping websites browsed by users with the terminal devices 301, 302, 303. The back-end management server can analyze and otherwise process received data such as product information query requests, and feed the processing result (for example, product information) back to the terminal device.
It should be noted that the method for establishing an index provided by the embodiment of the present invention is generally executed by the server 305; correspondingly, the device for establishing an index is generally arranged in the server 305.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 3 are only illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
Referring now to Fig. 4, it shows a schematic structural diagram of a computer system 400 suitable for implementing the server of the embodiment of the present application. The server shown in Fig. 4 is only an example and should not impose any limitation on the functions or scope of use of the embodiment of the present application.
As shown in Fig. 4, the computer system 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage portion 408 into a random access memory (RAM) 403. The RAM 403 also stores the various programs and data required for the operation of the system 400. The CPU 401, the ROM 402 and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input portion 406 including a keyboard, a mouse and the like; an output portion 407 including a cathode ray tube (CRT), a liquid crystal display (LCD) and the like, as well as a speaker and the like; a storage portion 408 including a hard disk and the like; and a communication portion 409 including a network interface card such as a LAN card or a modem. The communication portion 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disc, a magneto-optical disk or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage portion 408 as needed.
In particular, according to the disclosed embodiment of the present invention, the process described above with reference to the schematic diagram of the main steps can be implemented as a computer software program. For example, the disclosed embodiment of the present invention includes a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for executing the method shown in the schematic diagram of the main steps. In such an embodiment, the computer program can be downloaded and installed from a network through the communication portion 409, and/or installed from the removable medium 411. When the computer program is executed by the central processing unit (CPU) 401, the above functions defined in the system of the present application are performed.
It should be noted that the computer-readable medium described in the present invention may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present application, a computer-readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in connection with an instruction execution system, apparatus, or device. Also in the present application, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and that medium can send, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, and the like, or any suitable combination of the above.
The schematic diagrams of the main steps and the block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each box in a schematic diagram of the main steps or in a block diagram may represent a module, a program segment, or a portion of code, and that module, program segment, or portion of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in a block diagram or schematic diagram of the main steps, and any combination of such boxes, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, this may be described as: a processor including a word-pair generation module 201, an index establishing module 202, a word-pair judgment module 203, and an index correction module 204. The names of these modules do not, under certain circumstances, limit the modules themselves. For example, the word-pair generation module 201 may also be described as "a module for dividing an input text sequence into multiple words and obtaining multiple semantic item-word pairs according to the multiple words and the substrings of the multiple words".
In another aspect, the present invention also provides a computer-readable medium, which may be included in the device described in the above embodiments, or may exist separately without being assembled into the device. The computer-readable medium carries one or more programs which, when executed by the device, cause the device to: divide an input text sequence into multiple words, and obtain multiple semantic item-word pairs according to the multiple words and the substrings of the multiple words; establish a first index according to the multiple semantic item-word pairs; determine candidate semantic item-word pairs according to the historical search data of users, and judge whether each candidate semantic item-word pair is correct; and correct the first index according to the result of the judgment to obtain a second index.
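As a rough illustration of the first two steps listed above, the following Python sketch generates semantic item-word pairs from a text sequence and builds a first index from them. It is only a sketch under stated assumptions: the whitespace segmenter, the choice to treat every contiguous substring of at least min_len characters as a semantic item, the dictionary-of-sets index layout, and all function names (segment, substrings, make_pairs, build_first_index) are hypothetical and are not taken from the embodiments.

```python
from collections import defaultdict


def segment(text):
    """Stand-in segmenter (assumption): split on whitespace.

    A real system would use a proper word segmenter for the target language.
    """
    return text.split()


def substrings(word, min_len=1):
    """Return all contiguous substrings of `word` with length >= min_len."""
    n = len(word)
    return {word[i:j] for i in range(n) for j in range(i + min_len, n + 1)}


def make_pairs(text, min_len=2):
    """Build (semantic_item, word) pairs from each word and its substrings.

    Assumption: every substring of at least `min_len` characters, plus the
    word itself, is treated as a semantic item of that word.
    """
    pairs = set()
    for word in segment(text):
        for item in substrings(word, min_len) | {word}:
            pairs.add((item, word))
    return pairs


def build_first_index(pairs):
    """First index (assumed layout): semantic item -> set of words."""
    index = defaultdict(set)
    for item, word in pairs:
        index[item].add(word)
    return dict(index)


pairs = make_pairs("red dress", min_len=3)
first_index = build_first_index(pairs)
```

With this layout, a query for the semantic item "ess" would reach the word "dress" through first_index, which is the kind of broad recall that the later correction step is meant to prune.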
According to the technical solution of the embodiments of the present invention, an input text sequence is divided into multiple words; multiple semantic item-word pairs are obtained according to the multiple words and the substrings of the multiple words; a first index is established according to the multiple semantic item-word pairs; candidate semantic item-word pairs are determined according to the historical search data of users and are judged for correctness; and the first index is corrected according to the result of the judgment to obtain a second index. In the recall stage of a commodity index, this both maintains the recall rate, so that as many commodities relevant to the user's search term as possible are recalled, and improves accuracy, so that as many irrelevant commodities as possible are filtered out.
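The correction step summarized above can be sketched as follows. Judged candidate pairs are applied to a copy of the first index: a pair judged correct is added if it is absent, and a pair judged wrong is deleted if it is present, which follows the add/delete rule recited in claim 11. The dictionary-of-sets index layout, the (item, word, is_correct) tuple format, and the name revise_index are assumptions made for illustration; how correctness is judged is left to the caller.

```python
def revise_index(first_index, judged_pairs):
    """Return a second index obtained by correcting a copy of the first index.

    first_index  : dict mapping semantic item -> set of words (assumed layout)
    judged_pairs : iterable of (item, word, is_correct) tuples (assumed format)
    """
    # Copy so that the first index itself is left untouched.
    second = {item: set(words) for item, words in first_index.items()}
    for item, word, is_correct in judged_pairs:
        if is_correct:
            # Correct pair: add it if the index does not contain it yet.
            second.setdefault(item, set()).add(word)
        else:
            # Wrong pair: delete it if the index contains it.
            words = second.get(item)
            if words is not None and word in words:
                words.discard(word)
                if not words:
                    del second[item]
    return second


first = {"dress": {"dress"}, "ress": {"dress"}}
second = revise_index(first, [("ress", "dress", False), ("skirt", "dress", True)])
```

Working on a copy keeps the first index available for comparison or rollback; an implementation that corrects the index in place would be equally consistent with the description.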
The specific embodiments described above do not limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may occur depending on design requirements and other factors. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (24)

1. A method for establishing an index, characterized by comprising:
dividing an input text sequence into multiple words, and obtaining, according to the multiple words and the substrings of the multiple words, mapping relations between the multiple words and the semantic items corresponding to the words;
establishing a first index according to the mapping relations between the multiple words and the semantic items corresponding to the words;
determining, from historical search data, a mapping relation between a candidate word and the semantic item corresponding to the candidate word, and correcting the first index according to the mapping relation between the candidate word and its corresponding semantic item to obtain a second index.
2. The method according to claim 1, characterized in that the historical search data includes search terms and the corresponding text sequences clicked by users,
and the step of determining, from historical search data, the mapping relation between a candidate word and its corresponding semantic item comprises:
segmenting the search terms and the text sequences clicked by the users to obtain search-term tokens and corresponding text-sequence tokens;
determining the mapping relation between the candidate word and its corresponding semantic item according to those search-term tokens and text-sequence tokens, among the search-term tokens and the corresponding text-sequence tokens, that have an inclusion relation.
3. The method according to claim 1, characterized in that the historical search data includes search terms, the corresponding text sequences clicked by users, and acquisition tracks formed by the search terms,
and the step of determining, from historical search data, the mapping relation between a candidate word and its corresponding semantic item comprises:
obtaining a new word that has been added to the first index in advance;
searching the acquisition tracks that contain the new word for search terms that have an inclusion relation with the new word, and determining the mapping relation between the candidate word and its corresponding semantic item according to the new word and the search terms that have an inclusion relation with the new word.
4. The method according to claim 1, characterized in that the historical search data includes search terms and the corresponding text sequences clicked by users,
and the step of determining, from historical search data, the mapping relation between a candidate word and its corresponding semantic item comprises:
obtaining a new word that has been added to the first index in advance;
determining the mapping relation between the candidate word and its corresponding semantic item according to the new word and those search terms, among the search terms corresponding to specific text sequences, that are substrings of the new word, wherein the specific text sequences are text sequences clicked by users that include the new word.
5. The method according to claim 1, characterized in that, after the step of determining, from historical search data, the mapping relation between a candidate word and its corresponding semantic item, the method comprises:
judging whether the mapping relation between the candidate word and its corresponding semantic item is correct, including:
judging whether the overall usage ratio of the semantic item corresponding to the candidate word in the mapping relation exceeds a first threshold; if so, the mapping relation between the candidate word and its corresponding semantic item is correct; otherwise, the mapping relation is wrong,
wherein the overall usage ratio of the semantic item corresponding to the candidate word is the ratio of N1 to N2, where N1 is, within the total click volume of the text sequences that correspond to all search terms including the semantic item corresponding to the candidate word and that are related to the mapping relation between the candidate word and its corresponding semantic item, the sum of the click volume recalled through the semantic item corresponding to the candidate word, and N2 is the total click volume of the text sequences corresponding to all search terms including the semantic item corresponding to the candidate word.
6. The method according to claim 1, characterized in that, after the step of determining, from historical search data, the mapping relation between a candidate word and its corresponding semantic item, the method comprises:
judging whether the mapping relation between the candidate word and its corresponding semantic item is correct, including:
calculating, among all search terms that include the semantic item corresponding to the candidate word in the mapping relation, the usage ratio of the semantic item corresponding to the candidate word for each of a preset number of search terms with the highest usage rates;
judging whether, among the preset number of search terms with the highest usage rates, at least one search term satisfies the following conditions: the usage ratio of the semantic item corresponding to the candidate word for the search term is greater than a second threshold, the search volume of the search term is greater than a third threshold, the click volume of the text sequences corresponding to the search term is greater than a fourth threshold, and the click-through rate of the text sequences corresponding to the search term is greater than a fifth threshold;
if at least one such search term exists, the mapping relation between the candidate word and its corresponding semantic item is correct; otherwise, the mapping relation is wrong,
wherein the usage ratio of the semantic item corresponding to the candidate word for each search term is the ratio of N3 to N4, where N3 is, within the total click volume of the text sequences that correspond to the search term and are related to the mapping relation between the candidate word and its corresponding semantic item, the sum of the click volume recalled through the semantic item corresponding to the candidate word, and N4 is the total click volume of the text sequences corresponding to the search term.
7. The method according to claim 1, characterized in that, after the step of determining, from historical search data, the mapping relation between a candidate word and its corresponding semantic item, the method comprises:
judging whether the mapping relation between the candidate word and its corresponding semantic item is correct, including:
judging whether the semantic item corresponding to the candidate word in the mapping relation is similar to the candidate word; if the semantic item corresponding to the candidate word is similar to the candidate word, the mapping relation between the candidate word and its corresponding semantic item is correct; otherwise, the mapping relation is wrong.
8. The method according to claim 7, characterized in that the step of judging whether the semantic item corresponding to the candidate word in the mapping relation is similar to the candidate word comprises:
obtaining user search data within a preset historical time period, the user search data including the click volume or search volume, in multiple categories at each preset level, of the search terms and of the text sequences found by users;
generating, for each preset level, a first semantic-item vector for the semantic item corresponding to the candidate word according to the click volume or search volume of the multiple categories at that level when the search term includes the semantic item corresponding to the candidate word, and generating, for each preset level, a first candidate-word vector according to the click volume or search volume of the multiple categories at that level when the search term includes the candidate word;
calculating, for each preset level, the vector similarity between the first semantic-item vector and the first candidate-word vector, to obtain multiple first vector similarities corresponding respectively to the preset levels;
judging whether each first vector similarity is greater than a preset sixth threshold corresponding to the respective preset level; if so, the semantic item corresponding to the candidate word is similar to the candidate word; otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
9. The method according to claim 7, characterized in that the step of judging whether the semantic item corresponding to the candidate word in the mapping relation is similar to the candidate word comprises:
generating, by a preset model, a second semantic-item vector for the semantic item corresponding to the candidate word and a second candidate-word vector;
calculating the vector similarity between the second semantic-item vector and the second candidate-word vector to obtain a second vector similarity;
judging whether the second vector similarity is greater than a seventh threshold; if so, the semantic item corresponding to the candidate word is similar to the candidate word; otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
10. The method according to claim 8 or 9, characterized in that the vector similarity is cosine similarity or Jaccard similarity.
11. The method according to any one of claims 5 to 7, characterized in that the step of correcting the first index according to the mapping relation between the candidate word and its corresponding semantic item comprises:
when the mapping relation between the candidate word and its corresponding semantic item is correct:
if the mapping relation between the candidate word and its corresponding semantic item does not exist in the first index, adding the mapping relation to the first index;
when the mapping relation between the candidate word and its corresponding semantic item is wrong:
if the mapping relation between the candidate word and its corresponding semantic item exists in the first index, deleting the mapping relation from the first index.
12. A device for establishing an index, characterized by comprising:
a word-pair generation module, configured to divide an input text sequence into multiple words, and to obtain, according to the multiple words and the substrings of the multiple words, mapping relations between the multiple words and the semantic items corresponding to the words;
an index establishing module, configured to establish a first index according to the mapping relations between the multiple words and the semantic items corresponding to the words;
a correction module, configured to determine, from historical search data, a mapping relation between a candidate word and the semantic item corresponding to the candidate word, and to correct the first index according to the mapping relation between the candidate word and its corresponding semantic item to obtain a second index.
13. The device according to claim 12, characterized in that the historical search data includes search terms and the corresponding text sequences clicked by users,
and the correction module includes a first determining module, configured to:
segment the search terms and the text sequences clicked by the users to obtain search-term tokens and corresponding text-sequence tokens;
determine the mapping relation between the candidate word and its corresponding semantic item according to those search-term tokens and text-sequence tokens, among the search-term tokens and the corresponding text-sequence tokens, that have an inclusion relation.
14. The device according to claim 12, characterized in that the historical search data includes search terms, the corresponding text sequences clicked by users, and acquisition tracks formed by the search terms,
and the correction module includes a second determining module, configured to:
obtain a new word that has been added to the first index in advance;
search the acquisition tracks that contain the new word for search terms that have an inclusion relation with the new word, and determine the mapping relation between the candidate word and its corresponding semantic item according to the new word and the search terms that have an inclusion relation with the new word.
15. The device according to claim 12, characterized in that the historical search data includes search terms and the corresponding text sequences clicked by users,
and the correction module includes a third determining module, configured to:
obtain a new word that has been added to the first index in advance;
determine the mapping relation between the candidate word and its corresponding semantic item according to the new word and those search terms, among the search terms corresponding to specific text sequences, that are substrings of the new word, wherein the specific text sequences are text sequences clicked by users that include the new word.
16. The device according to claim 12, characterized in that the correction module includes a first judgment module, configured to:
judge whether the mapping relation between the candidate word and its corresponding semantic item is correct, including:
judging whether the overall usage ratio of the semantic item corresponding to the candidate word in the mapping relation exceeds a first threshold; if so, the mapping relation between the candidate word and its corresponding semantic item is correct; otherwise, the mapping relation is wrong,
wherein the overall usage ratio of the semantic item corresponding to the candidate word is the ratio of N1 to N2, where N1 is, within the total click volume of the text sequences that correspond to all search terms including the semantic item corresponding to the candidate word and that are related to the mapping relation between the candidate word and its corresponding semantic item, the sum of the click volume recalled through the semantic item corresponding to the candidate word, and N2 is the total click volume of the text sequences corresponding to all search terms including the semantic item corresponding to the candidate word.
17. The device according to claim 12, characterized in that the word-pair judgment module includes a second judgment module, configured to:
judge whether the mapping relation between the candidate word and its corresponding semantic item is correct, including:
calculating, among all search terms that include the semantic item corresponding to the candidate word in the mapping relation, the usage ratio of the semantic item corresponding to the candidate word for each of a preset number of search terms with the highest usage rates;
judging whether, among the preset number of search terms with the highest usage rates, at least one search term satisfies the following conditions: the usage ratio of the semantic item corresponding to the candidate word for the search term is greater than a second threshold, the search volume of the search term is greater than a third threshold, the click volume of the text sequences corresponding to the search term is greater than a fourth threshold, and the click-through rate of the text sequences corresponding to the search term is greater than a fifth threshold;
if at least one such search term exists, the mapping relation between the candidate word and its corresponding semantic item is correct; otherwise, the mapping relation is wrong,
wherein the usage ratio of the semantic item corresponding to the candidate word for each search term is the ratio of N3 to N4, where N3 is, within the total click volume of the text sequences that correspond to the search term and are related to the mapping relation between the candidate word and its corresponding semantic item, the sum of the click volume recalled through the semantic item corresponding to the candidate word, and N4 is the total click volume of the text sequences corresponding to the search term.
18. The device according to claim 12, characterized in that the word-pair judgment module includes a third judgment module, configured to:
judge whether the mapping relation between the candidate word and its corresponding semantic item is correct, including:
judging whether the semantic item corresponding to the candidate word in the mapping relation is similar to the candidate word; if the semantic item corresponding to the candidate word is similar to the candidate word, the mapping relation between the candidate word and its corresponding semantic item is correct; otherwise, the mapping relation is wrong.
19. The device according to claim 18, characterized in that the third judgment module includes a first similarity judgment module, configured to:
obtain user search data within a preset historical time period, the user search data including the click volume or search volume, in multiple categories at each preset level, of the search terms and of the text sequences found by users;
generate, for each preset level, a first semantic-item vector for the semantic item corresponding to the candidate word according to the click volume or search volume of the multiple categories at that level when the search term includes the semantic item corresponding to the candidate word, and generate, for each preset level, a first candidate-word vector according to the click volume or search volume of the multiple categories at that level when the search term includes the candidate word;
calculate, for each preset level, the vector similarity between the first semantic-item vector and the first candidate-word vector, to obtain multiple first vector similarities corresponding respectively to the preset levels;
judge whether each first vector similarity is greater than a preset sixth threshold corresponding to the respective preset level; if so, the semantic item corresponding to the candidate word is similar to the candidate word; otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
20. The device according to claim 18, characterized in that the third judgment module includes a second similarity judgment module, configured to:
generate, by a preset model, a second semantic-item vector for the semantic item corresponding to the candidate word and a second candidate-word vector;
calculate the vector similarity between the second semantic-item vector and the second candidate-word vector to obtain a second vector similarity;
judge whether the second vector similarity is greater than a seventh threshold; if so, the semantic item corresponding to the candidate word is similar to the candidate word; otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
21. The device according to claim 19 or 20, characterized in that the vector similarity is cosine similarity or Jaccard similarity.
22. The device according to any one of claims 16 to 18, characterized in that the correction module is further configured to:
when the mapping relation between the candidate word and its corresponding semantic item is correct:
if the mapping relation between the candidate word and its corresponding semantic item does not exist in the first index, add the mapping relation to the first index;
when the mapping relation between the candidate word and its corresponding semantic item is wrong:
if the mapping relation between the candidate word and its corresponding semantic item exists in the first index, delete the mapping relation from the first index.
23. A server, characterized by comprising:
one or more processors;
a memory for storing one or more programs,
wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method according to any one of claims 1 to 11.
24. A computer-readable medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 11.
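Two of the correctness checks recited in the claims lend themselves to a short worked sketch: the overall usage ratio N1/N2 of claims 5 and 16, and the cosine or Jaccard similarity of claims 10 and 21. The following Python code is illustrative only. The record layout (search term, click count, flag indicating whether the click was recalled through the semantic item), the substring test used to decide that a search term includes the semantic item, the set-based reading of Jaccard similarity over categories with nonzero counts, and all function names are assumptions; the claims do not fix these details.

```python
import math


def overall_usage_ratio(click_log, item):
    """Ratio N1 / N2 for one semantic item.

    click_log: iterable of (search_term, clicks, recalled_via_item) records
               (hypothetical layout).
    N2 sums the clicks of all search terms that include `item`;
    N1 sums only those clicks that were recalled through `item`.
    """
    n1 = 0
    n2 = 0
    for term, clicks, recalled_via_item in click_log:
        if item not in term:  # assumption: "includes" means substring match
            continue
        n2 += clicks
        if recalled_via_item:
            n1 += clicks
    return n1 / n2 if n2 else 0.0


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard_similarity(u, v):
    """Jaccard similarity over the sets of categories with nonzero counts
    (one possible reading, assumed here)."""
    a = {i for i, x in enumerate(u) if x > 0}
    b = {i for i, x in enumerate(v) if x > 0}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


log = [("red dress", 80, True), ("dress", 20, False), ("shoes", 50, True)]
ratio = overall_usage_ratio(log, "dress")
```

A candidate mapping relation would then be accepted when the ratio exceeds the first threshold, or when the similarity at each category level exceeds its sixth threshold; the threshold values themselves are left to the implementer.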
CN201710946410.9A 2017-10-12 2017-10-12 Method and device for establishing index Active CN110019646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710946410.9A CN110019646B (en) 2017-10-12 2017-10-12 Method and device for establishing index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710946410.9A CN110019646B (en) 2017-10-12 2017-10-12 Method and device for establishing index

Publications (2)

Publication Number Publication Date
CN110019646A true CN110019646A (en) 2019-07-16
CN110019646B CN110019646B (en) 2021-10-15

Family

ID=67186485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710946410.9A Active CN110019646B (en) 2017-10-12 2017-10-12 Method and device for establishing index

Country Status (1)

Country Link
CN (1) CN110019646B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187844A1 (en) * 2002-02-11 2003-10-02 Mingjing Li Statistical bigram correlation model for image retrieval
CN101241512A (en) * 2008-03-10 2008-08-13 北京搜狗科技发展有限公司 Search method for redefining enquiry word and device therefor
CN102419778A (en) * 2012-01-09 2012-04-18 中国科学院软件研究所 Information searching method for discovering and clustering sub-topics of query statement
CN102930022A (en) * 2012-10-31 2013-02-13 中国运载火箭技术研究院 User-oriented information search engine system and method
WO2014123631A1 (en) * 2013-02-05 2014-08-14 Visa International Service Association Speaker verification
CN104636334A (en) * 2013-11-06 2015-05-20 阿里巴巴集团控股有限公司 Keyword recommending method and device
CN104933081A (en) * 2014-03-21 2015-09-23 阿里巴巴集团控股有限公司 Search suggestion providing method and apparatus
US20160179807A1 (en) * 2014-12-22 2016-06-23 Prathyusha Senthil Kumar Systems and methods for data mining and automated generation of search query rewrites
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN106776543A (en) * 2016-11-23 2017-05-31 上海智臻智能网络科技股份有限公司 New word discovery method, device, terminal and server
CN107168953A (en) * 2017-05-16 2017-09-15 电子科技大学 The new word discovery method and system that word-based vector is characterized in mass text

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030187844A1 (en) * 2002-02-11 2003-10-02 Mingjing Li Statistical bigram correlation model for image retrieval
CN101241512A (en) * 2008-03-10 2008-08-13 北京搜狗科技发展有限公司 Search method for redefining enquiry word and device therefor
CN102419778A (en) * 2012-01-09 2012-04-18 中国科学院软件研究所 Information searching method for discovering and clustering sub-topics of query statement
CN102930022A (en) * 2012-10-31 2013-02-13 中国运载火箭技术研究院 User-oriented information search engine system and method
WO2014123631A1 (en) * 2013-02-05 2014-08-14 Visa International Service Association Speaker verification
CN104636334A (en) * 2013-11-06 2015-05-20 阿里巴巴集团控股有限公司 Keyword recommending method and device
CN104933081A (en) * 2014-03-21 2015-09-23 阿里巴巴集团控股有限公司 Search suggestion providing method and apparatus
US20160179807A1 (en) * 2014-12-22 2016-06-23 Prathyusha Senthil Kumar Systems and methods for data mining and automated generation of search query rewrites
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN106776543A (en) * 2016-11-23 2017-05-31 上海智臻智能网络科技股份有限公司 New word discovery method, device, terminal and server
CN107168953A (en) * 2017-05-16 2017-09-15 电子科技大学 New word discovery method and system based on word vector representation in massive text

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
CHONGYANG ZHANG ET AL.: "Improving Chinese Word Segmentation by Adopting Self-Organized Maps of Character N-gram", 《CIPS-SIGHAN JOINT CONFERENCE ON CHINESE LANGUAGE PROCESSING》 *
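周育彬: "Research and Implementation of Semantic Fusion for Digital Museums Based on Linked Data", 《China Master's Theses Full-text Database, Information Science and Technology》 *
郑晶晶: "Association Rules Based on Fuzzy Sets and Their Applications", 《China Master's Theses Full-text Database, Economics and Management Sciences》 *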

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489417A (en) * 2019-07-25 2019-11-22 深圳壹账通智能科技有限公司 Data processing method and related device
WO2021012553A1 (en) * 2019-07-25 2021-01-28 深圳壹账通智能科技有限公司 Data processing method and related device
CN110489417B (en) * 2019-07-25 2023-03-28 深圳壹账通智能科技有限公司 Data processing method and related equipment
CN111061830A (en) * 2019-12-27 2020-04-24 深圳市元征科技股份有限公司 Method and device for processing automobile repair data
CN111061830B (en) * 2019-12-27 2023-12-05 深圳市元征科技股份有限公司 Method and device for processing automobile repair data

Also Published As

Publication number Publication date
CN110019646B (en) 2021-10-15

Similar Documents

Publication Publication Date Title
US10186054B2 (en) Automatic image-based recommendations using a color palette
US10242396B2 (en) Automatic color palette based recommendations for affiliated colors
US9679532B2 (en) Automatic image-based recommendations using a color palette
CN102591942B (en) Method and device for automatic application recommendation
US9542704B2 (en) Automatic image-based recommendations using a color palette
US10235389B2 (en) Identifying data from keyword searches of color palettes
US10691744B2 (en) Determining affiliated colors from keyword searches of color palettes
CN103218719B (en) E-commerce website navigation method and system
US20150379006A1 (en) Automatic image-based recommendations using a color palette
US20150379001A1 (en) Automatic color validation of image metadata
CN103314371B (en) Retrieval method and system
CN103488788A (en) Method and device for recommending applications automatically
CN103744849A (en) Method and device for automatically recommending applications
CN102360364A (en) Automatic application recommendation method and device
CN109063000A (en) Question recommendation method, customer service system and computer-readable storage medium
KR20210002410A (en) System, method and program of constructing dataset for training appearance recognition model
CN108288208A (en) Method, apparatus, medium and equipment for determining display objects based on image content
CN107862562A (en) Method and device for recommending products according to a user's picture preferences
Bhattacharya et al. High utility itemset mining
CN110413767A (en) System and method based on spatial term rendering content
CN110019646A (en) A kind of method and apparatus for establishing index
CN108776677A (en) Method, equipment and computer-readable storage medium for creating a parallel sentence library
CN113988978A (en) Garment matching recommendation method, system and equipment based on knowledge graph
CN108960914A (en) Accurate information pushing method, system and equipment
CN104881447A (en) Searching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant