CN110019646B - Method and device for establishing index - Google Patents

Publication number
CN110019646B
Authority
CN
China
Prior art keywords
word
candidate
words
search
candidate word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710946410.9A
Other languages
Chinese (zh)
Other versions
CN110019646A (en)
Inventor
焦晨晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201710946410.9A
Publication of CN110019646A
Application granted
Publication of CN110019646B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for establishing an index, and relates to the technical field of computers. One embodiment of the method comprises: dividing an input text sequence into a plurality of words, and obtaining the mapping relation between the plurality of words and the semantic items corresponding to the words according to the plurality of words and the substrings of the plurality of words; establishing a first index according to the mapping relation between the words and the semantic items corresponding to the words; determining the mapping relation between a candidate word and a semantic item corresponding to the candidate word according to historical search data of a user, and judging whether the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct or not; and correcting the first index according to the judgment result to obtain a second index. According to the method and the system, in the commodity index recall stage, the recall rate can be ensured so as to recall commodities related to the search terms of the user as much as possible, and the accuracy can be improved so as to filter out irrelevant commodities as much as possible.

Description

Method and device for establishing index
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for creating an index.
Background
With the development of network and logistics technology, e-commerce companies have risen rapidly in recent years. The number and variety of commodities on e-commerce platforms are growing exponentially, and a large platform carries billions of commodities, so meeting users' search needs more efficiently and intelligently is a great challenge. Users' search words are varied: the same thing may be referred to by colloquial names, short names and abbreviations, whereas commodity names are usually more standardized and more formal in style, so directly matching the text of the two often causes problems.
In the early stage of e-commerce search, the number of commodities was small, so the index could be built by segmenting text directly into individual characters, which recalled the commodities required by a user to the greatest extent. As the number and variety of commodities gradually grew, this approach, which ignores word order, brought many problems; for example, when a user searches for "red wine", goods such as "drunk red wine" and "wine-red clothes" may also be recalled. A method of segmenting text into words before building the index was therefore proposed. For a fixed collocation such as "dragon fruit", an index chain is built for the whole word, and no index needs to be built for its individual characters ("fire", "dragon" and "fruit"), which greatly improves the accuracy of search recall. However, because segmentation precision is limited and users' search words often cannot be matched directly against commodity names, problems remain. For example, "women's dress" in a commodity name is normally segmented into "women's | dress", but users frequently search for terms such as "woman skirt" or "summer dress woman", and plain text matching has no way to recall the commodity. The indexing end therefore adds finer-grained words to the normal word segmentation result; in the above example, "woman" and "skirt" are added to the index field, so that a commodity named "women's dress" can be recalled when a user searches for "woman skirt", "summer dress woman" and the like. These added finer-grained words are called "semantic items", and they are the bridge between indexing by character and indexing by word.
Although the existing indexing scheme adds some fine-grained words (semantic items) to the word segmentation result at the index end, defects remain in practical search applications. For example, a user who searches for one item may be shown results for a different item whose index entry merely contains a matching semantic item; recalling such goods obviously does not meet the user's expectation.
In the process of implementing the invention, the inventor finds that at least the following problems exist in the prior art:
the current scheme has the problem of inaccurate recalled commodities.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for creating an index, which can ensure a recall rate to recall as many commodities related to a user search term as possible and improve an accuracy rate to filter out irrelevant commodities as much as possible in a commodity index recall stage.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a method of creating an index.
A method of building an index, comprising: dividing an input text sequence into a plurality of words, and obtaining the mapping relation between the plurality of words and the semantic items corresponding to the words according to the plurality of words and the substrings of the plurality of words; establishing a first index according to the mapping relation between the words and the semantic items corresponding to the words; and determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word through historical search data, and correcting the first index according to the mapping relation between the candidate word and the semantic item corresponding to the candidate word to obtain a second index.
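The first two steps can be illustrated with a minimal Python sketch. It is not the patent's implementation: the whitespace segmenter, the `vocabulary` filter that decides which substrings count as semantic items, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict


def proper_substrings(word):
    """Return every contiguous substring of `word` that is shorter than the word."""
    n = len(word)
    return {word[i:j] for i in range(n) for j in range(i + 1, n + 1) if j - i < n}


def semantic_items(word, vocabulary):
    """Keep only substrings that are known terms (hypothetical vocabulary filter)."""
    return proper_substrings(word) & vocabulary


def build_first_index(titles, segment, vocabulary):
    """Map each word and each of its semantic items to the ids of titles containing it."""
    index = defaultdict(set)
    for title_id, title in titles.items():
        for word in segment(title):
            index[word].add(title_id)
            for item in semantic_items(word, vocabulary):
                index[item].add(title_id)
    return index


# Toy data: whitespace segmentation stands in for a real word segmenter.
titles = {1: "red sundress", 2: "dress shirt"}
first_index = build_first_index(titles, str.split, {"sun", "dress"})
print(sorted(first_index["dress"]))  # [1, 2]
```

In this toy index a search for "dress" recalls title 1 through the semantic item of "sundress" as well as title 2 through the word itself, which is the recall gain the first index is meant to provide.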
Optionally, the historical search data includes a search word and a corresponding text sequence clicked by the user, and the step of determining the mapping relationship between the candidate word and the semantic item corresponding to the candidate word through the historical search data includes: segmenting the search words and the text sequence clicked by the user to obtain search word segments and corresponding text sequence segments; and determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word according to the search word segmentation and the text sequence segmentation which have the inclusion relation in the search word segmentation and the corresponding text sequence segmentation.
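A minimal sketch of this mining step follows, assuming the click log is a list of (search word, clicked text sequence) tuples and that "inclusion relation" means the search-word segment is a strict substring of a text-sequence segment. The function name and the whitespace segmenter are hypothetical.

```python
def candidate_pairs(click_log, segment):
    """Collect (candidate semantic item, candidate word) pairs from a click log.

    A pair is produced whenever a segment of the search word is strictly
    contained in a segment of the text sequence the user clicked.
    """
    pairs = set()
    for query, title in click_log:
        for query_segment in segment(query):
            for title_segment in segment(title):
                if query_segment != title_segment and query_segment in title_segment:
                    pairs.add((query_segment, title_segment))
    return pairs


click_log = [("summer dress", "red sundress"), ("shoes", "red sundress")]
print(candidate_pairs(click_log, str.split))  # {('dress', 'sundress')}
```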
Optionally, the historical search data includes a search word, a text sequence clicked by a corresponding user, and a search track formed by the search word, and the step of determining, through the historical search data, a mapping relationship between the candidate word and a semantic item corresponding to the candidate word includes: acquiring new words which are added to the first index in advance; searching for a search word having an inclusion relation with the new word from a search track containing the new word, and determining a mapping relation between the candidate word and a semantic item corresponding to the candidate word according to the new word and the search word having the inclusion relation with the new word.
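The search-track variant might look like the sketch below. It assumes a search track is the ordered list of search words issued in one session, that a track "contains" a new word when the new word appears as one of its search words, and that the shorter string of an included pair plays the role of the semantic item. These are illustrative assumptions, not details given in the text.

```python
def pairs_from_tracks(new_words, tracks):
    """Pair each new word with search words from the same track that contain it
    or are contained in it; the shorter string is treated as the semantic item."""
    pairs = set()
    for track in tracks:
        for new_word in new_words:
            if new_word not in track:
                continue
            for query in track:
                if query != new_word and (query in new_word or new_word in query):
                    shorter, longer = sorted((query, new_word), key=len)
                    pairs.add((shorter, longer))
    return pairs


tracks = [["dress", "sundress", "shoes"], ["shoes", "boots"]]
print(pairs_from_tracks({"sundress"}, tracks))  # {('dress', 'sundress')}
```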
Optionally, the historical search data includes a search word and a corresponding text sequence clicked by the user, and the step of determining the mapping relationship between the candidate word and the semantic item corresponding to the candidate word through the historical search data includes: acquiring new words which are added to the first index in advance; and determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word according to the new word and the search word of the substring of the new word in the search words corresponding to the specific text sequence, wherein the specific text sequence is the text sequence clicked by the user and comprises the new word.
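For the clicked-text variant, a sketch under similar assumptions: the click log is a list of (search word, clicked text sequence) tuples, a "specific text sequence" is any clicked text that contains the new word as a substring, and the function name is hypothetical.

```python
def pairs_from_clicked_texts(new_words, click_log):
    """For clicked text sequences that contain a new word, pair the new word
    with every search word that is a strict substring of it."""
    pairs = set()
    for query, title in click_log:
        for new_word in new_words:
            if new_word in title and query != new_word and query in new_word:
                pairs.add((query, new_word))
    return pairs


log = [("dress", "red sundress"), ("shoes", "red sundress"), ("dress", "dress shirt")]
print(pairs_from_clicked_texts({"sundress"}, log))  # {('dress', 'sundress')}
```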
Optionally, after the step of determining the mapping relationship between the candidate word and the semantic item corresponding to the candidate word through the historical search data, the method includes: judging whether the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises: judging whether the total usage proportion of the semantic item corresponding to the candidate word in the mapping relationship exceeds a first threshold; if so, the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is judged to be correct; otherwise, it is judged to be wrong. The total usage proportion of the semantic item corresponding to the candidate word is the ratio of N1 to N2, where N1 is, over all text sequences that correspond to search words including the semantic item corresponding to the candidate word and that are related to the mapping relationship, the sum of the click amounts recalled through the semantic item corresponding to the candidate word, and N2 is the sum of the click amounts of all text sequences corresponding to the search words that include the semantic item corresponding to the candidate word.
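The N1/N2 ratio can be sketched as follows. The log layout (search word, clicked text sequence, click count) and the rule that a click is "related to the mapping relationship" when the clicked text contains the candidate word are simplifying assumptions; the threshold value is arbitrary.

```python
def total_usage_proportion(item, word, click_log):
    """Return N1 / N2 for the pair (semantic item, candidate word).

    N2: clicks on all texts reached from search words containing `item`.
    N1: the part of those clicks that landed on texts containing `word`,
        i.e. texts assumed to be recalled through this mapping.
    """
    n1 = n2 = 0
    for query, title, clicks in click_log:
        if item not in query:
            continue
        n2 += clicks
        if word in title:
            n1 += clicks
    return n1 / n2 if n2 else 0.0


log = [
    ("dress", "red sundress", 30),
    ("summer dress", "blue sundress", 10),
    ("dress", "dress shirt", 60),
    ("shoes", "red shoes", 500),  # ignored: search word lacks the semantic item
]
FIRST_THRESHOLD = 0.3  # arbitrary illustrative value
ratio = total_usage_proportion("dress", "sundress", log)
print(ratio, ratio > FIRST_THRESHOLD)  # 0.4 True
```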
Optionally, after the step of determining the mapping relationship between the candidate word and the semantic item corresponding to the candidate word through the historical search data, the method includes: judging whether the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises: among all search words that include the semantic item corresponding to the candidate word in the mapping relationship, taking a preset number of search words with the highest usage rate, and calculating, for each of them, the usage proportion of the semantic item corresponding to the candidate word; and judging whether at least one of these search words meets all of the following conditions: the usage proportion of the semantic item corresponding to the candidate word for the search word is larger than a second threshold, the search amount of the search word is larger than a third threshold, the click amount of the text sequences corresponding to the search word is larger than a fourth threshold, and the click rate of the text sequences corresponding to the search word is larger than a fifth threshold. If at least one such search word exists, the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is correct; otherwise, it is wrong. The usage proportion of the semantic item corresponding to the candidate word for a search word is the ratio of N3 to N4, where N3 is, over the text sequences that correspond to the search word and are related to the mapping relationship, the sum of the click amounts recalled through the semantic item corresponding to the candidate word, and N4 is the sum of the click amounts of all text sequences corresponding to the search word.
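A sketch of this per-search-word check is given below. The `QueryStats` record, the reading of "usage rate" as search volume, and the definition of click rate as clicks divided by searches are assumptions introduced for illustration; the threshold values are arbitrary.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    query: str
    searches: int     # search amount of the search word
    clicks: int       # N4: clicks on all texts for this search word
    pair_clicks: int  # N3: clicks recalled through the candidate pair


def pair_is_supported(stats, top_k, t2, t3, t4, t5):
    """True if any of the top_k most-searched words passes all four thresholds."""
    top = sorted(stats, key=lambda s: s.searches, reverse=True)[:top_k]
    for s in top:
        if s.searches == 0 or s.clicks == 0:
            continue
        usage = s.pair_clicks / s.clicks    # N3 / N4
        click_rate = s.clicks / s.searches  # assumed definition of click rate
        if usage > t2 and s.searches > t3 and s.clicks > t4 and click_rate > t5:
            return True
    return False


stats = [QueryStats("dress", 1000, 400, 200), QueryStats("cheap dress", 50, 5, 5)]
print(pair_is_supported(stats, top_k=2, t2=0.3, t3=100, t4=100, t5=0.2))  # True
print(pair_is_supported(stats, top_k=2, t2=0.6, t3=100, t4=100, t5=0.2))  # False
```

Requiring all four conditions on the same search word keeps low-traffic search words, whose ratios are noisy, from validating a pair on their own.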
Optionally, after the step of determining the mapping relationship between the candidate word and the semantic item corresponding to the candidate word through the history search data, the method includes: judging whether the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises the following steps: and judging whether the semantic item corresponding to the candidate word in the mapping relation between the candidate word and the semantic item corresponding to the candidate word is similar to the candidate word, if the semantic item corresponding to the candidate word is similar to the candidate word, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, otherwise, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is wrong.
Optionally, the step of determining whether the semantic item corresponding to the candidate word in the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is similar to the candidate word includes: acquiring user search data in a preset historical time period, wherein the user search data comprises search words and click quantity or search quantity of a plurality of categories of preset levels of a searched text sequence by a user; generating semantic item vectors corresponding to the first candidate words of the preset levels according to the click rate or search volume of the multiple classes of the preset levels when the search words comprise semantic items corresponding to the candidate words, and generating the first candidate word vectors of the preset levels according to the click rate or search volume of the multiple classes of the preset levels when the search words comprise the candidate words; respectively calculating vector similarity between semantic item vectors corresponding to the first candidate words of the preset levels and the first candidate word vectors to obtain a plurality of first vector similarities respectively corresponding to the preset levels; and judging whether the similarity of each first vector is greater than a preset sixth threshold corresponding to a corresponding preset level, if so, the semantic item corresponding to the candidate word is similar to the candidate word, and otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
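The category-based similarity test can be sketched as follows, assuming each vector is a dictionary from category name to click count, one such vector per category level, and cosine similarity as the measure. The category names, counts and thresholds are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {category: count}."""
    dot = sum(count * v.get(category, 0) for category, count in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def similar_at_every_level(item_vectors, word_vectors, thresholds):
    """The pair counts as similar only if every level clears its own threshold."""
    return all(
        cosine(item_vectors[level], word_vectors[level]) > thresholds[level]
        for level in thresholds
    )


# Click counts per category when the search word contains the semantic item ...
item_vectors = {1: {"apparel": 90, "home": 10}, 2: {"dresses": 80, "skirts": 20}}
# ... and when the search word contains the candidate word.
word_vectors = {1: {"apparel": 100}, 2: {"dresses": 100}}

print(similar_at_every_level(item_vectors, word_vectors, {1: 0.9, 2: 0.9}))   # True
print(similar_at_every_level(item_vectors, word_vectors, {1: 0.9, 2: 0.98}))  # False
```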
Optionally, the step of determining whether the semantic item corresponding to the candidate word in the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is similar to the candidate word includes: generating semantic item vectors and second candidate word vectors corresponding to the second candidate words through a preset model; calculating the vector similarity of the semantic item vector corresponding to the second candidate word and the second candidate word vector to obtain a second vector similarity; and judging whether the second vector similarity is larger than a seventh threshold, if so, the semantic item corresponding to the candidate word is similar to the candidate word, otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
Optionally, the vector similarity is a cosine similarity or a jaccard similarity.
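As a small illustration of the second option, Jaccard similarity can be computed on the sets of categories in which each term received clicks. Treating the vectors as category sets is an assumption made here for illustration; the text does not say how the vectors are binarized.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


item_categories = {"apparel", "home"}  # categories clicked for the semantic item
word_categories = {"apparel"}          # categories clicked for the candidate word
print(jaccard(item_categories, word_categories))  # 0.5
```

Unlike cosine similarity, this measure ignores how many clicks each category received, so it is coarser but less sensitive to a single dominant category.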
Optionally, the step of correcting the first index according to the mapping relationship between the candidate word and the semantic item corresponding to the candidate word includes: when the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct: if the mapping relation between the candidate word and the semantic item corresponding to the candidate word does not exist in the first index, adding the mapping relation to the first index; when the mapping relation between the candidate word and the semantic item corresponding to the candidate word is wrong: and if the mapping relation between the candidate word and the semantic item corresponding to the candidate word exists in the first index, deleting the mapping relation from the first index.
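The correction rule can be sketched in a few lines. The representation of the first index as a dictionary from word to its set of semantic items, and of the judgment results as a dictionary from (semantic item, word) to a boolean, are assumptions for illustration.

```python
def correct_index(first_index, judged_pairs):
    """Return the second index: add pairs judged correct, delete pairs judged wrong."""
    second_index = {word: set(items) for word, items in first_index.items()}
    for (item, word), is_correct in judged_pairs.items():
        if is_correct:
            # Add the mapping if it is not already present.
            second_index.setdefault(word, set()).add(item)
        else:
            # Delete the mapping only if it exists; discard is a no-op otherwise.
            second_index.get(word, set()).discard(item)
    return second_index


first_index = {"sundress": {"sun", "dress"}}
judged_pairs = {("sun", "sundress"): False, ("dress", "maxidress"): True}
second_index = correct_index(first_index, judged_pairs)
print(second_index)  # {'sundress': {'dress'}, 'maxidress': {'dress'}}
```

The function copies the first index rather than mutating it, so both indexes remain available for comparison.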
According to another aspect of the embodiments of the present invention, an apparatus for creating an index is provided.
An apparatus for building an index, comprising: the word pair generation module is used for dividing an input text sequence into a plurality of words and obtaining the mapping relation between the plurality of words and the semantic items corresponding to the words according to the plurality of words and the substrings of the plurality of words; the index establishing module is used for establishing a first index according to the mapping relation between the words and the semantic items corresponding to the words; and the correction module is used for determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word through historical search data, correcting the first index according to the mapping relation between the candidate word and the semantic item corresponding to the candidate word, and obtaining a second index.
Optionally, the historical search data includes search words and corresponding text sequences clicked by the user, and the correction module includes a first determination module configured to: segment the search words and the text sequences clicked by the user to obtain search word segments and corresponding text sequence segments; and determine the mapping relationship between the candidate word and the semantic item corresponding to the candidate word according to those search word segments and text sequence segments that have an inclusion relation with each other.
Optionally, the historical search data includes search words, corresponding text sequences clicked by the user, and search tracks formed by the search words, and the correction module includes a second determination module configured to: acquire new words that were added to the first index in advance; search, in a search track containing a new word, for a search word having an inclusion relation with the new word; and determine the mapping relationship between the candidate word and the semantic item corresponding to the candidate word according to the new word and the search word having the inclusion relation with the new word.
Optionally, the historical search data includes search words and corresponding text sequences clicked by the user, and the correction module includes a third determination module configured to: acquire new words that were added to the first index in advance; and determine the mapping relationship between the candidate word and the semantic item corresponding to the candidate word according to the new word and those search words, among the search words corresponding to a specific text sequence, that are substrings of the new word, wherein the specific text sequence is a text sequence clicked by the user that includes the new word.
Optionally, the correction module includes a first determination module configured to: judge whether the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is correct, which comprises: judging whether the total usage proportion of the semantic item corresponding to the candidate word in the mapping relationship exceeds a first threshold; if so, the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is judged to be correct; otherwise, it is judged to be wrong. The total usage proportion of the semantic item corresponding to the candidate word is the ratio of N1 to N2, where N1 is, over all text sequences that correspond to search words including the semantic item corresponding to the candidate word and that are related to the mapping relationship, the sum of the click amounts recalled through the semantic item corresponding to the candidate word, and N2 is the sum of the click amounts of all text sequences corresponding to the search words that include the semantic item corresponding to the candidate word.
Optionally, the word pair determining module includes a second determining module configured to: judge whether the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is correct, which comprises: among all search words that include the semantic item corresponding to the candidate word in the mapping relationship, taking a preset number of search words with the highest usage rate, and calculating, for each of them, the usage proportion of the semantic item corresponding to the candidate word; and judging whether at least one of these search words meets all of the following conditions: the usage proportion of the semantic item corresponding to the candidate word for the search word is larger than a second threshold, the search amount of the search word is larger than a third threshold, the click amount of the text sequences corresponding to the search word is larger than a fourth threshold, and the click rate of the text sequences corresponding to the search word is larger than a fifth threshold. If at least one such search word exists, the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is correct; otherwise, it is wrong. The usage proportion of the semantic item corresponding to the candidate word for a search word is the ratio of N3 to N4, where N3 is, over the text sequences that correspond to the search word and are related to the mapping relationship, the sum of the click amounts recalled through the semantic item corresponding to the candidate word, and N4 is the sum of the click amounts of all text sequences corresponding to the search word.
Optionally, the word pair determining module includes a third determining module, configured to: judging whether the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises the following steps: and judging whether the semantic item corresponding to the candidate word in the mapping relation between the candidate word and the semantic item corresponding to the candidate word is similar to the candidate word, if the semantic item corresponding to the candidate word is similar to the candidate word, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, otherwise, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is wrong.
Optionally, the third determining module includes a first similarity determining module, configured to: acquiring user search data in a preset historical time period, wherein the user search data comprises search words and click quantity or search quantity of a plurality of categories of preset levels of a searched text sequence by a user; generating semantic item vectors corresponding to the first candidate words of the preset levels according to the click rate or search volume of the multiple classes of the preset levels when the search words comprise semantic items corresponding to the candidate words, and generating the first candidate word vectors of the preset levels according to the click rate or search volume of the multiple classes of the preset levels when the search words comprise the candidate words; respectively calculating vector similarity between semantic item vectors corresponding to the first candidate words of the preset levels and the first candidate word vectors to obtain a plurality of first vector similarities respectively corresponding to the preset levels; and judging whether the similarity of each first vector is greater than a preset sixth threshold corresponding to a corresponding preset level, if so, the semantic item corresponding to the candidate word is similar to the candidate word, and otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
Optionally, the third determining module includes a second similarity determining module, configured to: generating semantic item vectors and second candidate word vectors corresponding to the second candidate words through a preset model; calculating the vector similarity of the semantic item vector corresponding to the second candidate word and the second candidate word vector to obtain a second vector similarity; and judging whether the second vector similarity is larger than a seventh threshold, if so, the semantic item corresponding to the candidate word is similar to the candidate word, otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
Optionally, the vector similarity is a cosine similarity or a jaccard similarity.
Optionally, the correction module is further configured to: when the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is correct, add the mapping relationship to the first index if it does not already exist in the first index; and when the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is wrong, delete the mapping relationship from the first index if it exists in the first index.
According to yet another aspect of an embodiment of the present invention, a server is provided.
A server, comprising: one or more processors; memory storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement a method of establishing an index.
According to yet another aspect of an embodiment of the present invention, a computer-readable medium is provided.
A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of establishing an index.
One embodiment of the above invention has the following advantages or benefits: dividing an input text sequence into a plurality of words, obtaining the mapping relation between the plurality of words and semantic items corresponding to the words according to the plurality of words and substrings of the plurality of words, establishing a first index according to the mapping relation between the plurality of words and the semantic items corresponding to the words, determining the mapping relation between candidate words and the semantic items corresponding to the candidate words through historical search data, and correcting the first index according to the mapping relation between the candidate words and the semantic items corresponding to the candidate words to obtain a second index. In the commodity index recall stage, the recall rate can be ensured so as to recall commodities related to the search terms of the user as much as possible, and the accuracy can be improved so as to filter out irrelevant commodities as much as possible.
Further effects of the above optional, non-conventional implementations are described below in connection with the specific embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main steps of a method for creating an index according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the main modules of an apparatus for creating an index according to an embodiment of the present invention;
FIG. 3 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
FIG. 4 is a schematic block diagram of a computer system suitable for use with a server implementing an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a schematic diagram of main steps of a method for establishing an index according to an embodiment of the present invention.
For convenience of description, the mapping relationship between a word and its corresponding semantic item is expressed as a "semantic item-word pair", and the mapping relationship between a candidate word and its corresponding semantic item is expressed as a "candidate semantic item-word pair". The semantic item in a "semantic item-word pair" is the semantic item corresponding to the word in that pair. The word in a "candidate semantic item-word pair" may also be called the candidate word, and the candidate semantic item in the pair is the semantic item corresponding to that candidate word. The candidate word and its corresponding semantic item are a word and a semantic item determined according to the historical search data of users; accordingly, a "candidate semantic item-word pair" is a mapping relationship between a word and a semantic item determined according to the historical search data of users. Whether this mapping relationship is correct must be determined by a judgment criterion (the specific criterion for judging whether a candidate semantic item-word pair is correct is described in detail below). If a candidate semantic item-word pair is correct, that is, the mapping relationship between the candidate word and its corresponding semantic item is correct, the candidate semantic item is a semantic item of the candidate word; if a candidate semantic item-word pair is incorrect, that is, the mapping relationship is incorrect, the candidate semantic item is not a semantic item of the candidate word.
As shown in fig. 1, the method for establishing an index according to the embodiment of the present invention mainly includes the following steps S101 to S104.
Step S101: dividing an input text sequence into a plurality of words, and obtaining a plurality of semantic item-word pairs according to the plurality of words and substrings of the plurality of words.
The text sequence entered may be commodity title information.
The input text sequence is segmented according to a stored word segmentation vocabulary, that is, it is split into a plurality of independent words according to that vocabulary, thereby obtaining the plurality of words. All substrings of each of the plurality of words are then found, where a substring of a word consists of one character of the word or a combination of two or more adjacently positioned characters of the word. For example, if one word is "Gaoligh", the substrings of the word comprise: "Gao", "le", "Gao" and "le Gao". All the substrings of a word are regarded as semantic items of the word, so that a plurality of semantic item-word pairs are obtained; the word in a semantic item-word pair may also be called the original word of the semantic item.
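The substring enumeration of step S101 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function names are hypothetical, the substrings are taken at character level, and excluding the word itself from its own semantic items is an assumption made here.

```python
def proper_substrings(word: str) -> list[str]:
    """Return each distinct contiguous substring of `word`, excluding `word` itself.

    Excluding the full word is an assumption of this sketch.
    """
    found: list[str] = []
    for start in range(len(word)):
        for end in range(start + 1, len(word) + 1):
            piece = word[start:end]
            if piece != word and piece not in found:
                found.append(piece)
    return found


def semantic_item_word_pairs(words: list[str]) -> list[tuple[str, str]]:
    """Pair every substring (semantic item) with the word it came from (original word)."""
    return [(item, word) for word in words for item in proper_substrings(word)]
```

A word of n characters yields on the order of n squared substrings, which is why the later steps are needed to filter out inappropriate pairs.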
Step S102: a first index is built from a plurality of semantic item-word pairs.
Since all substrings of a word are regarded as semantic items of the word, some appropriate semantic items (i.e., semantic items whose mapping relationship to the word is correct) are included, for example: tv-tv set and adis-adidas, where "tv" is a semantic item of "tv set" and "adis" is a semantic item of "adidas". Commodities recalled when users search with such semantic items as search words generally meet the users' expectations. Some inappropriate semantic items (i.e., semantic items whose mapping relationship to the word is incorrect) are also included, for example: pineapple-jackfruit and happy-happy high. A user searching for "pineapple" or "happy" evidently does not expect "jackfruit" or "happy high" among the recalled commodities, so commodities recalled when users search with such semantic items as search words generally do not meet the users' expectations.
The first index of the embodiments of the present invention may be an inverted index. A word-document matrix is established to store the mapping relation between words (the search words and the semantic items correspond to words) and documents (the text sequences correspond to documents), so that the set of documents containing a given word can be found quickly through the inverted index.
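A first index of this kind might be sketched as a dictionary from each word or semantic item to the set of titles that contain it. The function name, the integer title identifiers and the pre-segmented input are assumptions of this sketch; a production index would use a dedicated search engine structure rather than an in-memory dictionary.

```python
from collections import defaultdict


def build_inverted_index(titles: dict[int, list[str]]) -> dict[str, set[int]]:
    """Map every word, and every substring of every word, to the ids of titles containing it.

    `titles` maps a title id to the list of words obtained by segmenting that title.
    """
    index: dict[str, set[int]] = defaultdict(set)
    for title_id, words in titles.items():
        for word in words:
            # Every contiguous substring (including the word itself) becomes a posting key.
            for start in range(len(word)):
                for end in range(start + 1, len(word) + 1):
                    index[word[start:end]].add(title_id)
    return dict(index)
```

With this structure, looking up the semantic item "fan" returns both titles that contain the word "fan" and titles that contain the word "electric fan".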
After the first index is established, candidate semantic item-word pairs are determined from historical search data, and the first index is corrected according to the candidate semantic item-word pairs to obtain a second index. Specifically, the following steps S103 and S104 may be included.
Step S103: determining candidate semantic item-word pairs according to the historical search data of users, and judging whether the candidate semantic item-word pairs are correct.
The historical search data of users may be obtained from the user search click logs of the App (application) end and the PC (personal computer) end. The historical search data may be the queries (search words) used by users in searches over the last three months and the text sequences they clicked, where a text sequence may be a commodity title written and entered by a merchant. The association between a query (search word) and the clicked commodity title may be expressed as: query-commodity title. The word in a candidate semantic item-word pair (i.e., the candidate word) may also be called the original word of the candidate semantic item in that pair.
Specifically, the candidate semantic item-word pairs may be determined according to the historical search data of users in the following three modes (mode A1 to mode A3).
Mode A1: segmenting the search words and the text sequences clicked by users to obtain search word segments and corresponding text sequence segments; and determining candidate semantic item-word pairs from those search word segments and text sequence segments that have an inclusion relation with each other. Here the historical search data may include search words and the corresponding text sequences clicked by users.
Before the search words and the text sequences clicked by users are segmented, normalization may be performed on the queries (search words) and the commodity titles. The normalization may include converting uppercase letters to lowercase, converting traditional Chinese characters to simplified Chinese characters, converting full-width characters to half-width characters, and the like.
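The normalization described above might look like the following sketch. It relies on Unicode NFKC normalization for the full-width to half-width conversion, which also folds some other compatibility characters; the traditional-to-simplified step is represented by a caller-supplied character mapping, a hypothetical stand-in for a real conversion table.

```python
import unicodedata
from typing import Optional


def normalize(text: str, trad_to_simp: Optional[dict[str, str]] = None) -> str:
    """Normalize a query or commodity title before segmentation."""
    # NFKC maps full-width letters, digits and spaces to their half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Uppercase to lowercase.
    text = text.lower()
    # Traditional to simplified, using an assumed character-level mapping.
    if trad_to_simp:
        text = "".join(trad_to_simp.get(ch, ch) for ch in text)
    return text
```

Applying the same function to queries and titles keeps the later substring comparisons from failing on case or width differences alone.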
To determine candidate semantic item-word pairs from the search word segments and text sequence segments that have an inclusion relation, the commodity title is scanned and every character string that contains a word of the query is selected to form a candidate semantic item-word pair with that word. Only character strings that contain the word as a proper substring are selected; a character string identical to the word is not considered. For example:
the query is: lumbar vertebra traction;
the commodity title is: Yonghuib 04 human body stretcher cervical vertebra and lumbar vertebra traction bed traction frame cervical vertebra traction apparatus for household use;
then, by scanning the commodity title, the character strings in the commodity title that contain the query word "traction" are "traction bed", "traction frame" and "traction apparatus", and the candidate semantic item-word pairs formed are: traction-traction bed, traction-traction frame, traction-traction apparatus. The "traction" in the commodity title is the same character string as the "traction" in the query and is therefore not selected.
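The selection rule of mode A1 can be sketched as below, assuming the query and the clicked title have already been normalized and segmented. The function name and the token lists are illustrative only.

```python
def mode_a1_pairs(query_tokens: list[str], title_tokens: list[str]) -> list[tuple[str, str]]:
    """Return (candidate semantic item, candidate word) pairs for one query-title record.

    A title token is kept only when it strictly contains a query token;
    identical strings are skipped, as described for mode A1.
    """
    pairs: list[tuple[str, str]] = []
    for q in query_tokens:
        for t in title_tokens:
            if q != t and q in t and (q, t) not in pairs:
                pairs.append((q, t))
    return pairs
```

For the traction example, the query token "traction" pairs with "traction bed", "traction frame" and "traction apparatus", while the title token "traction" itself is skipped.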
Mode A1 uses the query-commodity title associations in the historical search feedback data of users to determine some semantic item-word mapping relationships and to establish the corresponding candidate semantic item-word pairs. Whether the mapping relationships between semantic items and words in the first index are appropriate is then determined by judging whether these candidate semantic item-word pairs are correct.
As new things appear, or new names arise for existing things, new words continually emerge. When a new word is added to the existing word segmentation vocabulary, semantic items of the new word need to be added correspondingly. For example, a new product called "octopus foot" appears; when a user searches for "octopus", commodities related to "octopus foot" may be what the user is searching for, so the corresponding semantic item "octopus" needs to be added for "octopus foot", that is, the semantic item-word pair is determined as: octopus-octopus foot. New words may be recognized by various existing new word recognition algorithms, for example the following: collect the commodity titles of the whole commodity library; join adjacent characters into candidate new words below 4 characters in length; evaluate the left and right information entropy and the cohesion of each candidate new word; when the left and right information entropy and the cohesion of a candidate new word are all greater than a certain threshold, judge the candidate to be an independent word; then take the difference set between the independent words and the stored word segmentation vocabulary, so that an independent word that does not exist in the stored vocabulary is taken as a new word. Mode A2 and mode A3 of the embodiments of the present invention describe methods for adding semantic items to recognized new words.
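Part of such a new word recognition step can be sketched as follows. Only the left and right information entropy and the final difference set are shown; the cohesion measure and the thresholds are not specified in the text and are omitted. Function names are hypothetical and natural logarithms are an assumption.

```python
import math
from collections import Counter


def neighbor_entropy(neighbors: Counter) -> float:
    """Shannon entropy (natural log) of a distribution of neighboring characters."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def left_right_entropy(candidate: str, titles: list[str]) -> tuple[float, float]:
    """Entropy of the characters seen immediately left and right of `candidate` in the titles."""
    left: Counter = Counter()
    right: Counter = Counter()
    for title in titles:
        pos = title.find(candidate)
        while pos != -1:
            if pos > 0:
                left[title[pos - 1]] += 1
            end = pos + len(candidate)
            if end < len(title):
                right[title[end]] += 1
            pos = title.find(candidate, pos + 1)
    return neighbor_entropy(left), neighbor_entropy(right)


def new_words(independent_words: set[str], vocabulary: set[str]) -> set[str]:
    """Independent words that are absent from the stored segmentation vocabulary."""
    return independent_words - vocabulary
```

A high entropy on both sides indicates that the candidate appears in many different contexts, which is the usual reason for treating it as an independent word.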
Mode A2: acquiring new words that have been added to the first index in advance; searching the search tracks that contain a new word for search words having an inclusion relation with the new word, and determining candidate semantic item-word pairs according to the new word and the search words having an inclusion relation with it. Here the historical search data may include search words and the corresponding text sequences clicked by users, as well as the search tracks formed by the search words.
After a new word is recognized, it can be added to the existing word segmentation vocabulary; when queries and commodity titles are segmented, the new word in the vocabulary is then split out as an independent word and added to the first index.
The search track may be the information of a user search session. A user search session is the information about the further search words a user switches to after searching with one word, for reasons such as not obtaining the desired search results (for example, obtaining too few or too many results); it represents a series of related search behaviors of the user. For example, after a user searches with "shrimp" as the search word, the commodity titles found include various kinds of shrimp; if the result the user wants is "crayfish", the user changes the search word to "crayfish". As another example, after a user searches with "snack durian shortbread" as the search word, few commodities are returned, or some commodity titles include both durian shortbread and durian cake; wishing to find more related commodities, the user may change the search word to "snack durian". From the user search session information, the search tracks in these two examples can be obtained: "shrimp → crayfish"; "snack durian shortbread → snack durian". Search words having an inclusion relation with the new word are then looked up in the search tracks containing the new word, and the new word together with each such search word is determined as a candidate semantic item-word pair. In the two examples above, assuming that "crayfish" and "durian" are new words, and that in the search tracks "shrimp" and "durian shortbread" are the search words having an inclusion relation with the new words "crayfish" and "durian" respectively, the candidate semantic item-word pairs can be determined as: shrimp-crayfish; durian-durian shortbread.
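Mode A2 might be sketched as below. A search track is modelled as a list of search words from one session, already segmented. Treating the inclusion relation as holding in either direction, with the shorter string taken as the candidate semantic item, is an assumption made here to reproduce the durian example; the function name is hypothetical.

```python
def mode_a2_pairs(new_word: str, search_tracks: list[list[str]]) -> list[tuple[str, str]]:
    """Return (candidate semantic item, candidate word) pairs found in tracks containing `new_word`."""
    pairs: list[tuple[str, str]] = []
    for track in search_tracks:
        if new_word not in track:
            continue  # only tracks that contain the new word are examined
        for query in track:
            if query == new_word:
                continue
            if query in new_word:
                pair = (query, new_word)
            elif new_word in query:
                pair = (new_word, query)
            else:
                continue
            if pair not in pairs:
                pairs.append(pair)
    return pairs
```

Tracks that never mention the new word contribute nothing, so unrelated sessions do not generate candidate pairs.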
Mode A3: acquiring new words that have been added to the first index in advance; and determining candidate semantic item-word pairs according to a new word and those search words, among the search words corresponding to specific text sequences, that are substrings of the new word. A specific text sequence is a text sequence that was clicked by users and contains the new word. Here the historical search data may include search words and the corresponding text sequences clicked by users.
A new word in the commodity titles may be added to the stored word segmentation vocabulary, so that when queries and commodity titles are segmented, the new word in the vocabulary is split out as an independent word and added to the first index. The candidate semantic item-word pairs corresponding to the new word may then be determined based on the historical search data of users. Specifically, the historical search data of the last three months may be used, which includes search words and the corresponding text sequences clicked by users. The text sequences clicked by users include some specific text sequences that contain the new word; candidate semantic item-word pairs are determined from the new word and those search words, among the search words corresponding to the specific text sequences, that are substrings of the new word. For example, if "adidas" in a commodity title is a new word, then according to mode A3 the queries used by users in the last three months and the commodity titles they clicked may be used: among the queries corresponding to clicked commodity titles that contain the new word "adidas", the queries that are substrings of "adidas" are selected, and each such substring together with "adidas" is determined as a candidate semantic item-word pair. If a query that is a substring of the new word "adidas" is "adis", the candidate semantic item-word pair is determined as "adis-adidas".
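Mode A3 can be sketched as a scan over query-title click records. The record layout and the function name are assumptions of this sketch, and the substring test is a literal string test, so the English tokens in the usage note are chosen to satisfy it.

```python
def mode_a3_pairs(new_word: str, click_log: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return (candidate semantic item, new word) pairs.

    `click_log` holds (query, clicked_title) records. A query qualifies when the
    clicked title contains the new word and the query is a proper substring of it.
    """
    pairs: list[tuple[str, str]] = []
    for query, title in click_log:
        if new_word in title and query != new_word and query in new_word:
            if (query, new_word) not in pairs:
                pairs.append((query, new_word))
    return pairs
```

For instance, a log containing the record ("adi", "adidas running shoes") yields the pair ("adi", "adidas"), whereas a click on a title without "adidas" is ignored.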
It should be noted that the method for determining candidate semantic item-word pairs according to the historical search data of users in the embodiments of the present invention may adopt one of the above three modes (mode A1 to mode A3) or a combination of several of them.
Whether a candidate semantic item-word pair determined by mode A1 to mode A3 is correct can be judged in one of the following three modes (mode B1 to mode B3).
The following illustrates, for mode B1, mode B2 and mode B3 respectively, the specific method of judging whether a candidate semantic item-word pair is correct. For example, a candidate semantic item-word pair is "fan-electric fan", and the following are the search and click data for the 20 search words (queries) that contain the candidate semantic item "fan" and whose corresponding commodity titles have the largest click volume.
1. A fan: 416734/494199(0.84)1690445||
2. The fan falls to the ground: 73795/80872(0.91)196629||
3. The beautiful fan: 67226/68015(0.99)147385||
4. A bladeless fan: 23185/47864(0.48)133882||
5. Desk-top fan: 23055/28270(0.82)94839||
6. An eimet fan: 18767/21304(0.88)43743||
7. Lattice force fan: 17340/21056(0.82)45162||
8. The fan is little: 14204/26035(0.55)132645||
9. A floor fan: 13445/14779(0.91)36242||
10. Desk-top fan: 11801/14479(0.82)43809||
11. A refrigeration fan: 9494/13310(0.71)25880||
12. A mini fan: 6645/31710(0.21)98000||
13. The usb fan: 6382/44415(0.14)113902||
14. A mute fan: 6114/8217(0.74)24376||
15. Muting a fan: 5648/7432(0.76)25114||
16. An Oerss fan: 3934/4628(0.85)12116||
17. The fan is mini: 3488/10803(0.32)34133||
18. Remote control of the fan: 3280/3534(0.93)7300||
19. An industrial fan: 3246/4319(0.75)13773||
20. American fan stand fan: 3211/3212(1)6540||
For these 20 queries, the click volume of the commodity titles corresponding to a query is the sum of users' clicks on each commodity title searched and clicked using that query. Each of the 20 records of search and click data has the format: "query1: value of N3/value of N4 (result of N3/N4) search volume of query1 ||", where query1 represents any one of the above 20 queries and N3/N4 is the ratio of N3 to N4. For example, if query1 is "the fan falls to the ground", the corresponding record is "The fan falls to the ground: 73795/80872(0.91)196629||", in which 73795 is the value of N3, 80872 is the value of N4, the result of N3/N4 is 0.91, and the search volume of query1 is 196629. The specific meanings of N3 and N4 are explained in detail below.
N3/N4 is the usage proportion of the candidate semantic item corresponding to a search word; it is computed for the preset number of search words with the highest usage among all search words that include the candidate semantic item of the candidate semantic item-word pair. N3 is, among the clicks on the text sequences that correspond to the search word and are related to the candidate semantic item-word pair, the sum of the clicks recalled through the candidate semantic item. The text sequences that correspond to the search word and are related to the candidate semantic item-word pair are those text sequences, among the text sequences searched and clicked using the search word, that are related to the candidate semantic item-word pair. A text sequence related to the candidate semantic item-word pair is a text sequence that contains the candidate semantic item or the candidate word of the pair, among the text sequences searched and clicked using a search word (query) that includes the candidate semantic item of the pair. The sum of the clicks recalled through the candidate semantic item is the sum of users' clicks on those text sequences that contain the candidate word of the pair, among the text sequences related to the pair that were searched and clicked using a search word including the candidate semantic item of the pair.
In this example the candidate semantic item-word pair is "fan-electric fan". Assume that 1000 commodity titles are found through the query "the fan falls to the ground", of which 800 include "fan" or "electric fan" (for example, a commodity title such as "beautiful floor fan" or "beautiful floor electric fan") and 200 include neither "fan" nor "electric fan" (for example, a commodity title such as "beautiful floor air cooler"). The 800 commodity titles that include "fan" or "electric fan" are then the text sequences related to the candidate semantic item-word pair "fan-electric fan", whereas the 200 commodity titles that include neither are not text sequences related to this pair (they may be text sequences related to other candidate semantic item-word pairs). Further assume that, of the 800 commodity titles, 300 include "electric fan" and the other 500 include only "fan". By the definition of N3 (among the clicks on the text sequences that correspond to the search word and are related to the candidate pair, the sum of the clicks recalled through the candidate semantic item), N3 in this example is the sum of users' clicks on the 300 commodity titles that include "electric fan". In the search and click data of this example, "The fan falls to the ground: 73795/80872(0.91)196629||", the sum of users' clicks on these 300 commodity titles is 73795, i.e., N3 is 73795.
N4 is the sum of the clicks on the text sequences corresponding to the search word. Specifically, it is the sum of the clicks on the text sequences searched and clicked using the search word, given that at least one of those text sequences contains the candidate word of the candidate semantic item-word pair. In this example the candidate semantic item-word pair is "fan-electric fan" and the query is "the fan falls to the ground"; assuming that users find 1000 commodity titles with this query, at least one of the 1000 commodity titles should contain "electric fan", and the total clicks on the 1000 commodity titles are counted as N4. In the search and click data of this example, "The fan falls to the ground: 73795/80872(0.91)196629||", the total number of clicks on the 1000 commodity titles is 80872, i.e., N4 is 80872. N3/N4, the ratio of N3 to N4, is 73795/80872 = 0.91. If all the commodity titles found with the query were related only to other candidate semantic item-word pairs, the resulting data would be meaningless for judging whether the candidate pair "fan-electric fan" is correct. It is therefore required that at least one of the 1000 commodity titles contain the candidate word, which ensures that at least one of the searched and clicked commodity titles is related to the candidate pair "fan-electric fan".
The search volume of query1 is the total number of text sequences (such as titles) searched by users when using query1 as the search word. In the search and click data of this example, if query1 is "the fan falls to the ground" and the corresponding record is "The fan falls to the ground: 73795/80872(0.91)196629||", the search volume of the search word "the fan falls to the ground" is 196629.
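The record format described above can be read mechanically. The following sketch parses one record and recomputes the usage proportion N3/N4; the class name, field names and regular expression are assumptions of this sketch, and anything after the search volume is ignored.

```python
import re
from typing import NamedTuple


class QueryStats(NamedTuple):
    query: str
    n3: int             # clicks recalled through the candidate semantic item
    n4: int             # clicks on all text sequences for this query
    search_volume: int

    @property
    def usage_ratio(self) -> float:
        """Usage proportion N3/N4 of the candidate semantic item for this query."""
        return self.n3 / self.n4


# "query: N3/N4(ratio)search_volume", trailing characters are not examined.
_RECORD = re.compile(r"^(?P<query>.+?):\s*(?P<n3>\d+)/(?P<n4>\d+)\s*\([\d.]+\)\s*(?P<pv>\d+)")


def parse_record(line: str) -> QueryStats:
    match = _RECORD.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized record: {line!r}")
    return QueryStats(match["query"].strip(), int(match["n3"]), int(match["n4"]), int(match["pv"]))
```

Recomputing the ratio from N3 and N4, rather than trusting the printed value, avoids rounding differences when the records are later ranked.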
Mode B1: judging whether the total usage proportion of the candidate semantic item in the candidate semantic item-word pair exceeds a first threshold (set as threshold_a); if so, the candidate semantic item-word pair is correct, otherwise it is incorrect. The total usage proportion of the candidate semantic item is the ratio of N1 to N2, where N1 is, among the clicks on all text sequences that correspond to search words including the candidate semantic item and are related to the candidate semantic item-word pair, the sum of the clicks recalled through the candidate semantic item, and N2 is the sum of the clicks on all text sequences corresponding to search words including the candidate semantic item.
In this example, assume that there are W search words that include the candidate semantic item "fan" (the 20 queries listed above are those among them whose corresponding commodity titles have the largest click volume), and that users search and click 5000 commodity titles using the W queries. Of these, 3000 are commodity titles related to the candidate semantic item-word pair (i.e., titles among the 5000 that include "fan" or "electric fan"), and 2000 of those 3000 include "electric fan". Then N1 is the sum of users' clicks on the 2000 commodity titles that include "electric fan". N2 is the sum of the clicks on all text sequences corresponding to the search words that include the candidate semantic item; specifically, it is the sum of the clicks on the text sequences found by all search words including the candidate semantic item, given that at least one of those text sequences contains the candidate word of the pair. In this example, of the 5000 commodity titles searched and clicked using the W queries, at least one should include "electric fan". If N1 is 916978 and N2 is 1339534, then N1/N2, the ratio of N1 to N2, is 916978/1339534 = 0.68.
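The mode B1 decision might be sketched as follows. The input is assumed to be the list of (title, click count) records gathered over all queries that include the candidate semantic item; the function names and the threshold values used in the usage note are hypothetical, since the text does not give a value for threshold_a.

```python
def total_usage_ratio(title_clicks: list[tuple[str, int]], candidate_word: str) -> float:
    """Return N1/N2 for one candidate semantic item-word pair.

    N1 sums the clicks on titles containing the candidate word;
    N2 sums the clicks on all titles in `title_clicks`.
    """
    n1 = sum(clicks for title, clicks in title_clicks if candidate_word in title)
    n2 = sum(clicks for _, clicks in title_clicks)
    return n1 / n2 if n2 else 0.0


def mode_b1_is_correct(ratio: float, threshold_a: float) -> bool:
    """The pair is judged correct when the total usage proportion exceeds the first threshold."""
    return ratio > threshold_a
```

With the figures of the example, 916978/1339534 is about 0.68, so the pair would pass with an assumed threshold_a of 0.5 and fail with an assumed threshold_a of 0.7.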
Mode B2: calculating the usage proportion of the candidate semantic item (set as query_sem_useRato) corresponding to each of a preset number of search words with the highest usage among all search words that include the candidate semantic item of the candidate semantic item-word pair; and judging whether at least one of these search words satisfies all of the following conditions: the usage proportion of the candidate semantic item corresponding to the search word is greater than a second threshold (set as threshold_b_useRato), the search volume of the search word (set as query_search) is greater than a third threshold (set as threshold_b_pv), the click volume of the text sequences corresponding to the search word (set as query_click) is greater than a fourth threshold (set as threshold_b_click), and the click-through rate of the text sequences corresponding to the search word (set as ctr) is greater than a fifth threshold (set as threshold_ctr). If at least one such search word exists, the candidate semantic item-word pair is correct; otherwise, the candidate semantic item-word pair is incorrect.
The usage proportion of the candidate semantic item corresponding to each search word is the ratio of N3 to N4, where N3 is, among the clicks on the text sequences that correspond to the search word and are related to the candidate semantic item-word pair, the sum of the clicks recalled through the candidate semantic item, and N4 is the sum of the clicks on the text sequences corresponding to the search word.
The specific meanings of N3 and N4 have been explained in detail above and are not repeated here. The preset number in mode B2 may be set to 10. In this example, assuming that there are W search words that include the candidate semantic item "fan", the usage proportion N3/N4 of the candidate semantic item (i.e., the ratio of N3 to N4) is calculated for each of the W search words, the W calculated usage proportions are ranked, and the 10 search words with the largest usage proportions are selected. The candidate semantic item-word pair "fan-electric fan" is then judged correct if at least one of the 10 search words satisfies: query_search > threshold_b_pv, and query_click > threshold_b_click, and query_sem_useRato > threshold_b_useRato, and ctr > threshold_ctr; otherwise the candidate semantic item-word pair is incorrect.
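The mode B2 check can be sketched as below. The field names follow the identifiers used in the text; the dataclass, the function name and every threshold value in the usage note are assumptions, and ctr is taken as a precomputed input because the text does not state how it is derived.

```python
from dataclasses import dataclass


@dataclass
class QueryFeatures:
    query: str
    query_sem_useRato: float  # N3/N4 for this query
    query_search: int         # search volume
    query_click: int          # click volume of the corresponding text sequences
    ctr: float                # click-through rate, supplied by the caller


def mode_b2_is_correct(
    features: list[QueryFeatures],
    *,
    top_n: int = 10,
    threshold_b_useRato: float,
    threshold_b_pv: int,
    threshold_b_click: int,
    threshold_ctr: float,
) -> bool:
    """True when at least one of the top_n queries by usage proportion passes all four thresholds."""
    top = sorted(features, key=lambda f: f.query_sem_useRato, reverse=True)[:top_n]
    return any(
        f.query_sem_useRato > threshold_b_useRato
        and f.query_search > threshold_b_pv
        and f.query_click > threshold_b_click
        and f.ctr > threshold_ctr
        for f in top
    )
```

Requiring search volume and click volume alongside the ratio guards against a query with a high N3/N4 that is based on only a handful of clicks.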
Mode B3: judging whether the candidate semantic item in the candidate semantic item-word pair is similar to the candidate word; if so, the candidate semantic item-word pair is correct, otherwise it is incorrect.
In mode B3, whether the candidate semantic item in the candidate semantic item-word pair is similar to the candidate word may be determined in either of two modes (mode B3-1 and mode B3-2).
Mode B3-1: acquiring user search data within a preset historical time period, where the user search data includes the search words and the users' click volumes or search volumes, over a plurality of categories at each preset level, for the text sequences found; generating, for each preset level, the semantic item vector corresponding to a first candidate word according to the click volumes or search volumes of the plurality of categories at that level when the search word includes the candidate semantic item (in the embodiments of the present invention, the semantic item vector corresponding to a candidate word may also be called the candidate semantic item vector for short, so the semantic item vector corresponding to the first candidate word may also be called the first candidate semantic item vector), and generating, for each preset level, a first candidate word vector according to the click volumes or search volumes of the plurality of categories at that level when the search word includes the candidate word; calculating, for each preset level, the vector similarity between the first candidate semantic item vector and the first candidate word vector, to obtain a plurality of first vector similarities corresponding respectively to the preset levels; and judging whether each first vector similarity is greater than a preset sixth threshold corresponding to its preset level; if so, the candidate semantic item is similar to the candidate word, otherwise the candidate semantic item is not similar to the candidate word.
Mode B3-1 is described in detail below, taking the candidate semantic item-word pair "fan-electric fan" as an example (the candidate semantic item is "fan", and the original word of the candidate semantic item "fan" is "electric fan"), with three preset levels. Whether the candidate semantic item and the candidate word in the candidate semantic item-word pair are similar can be determined using the search click data of a search end (e.g., the APP end) within a preset historical time period (e.g., six months).
When the click volumes of the original word and of the semantic item are greater than a preset value (the preset value may be set as required, for example to 100), the first candidate semantic item vectors of the three preset levels (a first-level, a second-level and a third-level first candidate semantic item vector) are generated according to the click volumes of the plurality of categories at the three preset levels when the search word includes the candidate semantic item, and the first candidate word vectors of the three preset levels (a first-level, a second-level and a third-level first candidate word vector) are generated according to the click volumes of the plurality of categories at the three preset levels when the search word includes the candidate word. Assume that the first level has 5 categories, that the first-level first candidate semantic item vector is set as v_semantic_cid1, and that the first-level first candidate word vector is set as v_word_cid1. When the search word includes "fan", 3 first-level categories are clicked, numbered (1,2,3), with corresponding click volumes (100,50,50); this forms the 5-dimensional vector (100,50,50,0,0), and normalization gives the first-level first candidate semantic item vector v_semantic_cid1 = (0.5,0.25,0.25,0,0). When the search word includes "electric fan", 2 first-level categories are clicked, numbered (3,4), with click volumes (200,400); this forms the 5-dimensional vector (0,0,200,400,0), and normalization gives the first-level first candidate word vector v_word_cid1 = (0,0,0.333,0.667,0).
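The construction of a category vector in this example can be sketched as follows. Normalization is taken to mean dividing by the sum of the clicks, which matches the figures above; the function name and the 1-based category numbering are assumptions of this sketch.

```python
def category_vector(clicks_by_category: dict[int, int], num_categories: int) -> list[float]:
    """Build a click-share vector over the categories of one level.

    Categories are numbered from 1; categories that were not clicked stay at 0.
    """
    raw = [0.0] * num_categories
    for category, clicks in clicks_by_category.items():
        raw[category - 1] = float(clicks)
    total = sum(raw)
    return [x / total for x in raw] if total else raw
```

Calling it with {1: 100, 2: 50, 3: 50} over 5 categories reproduces v_semantic_cid1, and with {3: 200, 4: 400} it reproduces v_word_cid1 up to rounding.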
Similarly, the first candidate semantic item vectors and first candidate word vectors of the other two levels (second and third) can be generated. Denote the second-level first candidate semantic item vector by v_semantic_cid2, the third-level first candidate semantic item vector by v_semantic_cid3, the second-level first candidate word vector by v_word_cid2, and the third-level first candidate word vector by v_word_cid3. The cosine similarity or Jaccard similarity is then calculated between the first-level first candidate semantic item vector and the first-level first candidate word vector, between the second-level first candidate semantic item vector and the second-level first candidate word vector, and between the third-level first candidate semantic item vector and the third-level first candidate word vector.
Taking the cosine similarity between the first candidate semantic item vector and the first candidate word vector of each preset level as an example:
the cosine similarity between the first-level first candidate semantic item vector and the first-level first candidate word vector (denoted Sim_cid1_cos) is:
Sim_cid1_cos=cosine(v_word_cid1,v_semantic_cid1);
the cosine similarity between the second-level first candidate semantic item vector and the second-level first candidate word vector (denoted Sim_cid2_cos) is:
Sim_cid2_cos=cosine(v_word_cid2,v_semantic_cid2);
the cosine similarity between the third-level first candidate semantic item vector and the third-level first candidate word vector (denoted Sim_cid3_cos) is:
Sim_cid3_cos=cosine(v_word_cid3,v_semantic_cid3).
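As an illustration of the cosine formulas above, the following sketch computes the first-level similarity for the two normalized vectors of the "fan" example. The helper name `cosine` mirrors the notation of the formulas; returning 0.0 for a zero vector is an assumption of this sketch, since the source does not say how that case is handled.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_u * norm_v)


v_semantic_cid1 = [0.5, 0.25, 0.25, 0.0, 0.0]
v_word_cid1 = [0.0, 0.0, 1 / 3, 2 / 3, 0.0]

# The two vectors only overlap on category 3, so the similarity is low.
sim_cid1_cos = cosine(v_word_cid1, v_semantic_cid1)
```

For these two vectors the result is 1/sqrt(30), roughly 0.18. Whether that counts as similar depends on the sixth threshold configured for the first level.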
Taking the Jaccard similarity between the first candidate semantic item vector and the first candidate word vector of each preset level as an example, the Jaccard similarity between the third-level first candidate semantic item vector and the third-level first candidate word vector (denoted Sim_cid3_jacc) is:
$$\mathrm{Sim\_cid3\_jacc} = \frac{|v_{word3} \cap v_{se3}|}{|v_{word3} \cup v_{se3}|}$$
where $v_{word3}$ is the third-level first candidate word vector, $v_{se3}$ is the third-level first candidate semantic item vector, $|v_{word3} \cap v_{se3}|$ is the number of elements in the intersection of the two vectors, and $|v_{word3} \cup v_{se3}|$ is the number of elements in their union. In this calculation each vector lists the numbers of the clicked categories. For example, when the search word includes "electric fan", 3 third-level categories are clicked and the third-level first candidate word vector is (1, 2, 3); when the search word includes "fan", 4 third-level categories are clicked and the third-level first candidate semantic item vector is (2, 3, 4, 5). The intersection (2, 3) has 2 elements and the union (1, 2, 3, 4, 5) has 5 elements, so according to the formula Sim_cid3_jacc = 2/5. Similarly, the Jaccard similarity between the first-level first candidate semantic item vector and the first-level first candidate word vector (denoted Sim_cid1_jacc) and the Jaccard similarity between the second-level first candidate semantic item vector and the second-level first candidate word vector (denoted Sim_cid2_jacc) can be calculated.
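The Jaccard calculation in this example treats each vector as the set of clicked category numbers. A minimal sketch under that reading follows; the function name `jaccard` and the choice of 0.0 for two empty sets are assumptions.

```python
def jaccard(categories_a, categories_b):
    """Jaccard similarity of two collections of clicked category numbers."""
    a, b = set(categories_a), set(categories_b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when neither side has any clicks
    return len(a & b) / len(union)


v_word3 = (1, 2, 3)      # third-level categories clicked for the candidate word
v_se3 = (2, 3, 4, 5)     # third-level categories clicked for the candidate semantic item
sim_cid3_jacc = jaccard(v_word3, v_se3)  # 2 shared categories out of 5 in total
```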
When the click volume of the original word or of the semantic item is less than or equal to the preset value (which may be set as required, for example, to 100), the first candidate semantic item vector of each preset level can instead be generated from the search volumes of the categories of that level when the search word includes the candidate semantic item, and the first candidate word vector of each preset level can be generated from the search volumes of the categories of that level when the search word includes the candidate word. The cosine similarity or Jaccard similarity between the first-level, second-level, and third-level first candidate semantic item vectors and the corresponding first candidate word vectors is then calculated with the formulas given above. The search volumes of the categories of each preset level may be obtained from the search results of the search word (query) captured online. Specifically, only a set number of search results may be selected, for example 300; over these 300 product titles, the search volumes of the categories of each preset level of the corresponding products are counted, both for the case where the search word includes the candidate semantic item and for the case where it includes the candidate word; the first candidate semantic item vector and the first candidate word vector of each preset level are then generated, and the cosine similarity or Jaccard similarity between them is calculated at each level.
Mode B3-2: generate, through a preset model, the semantic item vector corresponding to the candidate word (also called the second candidate semantic item vector, denoted v_word2vec_semantic) and a second candidate word vector (denoted v_word2vec_word); calculate the vector similarity of the second candidate semantic item vector and the second candidate word vector (denoted Sim_em(word, semantic)) to obtain the second vector similarity; and judge whether the second vector similarity is greater than a seventh threshold. If so, the candidate semantic item is similar to the candidate word; otherwise, the candidate semantic item is not similar to the candidate word.
The preset model may be a Word2vec model. Word2vec is a toolkit released by Google in 2013 for obtaining low-dimensional vector representations of words; the model can be used after being trained on the commodity titles in a stored commodity library.
Taking cosine similarity as the second vector similarity as an example, the second vector similarity is:
Sim_em(word,semantic)=cosine(v_word2vec_word,v_word2vec_semantic)
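Taking Jaccard similarity as the second vector similarity as an example, the second vector similarity is:
$$\mathrm{Sim\_em}(word, semantic) = \frac{|v_{word2vec\_word} \cap v_{word2vec\_semantic}|}{|v_{word2vec\_word} \cup v_{word2vec\_semantic}|}$$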
It should be noted that, in the embodiment of the present invention, any one of mode B1, mode B2, and mode B3 may be used to judge whether a candidate semantic item-word pair determined by one or more of modes A1 to A3 is correct, and within mode B3, either mode B3-1 or mode B3-2 may be used to judge whether the candidate semantic item in the pair is similar to the candidate word.
The thresholds used in modes B1 to B3 of the embodiment of the present invention (i.e., the first threshold, the second threshold, the third threshold, the fourth threshold, the fifth threshold, the sixth threshold, and the seventh threshold) may each be set to different specific values depending on whether the candidate semantic item and the candidate word in the candidate semantic item-word pair are product words or non-product words. Since the candidate semantic item and the candidate word (i.e., the original word of the candidate semantic item) may each be a product word or a non-product word, there are the following four cases:
the candidate semantic item is a product word, and the original word is also a product word (denoted p-p);
the candidate semantic item is a product word, and the original word is a non-product word (denoted p-np);
the candidate semantic item is a non-product word, and the original word is a product word (denoted np-p);
the candidate semantic item is a non-product word, and the original word is a non-product word (denoted np-np).
The semantic relation between product words (such as durian-durian shortbread) differs from that between non-product words (such as winter-winter), and so does the way users search with them (a product word can usually serve as a search word on its own, whereas a modifier usually cannot and is instead combined with a product word to form a search word). Therefore, any one of the three modes (mode B1 to mode B3) can be applied separately to each of the four cases to judge whether a candidate semantic item-word pair is correct, with different threshold values set for each case, so as to improve the filtering of commodities.
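One way to picture the per-case thresholds is a lookup keyed by the four case labels. The sketch below is only an illustration: the function name `pair_case`, the sample product word list, and every threshold value are hypothetical, since the source gives no concrete numbers.

```python
def pair_case(semantic_item, original_word, product_words):
    """Return the case label (p-p, p-np, np-p, np-np) of a candidate pair."""
    left = "p" if semantic_item in product_words else "np"
    right = "p" if original_word in product_words else "np"
    return f"{left}-{right}"


# Hypothetical values of the first threshold for each case (not from the source).
FIRST_THRESHOLD_BY_CASE = {
    "p-p": 0.10,
    "p-np": 0.20,
    "np-p": 0.20,
    "np-np": 0.30,
}

# Hypothetical stored product word list.
product_words = {"fan", "electric fan"}

case = pair_case("fan", "electric fan", product_words)
first_threshold = FIRST_THRESHOLD_BY_CASE[case]
```

The other six thresholds could be stored in the same way, one table per threshold or one record per case.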
In addition, product words and non-product words may be identified by looking them up in a stored product word list. In general, filtering irrelevant commodities based on product words works well, so in order to complete the product word list and identify product words accurately, some product words can be mined from the non-product words (the np-np case). The method is as follows: collect every single-character candidate semantic item together with the original words ending with that character, for example, {beans: {kidney bean, bean milling, gold bean, kidney bean, cowpea, adzuki bean, soy bean, fried bean, kidney bean, green bean, cool bean, magic bean, dried bean, soy bean, rice bean, colored bean, lima bean, flower bean, hyacinth bean, honey bean, fortune bean, bamboo shoot bean, soluble bean, picking bean, milk bean, tonka bean, kidney bean, baked bean}}; {machine: {simulator, cigarette making machine, engraving machine, walking machine, rotary cultivator, guillotine, gummed paper machine, deposit machine, chamfering machine, incubator, earth boring machine, young person machine, packer, fragrance machine, abrasive belt machine, umbrella bag machine, silk covering machine, raise boring machine, cold spray machine, trimmer, fat breaker, doll machine, football machine, sander, seeder, shield machine, photo printer, internal combustion engine, wax therapy machine, blank drawing machine}}; then count the number and proportion of product words among the original words ending with each character.
The original words are divided into 2-character words and words of 3 or more characters. For 2-character words, original words that are not product words can be searched and verified manually, and are added to a white list once verified to be product words. For words of 3 or more characters, a condition can be set, for example, that the number of product words reaches 10 and the proportion of product words reaches 25%, and the words meeting the condition are added to the white list. Words in the np-np case that end with a word in the white list are then selected as newly identified product words.
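The counting step for words of 3 or more characters can be sketched as below. This is one reading of the procedure, labeled as such: the white list is assumed to hold the ending characters that satisfy the condition, "reaches" is assumed to mean "at least", and the function names and placeholder words are hypothetical. Manual verification of 2-character words is outside the sketch.

```python
from collections import defaultdict


def mine_suffix_whitelist(original_words, product_words, min_count=10, min_ratio=0.25):
    """Return ending characters whose words contain enough known product words."""
    by_suffix = defaultdict(list)
    for word in original_words:
        if len(word) >= 3:  # 2-character words are verified manually instead
            by_suffix[word[-1]].append(word)
    whitelist = set()
    for suffix, words in by_suffix.items():
        product_count = sum(1 for w in words if w in product_words)
        if product_count >= min_count and product_count / len(words) >= min_ratio:
            whitelist.add(suffix)
    return whitelist


def newly_identified(np_np_words, whitelist):
    """Select np-np words that end with a whitelisted character."""
    return [w for w in np_np_words if w and w[-1] in whitelist]


# Placeholder data: 12 words ending in "M" (10 known product words), 4 ending in "Z".
m_words = [f"{chr(97 + i)}xM" for i in range(12)]
z_words = ["axZ", "bxZ", "cxZ", "dxZ"]
known_products = set(m_words[:10]) | {"axZ"}

whitelist = mine_suffix_whitelist(m_words + z_words, known_products)
new_product_words = newly_identified(["qqM", "qqZ"], whitelist)
```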
Step S104: correct the first index according to the judgment result to obtain a second index.
Specifically, correcting the first index according to the judgment result may include: when the candidate semantic item-word pair is correct, if the mapping relation between the candidate semantic item and the candidate word in the pair does not exist in the first index, adding the mapping relation to the first index; when the candidate semantic item-word pair is wrong, if the mapping relation between the candidate semantic item and the candidate word in the pair exists in the first index, deleting the mapping relation from the first index. In this way, improper semantic item-word mapping relations present in the first index, which was established from the semantic item-word pairs obtained in step S101, are deleted.
According to the embodiment of the invention, mode A1 determines some semantic item-word mapping relations from the query-commodity title data in the user's historical search feedback and establishes the corresponding candidate semantic item-word pairs. The mapping relations determined by mode A1 already exist in the first index. Whether such a mapping relation is correct can be judged by any one of mode B1, mode B2, and mode B3 (mode B3-1 or mode B3-2): if it is wrong, the corresponding semantic item-word mapping relation is deleted from the first index; if it is correct, the mapping relation is retained in the first index. Mode A2 searches the search tracks containing a new word for search words that have an inclusion relation with the new word, and determines candidate semantic item-word pairs from the new word and those search words. The word in a candidate semantic item-word pair determined by mode A2 is a new word, and the first index contains no mapping relation between the new word and the corresponding semantic item. Whether the mapping relation of such a candidate pair is correct can be judged by any one of mode B1, mode B2, and mode B3 (mode B3-1 or mode B3-2): if it is correct, the mapping relation is added to the first index; if it is wrong, it is not added.
Mode A3 determines candidate semantic item-word pairs from the new word and those search words, among the search words corresponding to a specific text sequence, that are substrings of the new word, where the specific text sequence is a text sequence clicked by the user that includes the new word. The word in a candidate semantic item-word pair determined by mode A3 is also a new word, and the first index contains no mapping relation between the new word and the corresponding semantic item. Whether the mapping relation of such a candidate pair is correct can be judged by any one of mode B1, mode B2, and mode B3 (mode B3-1 or mode B3-2): if it is correct, the mapping relation is added to the first index; if it is wrong, it is not added. When any one of mode B1, mode B2, and mode B3 (mode B3-1 or mode B3-2) is used to judge whether the mapping relation of a candidate semantic item-word pair determined by mode A2 and/or mode A3 is correct, the specific values of the second threshold, the third threshold, the fourth threshold, the fifth threshold, the sixth threshold, and the seventh threshold may be set according to the confidence required of the semantic items of new words.
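The correction rule of step S104 can be summarized in a short sketch. The data layout is an assumption made for illustration: the first index is modeled as a dictionary from each word to the set of its semantic items, and the judgment results as a dictionary from a (semantic item, word) pair to a boolean. The function name `correct_index` is hypothetical.

```python
def correct_index(first_index, judged_pairs):
    """Return a second index: add correct missing pairs, delete wrong existing ones."""
    # Copy so that the first index itself is left unchanged.
    second_index = {word: set(items) for word, items in first_index.items()}
    for (semantic_item, word), is_correct in judged_pairs.items():
        items = second_index.setdefault(word, set())
        if is_correct:
            items.add(semantic_item)       # no effect if the mapping already exists
        else:
            items.discard(semantic_item)   # no effect if the mapping is absent
        if not items:
            del second_index[word]
    return second_index


first_index = {"electric fan": {"fan", "electric"}}
judged_pairs = {
    ("fan", "electric fan"): True,            # existing and correct: retained
    ("electric", "electric fan"): False,      # existing and wrong: deleted
    ("wood", "chicken wing wood"): True,      # new word, correct: added
    ("chicken wing", "chicken wing wood"): False,  # new word, wrong: not added
}
second_index = correct_index(first_index, judged_pairs)
```

In this sketch the pairs from mode A1 and those from modes A2 and A3 go through the same function, because the add and delete rules only depend on whether the mapping already exists in the first index.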
On one hand, the embodiment of the invention handles fixed collocations, so that an index entry can be built directly from such words, for example "one-piece dress", "television set", and "non-slip mat", without considering how users search at indexing time. By establishing the semantic item-word pairs (one-piece dress-one-piece dress), (TV-set), (antiskid/pad-slipmat), when the user searches for "one-piece dress longuette", "Xiaomi TV", or "bathroom antiskid floor mat", the corresponding commodities titled "one-piece dress long style", "Xiaomi intelligent cloud TV", "slipmat bathroom article", and so on can still be recalled. On the other hand, adding the mapping relations between new words and their semantic items to the first index improves precision in the recall stage. For example, for new words such as "chicken wing wood", "bathroom cabinet", and "mobile phone protective case", once the corresponding semantic items are added, there is no longer any need to segment the words too finely (i.e., cut them into small pieces) in the word segmentation stage merely to satisfy users' search needs, which avoids recalling irrelevant commodities. For example, once the semantic item "wood" is added to "chicken wing wood", a user searching for "wooden chopsticks" can recall commodities whose titles contain "chicken wing wood". Without the semantic item "wood", "chicken wing wood" would have to be segmented into "chicken wing" and "wood" in the word segmentation stage to ensure that such commodities are recalled for "wooden chopsticks", and this may cause a user searching for "chicken wing" to recall irrelevant commodities such as chopsticks, tables, and tea trays.
Similarly, as another example, if "denim blue" has no corresponding semantic item "blue", it has to be segmented into "denim" and "blue" in the word segmentation stage, and such segmentation may cause a user searching for "denim coat" to recall a denim-blue coat that is not made of denim. The semantic item is a compromise in the index between segmenting a word and keeping it whole; it plays an important role in an index whose number of commodities grows quickly, and it is a bridge between the diverse search words of users and the standardized commodity names of merchants. According to the embodiment of the invention, the mapping relations between semantic items and words in the first index are corrected periodically, and mapping relations to semantic items are added for the new words that keep appearing, so that the precision of commodity recall is improved while the commodity recall rate is maintained.
It should be noted that the embodiment of the present invention may delete erroneous semantic item-word mapping relations from the first index periodically (with a first time period) and add mapping relations between new words and semantic items to the first index periodically (with a second time period), where the first time period and the second time period may be the same or different. For example, candidate semantic item-word pairs may be determined by mode A1 every three months, and the erroneous semantic item-word mapping relations deleted when the determined pairs are judged to be wrong, while every 6 months the correct mapping relations among those between new words and their candidate semantic items are added to the first index. Alternatively, every three months the erroneous semantic item-word mapping relations are deleted from the first index and, at the same time, the correct semantic item-word mapping relations of new words are added to the first index. The embodiment of the invention uses the user's historical search data, through continuous user feedback, to find improper semantic item-word mapping relations in the first index, deletes them periodically, and periodically adds the mapping relations between new words and semantic items to the first index. By repeating this update, the first index is continuously corrected to obtain the second index, which adapts well to users' search habits and ensures the accuracy and timeliness of the index.
Fig. 2 is a schematic diagram of the main modules of an apparatus for establishing an index according to an embodiment of the present invention.
As shown in Fig. 2, the apparatus 200 for establishing an index according to an embodiment of the present invention mainly includes: a word pair generation module 201, an index establishing module 202, a word pair judgment module 203, and an index correction module 204.
The word pair generation module 201 is configured to divide an input text sequence into a plurality of words, and obtain a plurality of semantic item-word pairs according to the plurality of words and substrings of the plurality of words.
The index establishing module 202 is configured to establish a first index according to the plurality of semantic item-word pairs.
The word pair judgment module 203 is configured to determine candidate semantic item-word pairs according to the user's historical search data and judge whether the candidate semantic item-word pairs are correct.
The index correction module 204 is configured to correct the first index according to the judgment result to obtain a second index.
The word pair judgment module 203 and the index correction module 204 may be implemented as a single independent module, i.e., a correction module, which is mainly configured to determine, from the historical search data, the mapping relation between a candidate word and the semantic item corresponding to the candidate word, so as to correct the first index according to that mapping relation and obtain the second index. The mapping relation between the candidate word and its corresponding semantic item is represented by a semantic item-word pair. The correction module may include the word pair judgment module 203 and the index correction module 204.
According to an embodiment of the present invention, the historical search data may include search words and the corresponding text sequences clicked by the user, and the word pair judgment module 203 may include a first determining module configured to: segment the search words and the text sequences clicked by the user to obtain search word segments and the corresponding text sequence segments; and determine candidate semantic item-word pairs from those search word segments and text sequence segments that have an inclusion relation with each other.
According to another embodiment of the present invention, the historical search data may include search words, the corresponding text sequences clicked by the user, and the search tracks formed by the search words, and the word pair judgment module 203 may include a second determining module configured to: acquire new words added to the first index in advance; search the search tracks containing a new word for search words that have an inclusion relation with the new word, and determine candidate semantic item-word pairs from the new word and the search words that have an inclusion relation with the new word.
According to another embodiment of the present invention, the historical search data may include search words and the corresponding text sequences clicked by the user, and the word pair judgment module 203 may include a third determining module configured to: acquire new words added to the first index in advance; and determine candidate semantic item-word pairs from the new word and those search words, among the search words corresponding to a specific text sequence, that are substrings of the new word, where the specific text sequence is a text sequence clicked by the user that includes the new word.
According to another embodiment of the present invention, the word pair judgment module 203 may include any one or more of the first determining module, the second determining module, and the third determining module.
According to an embodiment of the present invention, the word pair judgment module 203 may include a first judging module configured to: judge whether the total usage ratio of the candidate semantic item in the candidate semantic item-word pair exceeds a first threshold; if so, the candidate semantic item-word pair is correct; otherwise, the candidate semantic item-word pair is wrong. The total usage ratio of the candidate semantic item is the ratio of N1 to N2, where N1 is the total click volume, among all text sequences corresponding to the search words that include the candidate semantic item, of the text sequences related to the candidate semantic item-word pair, that is, the click volume recalled through the candidate semantic item, and N2 is the total click volume of all text sequences corresponding to the search words that include the candidate semantic item.
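The ratio N1/N2 used by the first judging module can be illustrated as follows. The record format, a list of (click volume, recalled through the candidate semantic item) tuples, and both function names are assumptions of this sketch, as is the threshold value in the usage lines.

```python
def total_usage_ratio(click_records):
    """Compute N1 / N2 from (click_volume, recalled_via_semantic_item) records.

    Each record stands for one clicked text sequence of a search word that
    includes the candidate semantic item (hypothetical input format).
    """
    n1 = sum(clicks for clicks, via_item in click_records if via_item)
    n2 = sum(clicks for clicks, _ in click_records)
    return n1 / n2 if n2 else 0.0


def pair_is_correct_b1(click_records, first_threshold):
    """The pair is judged correct when the total usage ratio exceeds the threshold."""
    return total_usage_ratio(click_records) > first_threshold


records = [(30, True), (50, False), (20, True)]
ratio = total_usage_ratio(records)            # N1 = 50, N2 = 100
correct = pair_is_correct_b1(records, 0.3)    # 0.3 is a hypothetical first threshold
```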
According to another embodiment of the present invention, the word pair judgment module 203 may include a second judging module configured to: calculate the candidate semantic item usage ratio corresponding to each of a preset number of search words with the highest usage rate among the search words that include the candidate semantic item of the candidate semantic item-word pair; and judge whether at least one of these search words satisfies all of the following conditions: the candidate semantic item usage ratio corresponding to the search word is greater than a second threshold, the search volume of the search word is greater than a third threshold, the click volume of the text sequences corresponding to the search word is greater than a fourth threshold, and the click-through rate of the text sequences corresponding to the search word is greater than a fifth threshold. If at least one such search word exists, the candidate semantic item-word pair is correct; otherwise, the candidate semantic item-word pair is wrong. The candidate semantic item usage ratio corresponding to each search word is the ratio of N3 to N4, where N3 is the total click volume, among the text sequences corresponding to the search word, of the text sequences related to the candidate semantic item-word pair, that is, the click volume recalled through the candidate semantic item, and N4 is the total click volume of the text sequences corresponding to the search word.
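The check performed by the second judging module amounts to asking whether any of the most-used search words passes all four thresholds at once. A minimal sketch follows; the `SearchWordStats` record, its field names, and the threshold values in the usage lines are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SearchWordStats:
    usage_ratio: float         # N3 / N4 for this search word
    search_volume: int         # how often the search word was searched
    click_volume: int          # clicks on its text sequences
    click_through_rate: float  # click-through rate of its text sequences


def pair_is_correct_b2(top_search_words, second, third, fourth, fifth):
    """Correct if at least one search word exceeds all four thresholds."""
    return any(
        s.usage_ratio > second
        and s.search_volume > third
        and s.click_volume > fourth
        and s.click_through_rate > fifth
        for s in top_search_words
    )


top_search_words = [
    SearchWordStats(0.6, 5000, 40, 0.02),   # fails the click volume threshold
    SearchWordStats(0.4, 2000, 300, 0.10),  # passes every threshold below
]
correct = pair_is_correct_b2(top_search_words, second=0.3, third=1000, fourth=100, fifth=0.05)
```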
According to another embodiment of the present invention, the word pair judgment module 203 may include a third judging module configured to: judge whether the candidate semantic item in the candidate semantic item-word pair is similar to the candidate word; if so, the candidate semantic item-word pair is correct; otherwise, the candidate semantic item-word pair is wrong.
The third judging module may include a first similarity judging module configured to: acquire user search data within a preset historical time period, where the user search data includes search words and the user's click volumes or search volumes, for the text sequences found, over the categories of each preset level; generate a first candidate semantic item vector of each preset level from the click volumes or search volumes of the categories of that level when the search word includes the candidate semantic item, and generate a first candidate word vector of each preset level from the click volumes or search volumes of the categories of that level when the search word includes the candidate word; calculate the vector similarity between the first candidate semantic item vector and the first candidate word vector of each preset level to obtain a plurality of first vector similarities, one for each preset level; and judge whether each first vector similarity is greater than the preset sixth threshold of the corresponding preset level. If so, the candidate semantic item is similar to the candidate word; otherwise, the candidate semantic item is not similar to the candidate word.
Alternatively, the third judging module may include a second similarity judging module configured to: generate a second candidate semantic item vector and a second candidate word vector through a preset model; calculate the vector similarity of the second candidate semantic item vector and the second candidate word vector to obtain a second vector similarity; and judge whether the second vector similarity is greater than a seventh threshold. If so, the candidate semantic item is similar to the candidate word; otherwise, the candidate semantic item is not similar to the candidate word.
The vector similarity may be cosine similarity or Jaccard similarity.
The index correction module 204 may, when the candidate semantic item-word pair is correct and the mapping relation between the candidate semantic item and the candidate word in the pair does not exist in the first index, add the mapping relation to the first index; and, when the candidate semantic item-word pair is wrong and the mapping relation between the candidate semantic item and the candidate word in the pair exists in the first index, delete the mapping relation from the first index.
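The final decision of the first similarity judging module, in which every preset level must pass its own sixth threshold, can be sketched as below. The level labels, the function name, and all numeric values are hypothetical; the per-level similarities are assumed to have been computed already with cosine or Jaccard similarity.

```python
def similar_at_all_levels(first_similarities, sixth_thresholds):
    """Return True only if every level's similarity exceeds that level's threshold.

    Both arguments map a level label to a number (assumed layout for this sketch).
    """
    if not first_similarities or first_similarities.keys() != sixth_thresholds.keys():
        raise ValueError("each preset level needs exactly one similarity and one threshold")
    return all(
        first_similarities[level] > sixth_thresholds[level]
        for level in first_similarities
    )


sixth_thresholds = {"cid1": 0.5, "cid2": 0.5, "cid3": 0.5}
similar = similar_at_all_levels({"cid1": 0.8, "cid2": 0.7, "cid3": 0.6}, sixth_thresholds)
not_similar = similar_at_all_levels({"cid1": 0.8, "cid2": 0.7, "cid3": 0.4}, sixth_thresholds)
```

The second similarity judging module reduces to a single comparison of the second vector similarity against the seventh threshold, so it needs no per-level loop.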
In addition, the detailed implementation of the index creating apparatus in the embodiment of the present invention is already described in detail in the above index creating method, and therefore, the repeated content is not described again here.
Fig. 3 illustrates an exemplary system architecture 300 to which the method of indexing or the apparatus for indexing may be applied, according to an embodiment of the present invention.
As shown in fig. 3, the system architecture 300 may include terminal devices 301, 302, 303, a network 304, and a server 305. The network 304 serves as a medium for providing communication links between the terminal devices 301, 302, 303 and the server 305. Network 304 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal device 301, 302, 303 to interact with the server 305 via the network 304 to receive or send messages or the like. The terminal devices 301, 302, 303 may have various communication client applications installed thereon, such as shopping applications, web browser applications, search applications, instant messaging tools, mailbox clients, social platform software, and the like.
The terminal devices 301, 302, 303 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 305 may be a server providing various services, such as a background management server providing support for shopping websites browsed by the user using the terminal devices 301, 302, 303. The background management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (e.g., product information) to the terminal device.
It should be noted that the method for creating an index provided by the embodiment of the present invention is generally executed by the server 305, and accordingly, the apparatus for creating an index is generally disposed in the server 305.
It should be understood that the number of terminal devices, networks, and servers in fig. 3 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 4, a block diagram of a computer system 400 suitable for use in implementing a server according to embodiments of the present application is shown. The server shown in fig. 4 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 4, the computer system 400 includes a Central Processing Unit (CPU)601 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data necessary for the operation of the system 400 are also stored. The CPU 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a display device such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card, a modem, or the like. The communication section 409 performs communication processing via a network such as the internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 410 as necessary, so that a computer program read out therefrom is installed into the storage section 408 as necessary.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the main step diagram may be implemented as computer software programs. For example, the disclosed embodiments of the invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the main step diagram. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411. When the computer program is executed by the Central Processing Unit (CPU) 401, the above-described functions defined in the system of the present application are performed.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The main step diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the main step diagrams or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or main step diagrams, and combinations of blocks in the block diagrams or main step diagrams, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or by combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented in software or in hardware. The described modules may also be provided in a processor, which may be described, for example, as: a processor comprising a word pair generation module 201, an index establishing module 202, a word pair judgment module 203 and an index correction module 204. In some cases, the names of these modules do not limit the modules themselves; for example, the word pair generation module 201 may also be described as "a module for dividing an input text sequence into a plurality of words and obtaining a plurality of semantic item-word pairs from the plurality of words and substrings of the plurality of words".
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer readable medium carries one or more programs which, when executed by a device, cause the device to perform: dividing an input text sequence into a plurality of words, and obtaining a plurality of semantic item-word pairs according to the plurality of words and substrings of the plurality of words; establishing a first index according to the plurality of semantic item-word pairs; determining candidate semantic item-word pairs according to historical search data of a user, and judging whether the candidate semantic item-word pairs are correct or not; and correcting the first index according to the judgment result to obtain a second index.
According to the technical scheme of the embodiment of the invention, an input text sequence is divided into a plurality of words, and a plurality of semantic item-word pairs are obtained from the plurality of words and the substrings of the plurality of words. A first index is established according to the plurality of semantic item-word pairs. Candidate semantic item-word pairs are then determined according to historical search data of a user and judged to be correct or not, and the first index is corrected according to the judgment result to obtain a second index. As a result, in the commodity index recall stage, the recall rate is maintained so that as many commodities related to the user's search terms as possible are recalled, while the accuracy is improved so that as many irrelevant commodities as possible are filtered out.
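The scheme summarized above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation of the embodiment: word segmentation is assumed to have been done already, the minimum substring length `min_len`, the dictionary layout of the index, and the function names `semantic_item_word_pairs`, `build_first_index` and `correct_index` are all hypothetical, and the correctness judgment is passed in as ready-made booleans.

```python
def semantic_item_word_pairs(words, min_len=2):
    """Return (semantic_item, word) pairs for each word and its substrings.

    Assumption: a semantic item is the word itself or any contiguous
    substring of at least `min_len` characters (min_len is hypothetical).
    """
    pairs = set()
    for word in words:
        pairs.add((word, word))
        for i in range(len(word)):
            for j in range(i + min_len, len(word) + 1):
                pairs.add((word[i:j], word))
    return pairs


def build_first_index(pairs):
    """First index: semantic item -> set of words it recalls."""
    index = {}
    for item, word in pairs:
        index.setdefault(item, set()).add(word)
    return index


def correct_index(first_index, judged_pairs):
    """Second index: add correct pairs that are missing, delete wrong pairs that exist.

    judged_pairs maps (semantic_item, word) -> True (correct) or False (wrong).
    The first index is left unchanged; a corrected copy is returned.
    """
    second = {item: set(words) for item, words in first_index.items()}
    for (item, word), is_correct in judged_pairs.items():
        if is_correct:
            second.setdefault(item, set()).add(word)
        else:
            words = second.get(item)
            if words is not None:
                words.discard(word)
                if not words:
                    del second[item]
    return second
```

With the made-up word "smartphone", the first index maps the substring "art" to "smartphone", which is the kind of irrelevant recall the correction step is meant to remove; calling `correct_index(first, {("art", "smartphone"): False, ("phone", "headphone"): True})` drops that pair and adds the missing pair for the hypothetical new word "headphone".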
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (22)

1. A method of creating an index, comprising:
dividing an input text sequence into a plurality of words, and obtaining the mapping relation between the plurality of words and the semantic items corresponding to the words according to the plurality of words and the substrings of the plurality of words;
establishing a first index according to the mapping relation between the words and the semantic items corresponding to the words;
determining a mapping relation between a candidate word and a semantic item corresponding to the candidate word through historical search data, and correcting the first index according to the mapping relation between the candidate word and the semantic item corresponding to the candidate word to obtain a second index; wherein, the correcting the first index according to the mapping relationship between the candidate word and the semantic item corresponding to the candidate word includes: when the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, if the mapping relation between the candidate word and the semantic item corresponding to the candidate word does not exist in the first index, adding the mapping relation to the first index; and when the mapping relation between the candidate word and the semantic item corresponding to the candidate word is wrong, if the mapping relation between the candidate word and the semantic item corresponding to the candidate word exists in the first index, deleting the mapping relation from the first index.
2. The method of claim 1, wherein the historical search data comprises search terms and corresponding text sequences clicked by the user,
the step of determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word through historical search data comprises the following steps:
segmenting the search words and the text sequence clicked by the user to obtain search word segments and corresponding text sequence segments;
and determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word according to the search word segmentation and the text sequence segmentation which have the inclusion relation in the search word segmentation and the corresponding text sequence segmentation.
3. The method of claim 1, wherein the historical search data comprises search terms and corresponding text sequences clicked by the user, and search tracks formed by the search terms,
the step of determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word through historical search data comprises the following steps:
acquiring new words which are added to the first index in advance;
searching for a search word having an inclusion relation with the new word from a search track containing the new word, and determining a mapping relation between the candidate word and a semantic item corresponding to the candidate word according to the new word and the search word having the inclusion relation with the new word.
4. The method of claim 1, wherein the historical search data comprises search terms and corresponding text sequences clicked by the user,
the step of determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word through historical search data comprises the following steps:
acquiring new words which are added to the first index in advance;
and determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word according to the new word and the search word of the substring of the new word in the search words corresponding to the specific text sequence, wherein the specific text sequence is the text sequence clicked by the user and comprises the new word.
5. The method of claim 1, wherein the step of determining the mapping relationship between the candidate word and the semantic item corresponding to the candidate word from the historical search data is followed by:
judging whether the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises the following steps:
judging whether the total usage proportion of semantic items corresponding to the candidate words in the mapping relation between the candidate words and the semantic items corresponding to the candidate words exceeds a first threshold value, if so, judging that the mapping relation between the candidate words and the semantic items corresponding to the candidate words is correct, otherwise, judging that the mapping relation between the candidate words and the semantic items corresponding to the candidate words is wrong,
the total usage proportion of the semantic item corresponding to the candidate word is the ratio of N1 to N2, where N1 is, over all search words that include the semantic item corresponding to the candidate word, the sum of the clicks on text sequences that relate to the mapping relation between the candidate word and the semantic item corresponding to the candidate word, that is, the clicks recalled through the semantic item corresponding to the candidate word, and N2 is the sum of the clicks on all text sequences corresponding to the search words that include the semantic item corresponding to the candidate word.
6. The method of claim 1, wherein the step of determining the mapping relationship between the candidate word and the semantic item corresponding to the candidate word from the historical search data is followed by:
judging whether the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises the following steps:
calculating, for each of a preset number of search words with the highest utilization rate among all search words that include the semantic item corresponding to the candidate word in the mapping relation between the candidate word and the semantic item corresponding to the candidate word, the usage proportion of the semantic item corresponding to the candidate word;
judging whether at least one search word in the preset number of search words with the maximum utilization rate meets the following conditions: the usage proportion of semantic items corresponding to the candidate words corresponding to the search word is greater than a second threshold, the search quantity of the search word is greater than a third threshold, the click quantity of a text sequence corresponding to the search word is greater than a fourth threshold, and the click rate of the text sequence corresponding to the search word is greater than a fifth threshold;
if the at least one search word exists, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, otherwise, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is wrong,
the usage proportion of the semantic item corresponding to the candidate word for each search word is the ratio of N3 to N4, where N3 is the sum of the clicks on text sequences that correspond to the search word and relate to the mapping relation between the candidate word and the semantic item corresponding to the candidate word, that is, the clicks recalled through the semantic item corresponding to the candidate word, and N4 is the sum of the clicks on all text sequences corresponding to the search word.
7. The method of claim 1, wherein the step of determining the mapping relationship between the candidate word and the semantic item corresponding to the candidate word from the historical search data is followed by:
judging whether the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises the following steps:
and judging whether the semantic item corresponding to the candidate word in the mapping relation between the candidate word and the semantic item corresponding to the candidate word is similar to the candidate word, if the semantic item corresponding to the candidate word is similar to the candidate word, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, otherwise, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is wrong.
8. The method according to claim 7, wherein the step of determining whether the semantic item corresponding to the candidate word in the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is similar to the candidate word comprises:
acquiring user search data in a preset historical time period, wherein the user search data comprises search words and click quantity or search quantity of a plurality of categories of preset levels of a searched text sequence by a user;
generating semantic item vectors corresponding to the first candidate words of the preset levels according to the click rate or search volume of the multiple classes of the preset levels when the search words comprise semantic items corresponding to the candidate words, and generating the first candidate word vectors of the preset levels according to the click rate or search volume of the multiple classes of the preset levels when the search words comprise the candidate words;
respectively calculating vector similarity between semantic item vectors corresponding to the first candidate words of the preset levels and the first candidate word vectors to obtain a plurality of first vector similarities respectively corresponding to the preset levels;
and judging whether the similarity of each first vector is greater than a preset sixth threshold corresponding to a corresponding preset level, if so, the semantic item corresponding to the candidate word is similar to the candidate word, and otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
9. The method according to claim 7, wherein the step of determining whether the semantic item corresponding to the candidate word in the mapping relationship between the candidate word and the semantic item corresponding to the candidate word is similar to the candidate word comprises:
generating semantic item vectors and second candidate word vectors corresponding to the second candidate words through a preset model;
calculating the vector similarity of the semantic item vector corresponding to the second candidate word and the second candidate word vector to obtain a second vector similarity;
and judging whether the second vector similarity is larger than a seventh threshold, if so, the semantic item corresponding to the candidate word is similar to the candidate word, otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
10. The method according to claim 8 or 9, wherein the vector similarity is a cosine similarity or a Jaccard similarity.
11. An apparatus for creating an index, comprising:
the word pair generation module is used for dividing an input text sequence into a plurality of words and obtaining the mapping relation between the plurality of words and the semantic items corresponding to the words according to the plurality of words and the substrings of the plurality of words;
the index establishing module is used for establishing a first index according to the mapping relation between the words and the semantic items corresponding to the words;
the correction module is used for determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word through historical search data, and correcting the first index according to the mapping relation between the candidate word and the semantic item corresponding to the candidate word to obtain a second index; wherein, the correcting the first index according to the mapping relationship between the candidate word and the semantic item corresponding to the candidate word includes: when the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, if the mapping relation between the candidate word and the semantic item corresponding to the candidate word does not exist in the first index, adding the mapping relation to the first index; and when the mapping relation between the candidate word and the semantic item corresponding to the candidate word is wrong, if the mapping relation between the candidate word and the semantic item corresponding to the candidate word exists in the first index, deleting the mapping relation from the first index.
12. The apparatus of claim 11, wherein the historical search data comprises search terms and corresponding text sequences clicked by a user,
the correction module comprises a first determination module configured to:
segmenting the search words and the text sequence clicked by the user to obtain search word segments and corresponding text sequence segments;
and determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word according to the search word segmentation and the text sequence segmentation which have the inclusion relation in the search word segmentation and the corresponding text sequence segmentation.
13. The apparatus of claim 11, wherein the historical search data comprises search terms and corresponding text sequences clicked by a user, and search tracks formed by the search terms,
the correction module comprises a second determination module configured to:
acquiring new words which are added to the first index in advance;
searching for a search word having an inclusion relation with the new word from a search track containing the new word, and determining a mapping relation between the candidate word and a semantic item corresponding to the candidate word according to the new word and the search word having the inclusion relation with the new word.
14. The apparatus of claim 11, wherein the historical search data comprises search terms and corresponding text sequences clicked by a user,
the correction module comprises a third determination module configured to:
acquiring new words which are added to the first index in advance;
and determining the mapping relation between the candidate word and the semantic item corresponding to the candidate word according to the new word and the search word of the substring of the new word in the search words corresponding to the specific text sequence, wherein the specific text sequence is the text sequence clicked by the user and comprises the new word.
15. The apparatus of claim 11, wherein the correction module comprises a first determining module configured to:
judging whether the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises the following steps:
judging whether the total usage proportion of semantic items corresponding to the candidate words in the mapping relation between the candidate words and the semantic items corresponding to the candidate words exceeds a first threshold value, if so, judging that the mapping relation between the candidate words and the semantic items corresponding to the candidate words is correct, otherwise, judging that the mapping relation between the candidate words and the semantic items corresponding to the candidate words is wrong,
the total usage proportion of the semantic item corresponding to the candidate word is the ratio of N1 to N2, where N1 is, over all search words that include the semantic item corresponding to the candidate word, the sum of the clicks on text sequences that relate to the mapping relation between the candidate word and the semantic item corresponding to the candidate word, that is, the clicks recalled through the semantic item corresponding to the candidate word, and N2 is the sum of the clicks on all text sequences corresponding to the search words that include the semantic item corresponding to the candidate word.
16. The apparatus of claim 11, wherein the correction module comprises a second determining module configured to:
judging whether the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises the following steps:
calculating, for each of a preset number of search words with the highest utilization rate among all search words that include the semantic item corresponding to the candidate word in the mapping relation between the candidate word and the semantic item corresponding to the candidate word, the usage proportion of the semantic item corresponding to the candidate word;
judging whether at least one search word in the preset number of search words with the maximum utilization rate meets the following conditions: the usage proportion of semantic items corresponding to the candidate words corresponding to the search word is greater than a second threshold, the search quantity of the search word is greater than a third threshold, the click quantity of a text sequence corresponding to the search word is greater than a fourth threshold, and the click rate of the text sequence corresponding to the search word is greater than a fifth threshold;
if the at least one search word exists, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, otherwise, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is wrong,
the usage proportion of the semantic item corresponding to the candidate word for each search word is the ratio of N3 to N4, where N3 is the sum of the clicks on text sequences that correspond to the search word and relate to the mapping relation between the candidate word and the semantic item corresponding to the candidate word, that is, the clicks recalled through the semantic item corresponding to the candidate word, and N4 is the sum of the clicks on all text sequences corresponding to the search word.
17. The apparatus of claim 11, wherein the correction module comprises a third determining module configured to:
judging whether the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, wherein the judging step comprises the following steps:
and judging whether the semantic item corresponding to the candidate word in the mapping relation between the candidate word and the semantic item corresponding to the candidate word is similar to the candidate word, if the semantic item corresponding to the candidate word is similar to the candidate word, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is correct, otherwise, the mapping relation between the candidate word and the semantic item corresponding to the candidate word is wrong.
18. The apparatus of claim 17, wherein the third determining module comprises a first similarity determining module configured to:
acquiring user search data in a preset historical time period, wherein the user search data comprises search words and click quantity or search quantity of a plurality of categories of preset levels of a searched text sequence by a user;
generating semantic item vectors corresponding to the first candidate words of the preset levels according to the click rate or search volume of the multiple classes of the preset levels when the search words comprise semantic items corresponding to the candidate words, and generating the first candidate word vectors of the preset levels according to the click rate or search volume of the multiple classes of the preset levels when the search words comprise the candidate words;
respectively calculating vector similarity between semantic item vectors corresponding to the first candidate words of the preset levels and the first candidate word vectors to obtain a plurality of first vector similarities respectively corresponding to the preset levels;
and judging whether the similarity of each first vector is greater than a preset sixth threshold corresponding to a corresponding preset level, if so, the semantic item corresponding to the candidate word is similar to the candidate word, and otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
19. The apparatus of claim 17, wherein the third determining module comprises a second similarity determining module configured to:
generating semantic item vectors and second candidate word vectors corresponding to the second candidate words through a preset model;
calculating the vector similarity of the semantic item vector corresponding to the second candidate word and the second candidate word vector to obtain a second vector similarity;
and judging whether the second vector similarity is larger than a seventh threshold, if so, the semantic item corresponding to the candidate word is similar to the candidate word, otherwise, the semantic item corresponding to the candidate word is not similar to the candidate word.
20. The apparatus of claim 18 or 19, wherein the vector similarity is a cosine similarity or a Jaccard similarity.
21. A server, comprising:
one or more processors;
a memory for storing one or more programs,
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-10.
22. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-10.
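As a rough illustration of two of the correctness checks recited in the claims above, the following Python sketch computes the total usage proportion N1/N2 of claim 5 and the two vector similarities named in claim 10. It is a sketch under stated assumptions rather than the claimed implementation: the click-log record layout `(search_word, clicks, recalled_via_pair)`, the function names, and the reading of Jaccard similarity as set overlap between the non-zero categories of two vectors are all hypothetical choices made for the example.

```python
import math


def total_usage_proportion(click_log, semantic_item):
    """Ratio N1 / N2 over all search words that include `semantic_item`.

    click_log: iterable of (search_word, clicks, recalled_via_pair) records,
    a hypothetical layout. N2 sums all clicks for matching search words;
    N1 sums only the clicks recalled through the judged mapping.
    """
    n1 = 0
    n2 = 0
    for search_word, clicks, recalled_via_pair in click_log:
        if semantic_item in search_word:
            n2 += clicks
            if recalled_via_pair:
                n1 += clicks
    return n1 / n2 if n2 else 0.0


def mapping_is_correct(proportion, first_threshold):
    """The mapping is judged correct when the proportion exceeds the first threshold."""
    return proportion > first_threshold


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Jaccard similarity of two collections treated as sets (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For instance, with the made-up log `[("phone case", 80, True), ("phone", 20, False), ("laptop", 50, True)]` and the semantic item "phone", N1 is 80 and N2 is 100, so the proportion is 0.8 and the mapping would be judged correct for any first threshold below 0.8.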
CN201710946410.9A 2017-10-12 2017-10-12 Method and device for establishing index Active CN110019646B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710946410.9A CN110019646B (en) 2017-10-12 2017-10-12 Method and device for establishing index

Publications (2)

Publication Number Publication Date
CN110019646A (en) 2019-07-16
CN110019646B (en) 2021-10-15

Family

ID=67186485

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710946410.9A Active CN110019646B (en) 2017-10-12 2017-10-12 Method and device for establishing index

Country Status (1)

Country Link
CN (1) CN110019646B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110489417B (en) * 2019-07-25 2023-03-28 深圳壹账通智能科技有限公司 Data processing method and related equipment
CN111061830B (en) * 2019-12-27 2023-12-05 深圳市元征科技股份有限公司 Method and device for processing automobile repair data
CN117688147B (en) * 2023-11-16 2024-07-12 深圳市灵智数字科技有限公司 Commodity information searching method and device and terminal equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102419778A (en) * 2012-01-09 2012-04-18 中国科学院软件研究所 Information searching method for discovering and clustering sub-topics of query statement
WO2014123631A1 (en) * 2013-02-05 2014-08-14 Visa International Service Association Speaker verification
CN106776543A (en) * 2016-11-23 2017-05-31 上海智臻智能网络科技股份有限公司 New word discovery method, device, terminal and server

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6901411B2 (en) * 2002-02-11 2005-05-31 Microsoft Corporation Statistical bigram correlation model for image retrieval
CN101241512B (en) * 2008-03-10 2012-01-11 北京搜狗科技发展有限公司 Search method for redefining enquiry word and device therefor
CN102930022B (en) * 2012-10-31 2015-11-25 中国运载火箭技术研究院 User oriented information search engine system and method
CN104636334A (en) * 2013-11-06 2015-05-20 阿里巴巴集团控股有限公司 Keyword recommending method and device
CN104933081B (en) * 2014-03-21 2018-06-29 阿里巴巴集团控股有限公司 Providing method and device are suggested in a kind of search
US9626430B2 (en) * 2014-12-22 2017-04-18 Ebay Inc. Systems and methods for data mining and automated generation of search query rewrites
CN106610972A (en) * 2015-10-21 2017-05-03 阿里巴巴集团控股有限公司 Query rewriting method and apparatus
CN107168953A (en) * 2017-05-16 2017-09-15 电子科技大学 The new word discovery method and system that word-based vector is characterized in mass text

Non-Patent Citations (2)

Title
Improving Chinese Word Segmentation by Adopting Self-Organized Maps of Character N-gram;Chongyang Zhang et al.;《CIPS-SIGHAN Joint Conference on Chinese Language Processing》;20101231;1-8 *
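Research and Implementation of Semantic Fusion for Digital Museums Based on Linked Data;周育彬;《China Master's Theses Full-text Database, Information Science and Technology》;20150615;I139-106 *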

Also Published As

Publication number Publication date
CN110019646A (en) 2019-07-16

Similar Documents

Publication Publication Date Title
US11216861B2 (en) Color based social networking recommendations
US10186054B2 (en) Automatic image-based recommendations using a color palette
US10402917B2 (en) Color-related social networking recommendations using affiliated colors
US9679532B2 (en) Automatic image-based recommendations using a color palette
US9542704B2 (en) Automatic image-based recommendations using a color palette
US10120880B2 (en) Automatic image-based recommendations using a color palette
US10255295B2 (en) Automatic color validation of image metadata
US10235389B2 (en) Identifying data from keyword searches of color palettes
CN104750798B (en) Recommendation method and device for application program
US9418138B2 (en) Method and system for determining sets of variant items
CN104636371B (en) Information recommendation method and equipment
CN111400507B (en) Entity matching method and device
CN110019646B (en) Method and device for establishing index
CN112100396B (en) Data processing method and device
US20130212105A1 (en) Information processing apparatus, information processing method, and program
WO2018176913A1 (en) Search method and apparatus, and non-temporary computer-readable storage medium
US20130179418A1 (en) Search ranking features
CN111353838A (en) Method and device for automatically checking commodity category
US9710538B2 (en) Processing device, processing method, program, and recording medium
CN110162645A (en) Image search method, device and electronic equipment based on index
KR101985603B1 (en) Recommendation method based on tripartite graph
US20160148095A1 (en) Electronic calculating apparatus, method thereof and non-transitory machine-readable medium thereof for sensing context and recommending information
CN113892110A (en) Image-based dish recognition device and method
WO2021088589A1 (en) Task query method and device
CN116992053A (en) File query method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant