CN104424296A

CN104424296A - Query word classifying method and query word classifying device

Info

Publication number: CN104424296A
Application number: CN201310392966.XA
Authority: CN
Inventors: 丘鹏; 史可可
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2013-09-02
Filing date: 2013-09-02
Publication date: 2015-03-18
Anticipated expiration: 2033-09-02
Also published as: CN104424296B; HK1204685A1

Abstract

The invention provides a query word classifying method and a query word classifying device. The method comprises the following steps: acquiring an input query word; generating a leaf industry category distribution vector which corresponds to the query word and comprises the industry categories of at least one leaf node corresponding to the query word; determining the industry category of the leaf node to which the query word belongs according to the category word vectors of the industry categories of the leaf nodes corresponding to the query word; and determining the industry category of the query word according to the determined industry category of the leaf node and a preset mapping relation of the industry categories at all levels. Each industry category of each level of nodes corresponds to a category word vector, and each category word vector comprises a plurality of category words. With the method and the device, industry query results are more accurate and query efficiency is higher.

Description

Query term classification method and device

Technical Field

The invention relates to the technical field of computers, in particular to a query term classification method and a query term classification device.

Background

Electronic commerce is the electronization, digitization and networking of the whole business process by means of computer technology, network technology and remote communication technology. With the development of electronic commerce, more and more specialized e-commerce websites have emerged. In today's specialized e-commerce websites, how to perform industry-oriented vertical subdivision is a focus of work in the B2B (Business to Business) market.

However, on one hand, owing to user habits, users often do not actively navigate to the industry market pages, so these pages receive little user traffic, their business does not develop, and effective industry search cannot be carried out through them. On the other hand, in existing search engine technology, when a user inputs a query word, only the direct results corresponding to the query word are returned, and no industry division is performed. For example, when the user inputs "alcohol", content related to "alcohol" is returned directly, without any industry division based on the query word, so it cannot be distinguished whether the user wants medical alcohol, edible alcohol or industrial alcohol. The user has to refine the query further to obtain accurate results, so the query is inaccurate and query efficiency is low.

Therefore, in the prior art, the industry which needs to be queried by the user cannot be determined according to the query word of the user, so that effective industry query cannot be performed, the query is not accurate enough, and the query efficiency is low.

Disclosure of Invention

The invention provides a query term classification method and a query term classification device, which are used for solving the problems that in the prior art, the industry which needs to be queried by a user cannot be determined according to the query term of the user, so that effective industry query cannot be carried out, the query is not accurate enough, and the query efficiency is low.

In order to solve the above problems, the present invention discloses a query term classification method, comprising: acquiring an input query word, and generating a leaf industry category distribution vector corresponding to the query word, wherein the leaf industry category distribution vector comprises industry categories of at least one leaf node corresponding to the query word, each industry category of each level of nodes corresponds to a category word vector, and each category word vector comprises a plurality of category words; determining the industry category of the leaf node to which the query word belongs according to the industry category word vector of each leaf node corresponding to the query word; and determining the industry category where the query word is located according to the determined industry category of the leaf node and the preset mapping relation of various levels of industry categories.

In order to solve the above problem, the present invention also discloses a query term classification device, comprising: a first generation module, configured to acquire an input query word and generate a leaf industry category distribution vector corresponding to the query word, wherein the leaf industry category distribution vector comprises the industry categories of at least one leaf node corresponding to the query word, each industry category of each level of nodes corresponds to a category word vector, and each category word vector comprises a plurality of category words; a determining module, configured to determine the industry category of the leaf node to which the query word belongs according to the category word vectors of the industry categories of the leaf nodes corresponding to the query word; and a first classification module, configured to determine the industry category where the query word is located according to the determined industry category of the leaf node and the preset mapping relation of the industry categories at all levels.

Compared with the prior art, the invention has the following advantages:

A leaf industry category distribution vector corresponding to the query word is generated; the vector comprises the industry categories of one or more leaf nodes corresponding to the query word, and each industry category at each level has a category word vector comprising a plurality of category words. The industry category of the leaf node to which the query word belongs is accurately determined according to the relation between the query word and the category word vectors of the leaf-node industry categories, and the industry category where the query word is located is then determined according to the preset mapping relation of the industry categories at all levels. The mapping relation is divided into a plurality of levels from high to low: the bottom level consists of the industry categories of the leaf nodes, the top level is the industry category of the root node, and between them there may be one or more levels of intermediate-node industry categories, or none at all. After the industry category of the leaf node to which the query word belongs is accurately determined, the industry category where the query word is located can be determined in any appropriate manner according to the mapping relation, and the query word is classified into the determined industry category.

According to the invention, the industry that the user needs to query can be determined from the query word, and the information corresponding to the query word within that industry is then provided to the user for accurate in-industry query. Effective in-industry query is thereby realized, and the user's query efficiency and search experience are improved. Meanwhile, because the industry category of the query word is located starting from the leaf nodes, the industry query result is more accurate and the query efficiency is higher.

Drawings

FIG. 1 is a flowchart illustrating steps of a method for classifying query terms according to a first embodiment of the present invention;

FIG. 2 is a flowchart illustrating steps of a method for classifying query terms according to a second embodiment of the present invention;

FIG. 3 is a flowchart illustrating steps of a method for classifying query terms according to a third embodiment of the present invention;

FIG. 4 is a schematic flow chart illustrating a process of generating a correspondence between a query term and an industry leaf category in the embodiment shown in FIG. 3;

FIG. 5 is a diagram of an industry category mapping at various levels in the embodiment shown in FIG. 3;

FIG. 6 is a flowchart of one specific implementation of the embodiment shown in FIG. 3;

FIG. 7 is a block diagram showing a structure of a query word classifying device according to a fourth embodiment of the present invention;

FIG. 8 is a schematic diagram of a system architecture for performing industry recognition by using a query term classification device according to a fifth embodiment of the present invention;

FIG. 9 is a schematic diagram of an industrialized search pullout in the embodiment shown in FIG. 8.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

Embodiment one

Referring to fig. 1, a flowchart illustrating steps of a query term classification method according to a first embodiment of the present invention is shown.

The query term classification method of the embodiment comprises the following steps:

Step S102: acquiring an input query term, and generating a leaf industry category distribution vector corresponding to the query term.

The leaf industry category distribution vector corresponding to the query term comprises the industry categories of at least one leaf node corresponding to the query term. The industry categories may have multiple levels of nodes, each level of nodes may have multiple industry categories, each industry category of each level of nodes corresponds to a category word vector, and each category word vector includes multiple category words.

When the multi-level industry categories are represented by a tree structure, the leaf nodes of the tree are the nodes of the lowest level, and the industry categories at that level are the industry categories of the leaf nodes. One query term may correspond to the industry category of one leaf node, or to the industry categories of a plurality of leaf nodes; the industry categories of these one or more leaf nodes form the leaf industry category distribution vector corresponding to the query term.

When determining the industry categories of the leaf nodes corresponding to the query terms, a person skilled in the art can process the query terms input by the user according to actual conditions and by referring to the existing query term processing mode and correspond to the industry categories of the corresponding leaf nodes. Of course, if the query term input by the user is more standard, the query term can directly correspond to the industry category of the corresponding leaf node without processing the query term.

Step S104: determining the industry category of the leaf node to which the query word belongs according to the category word vectors of the industry categories of the leaf nodes corresponding to the query word.

As described above, one query term may correspond to the industry categories of a plurality of leaf nodes. Narrowing the query term down to one, or a small set number, of leaf-node industry categories increases the accuracy of industry query and classification on the one hand and, on the other hand, improves efficiency because the volume of data for subsequent query and classification is reduced.

In this embodiment, the query term and the category term in the category term vector of the industry category of each leaf node are subjected to corresponding matching and processing operations, so that the industry category of the leaf node to which the query term should belong can be accurately determined. For example, the industry category of the leaf node to which the query word belongs is determined according to the matching degree or similarity between the query word and the category word vector, or the industry category of the leaf node to which the query word belongs is determined by setting the weight of the category word in the category word vector and the weight of the category word matched with the query word, and the like.

Step S106: determining the industry category where the query word is located according to the determined industry category of the leaf node and the preset mapping relation of the industry categories at all levels.

The industry categories may have multiple levels, each level has a corresponding relationship, and the corresponding relationship between the levels forms a mapping relationship of the industry categories of each level.

The industry category of each leaf node in the leaf industry category distribution vector has one or more corresponding superior industry categories. After the industry category of the leaf node to which the query word belongs is determined, the industry category where the query word is located can be determined according to the superior industry categories of that leaf node. For example, if the industry category of the leaf node is determined to be "one-piece dress", its direct superior industry category is "women's dress", and the superior industry category of "women's dress" is "clothing". Therefore, when the query word input by the user is "white dress" and the industry category of the leaf node to which it belongs is determined to be "one-piece dress", one possible way is to map to the industry category of the root node by default, so that the industry category where "white dress" is located is determined to be "clothing". Of course, this is not limiting; in practical applications, those skilled in the art may set appropriate rules according to the actual situation, for example gender-related rules under which the industry category where "white dress" is located is determined to be "women's dress".
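The upward lookup in this example can be sketched as follows. This is only a minimal illustration, assuming the preset mapping relation is stored as a child-to-parent dictionary; the names `PARENT`, `path_to_root` and `classify_default` are hypothetical and do not come from the invention.

```python
# Hypothetical child -> parent mapping for the hierarchy used in the example.
PARENT = {
    "one-piece dress": "women's dress",
    "women's dress": "clothing",
    "clothing": None,  # the root node has no superior industry category
}


def path_to_root(leaf, parent=PARENT):
    """Return the chain of industry categories from a leaf node up to the root."""
    path = [leaf]
    while parent.get(path[-1]) is not None:
        path.append(parent[path[-1]])
    return path


def classify_default(leaf, parent=PARENT):
    """Default rule from the example: use the industry category of the root node."""
    return path_to_root(leaf, parent)[-1]
```

A different rule, such as the gender-related rule mentioned above, would simply pick another element of the path returned by `path_to_root`.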

Through this embodiment, a leaf industry category distribution vector corresponding to the query word is generated; the vector comprises the industry categories of one or more leaf nodes corresponding to the query word, and each industry category at each level has a category word vector comprising a plurality of category words. The industry category of the leaf node to which the query word belongs is accurately determined according to the relation between the query word and the category word vectors of the leaf-node industry categories, and the industry category where the query word is located is then determined according to the preset mapping relation of the industry categories at all levels. The mapping relation is divided into a plurality of levels from high to low: the bottom level consists of the industry categories of the leaf nodes, the top level is the industry category of the root node, and between them there may be one or more levels of intermediate-node industry categories, or none at all. After the industry category of the leaf node to which the query word belongs is accurately determined, the industry category where the query word is located can be determined in any appropriate manner according to the mapping relation, and the query word is classified into the determined industry category.

Through this embodiment, the industry that the user needs to query can be determined from the query word, and the information corresponding to the query word within that industry is provided to the user for accurate in-industry query. Effective in-industry query is thereby realized, and the user's query efficiency and search experience are improved. Meanwhile, because the industry category of the query word is located starting from the leaf nodes, the industry query result is more accurate and the query efficiency is higher.

Embodiment two

Referring to fig. 2, a flowchart illustrating steps of a query term classification method according to a second embodiment of the present invention is shown.

Step S202: generating the correspondence between query words and the industry categories of leaf nodes.

The corresponding relation between the query words and the industry categories of the leaf nodes can be generated in the following mode: acquiring search click information of a user, wherein the search click information comprises: query word information input by a user and commodity information clicked by the user according to a query result; and acquiring the corresponding relation between the query word and the industry leaf category according to the search click information and the corresponding relation between the commodity and the category to which the commodity belongs.

In specific implementation, at least one of the following ways may be adopted to generate the corresponding relationship between the query term and the industry class of the leaf node.

In a first mode, a user search click log is obtained, wherein the user search click log includes search click information of a plurality of users, and the search click information includes: query word information input by a user, commodity information clicked by the user according to a query result, and the category information to which the commodities belong. Statistical analysis is carried out on the search click information in the log to obtain the correspondence between query word information and commodity information and the correspondence between commodity information and the category information to which the commodities belong. The correspondence between query words and industry leaf categories is then generated from these two correspondences.

In a second mode, search click information of the user is acquired, wherein the search click information includes: query word information input by a user and commodity information clicked by the user according to a query result. The commodity information and the category information to which the commodity belongs are acquired from the commodity information title, and the correspondence between them is determined. The correspondence between query words and industry leaf categories is then obtained according to the search click information and the correspondence between commodity information and the category information to which the commodity belongs.
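Both modes amount to joining click records with a commodity-to-category table and counting. The following is a minimal sketch of that join, assuming the log has already been reduced to (query word, commodity) pairs; the records, commodity identifiers and the function name `build_query_to_leaf` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical click records: (query word, clicked commodity id).
CLICKS = [
    ("korean chiffon skirt", "item-1"),
    ("korean chiffon skirt", "item-2"),
    ("korean chiffon skirt", "item-1"),
    ("alcohol", "item-3"),
]

# Hypothetical correspondence between commodities and their leaf categories.
ITEM_CATEGORY = {
    "item-1": "one-piece dress",
    "item-2": "princess dress",
    "item-3": "industrial alcohol",
}


def build_query_to_leaf(clicks, item_category):
    """Count, per query word, how many clicks landed in each leaf category."""
    table = defaultdict(Counter)
    for query, item in clicks:
        category = item_category.get(item)
        if category is not None:  # skip commodities whose category is unknown
            table[query][category] += 1
    return table


query_to_leaf = build_query_to_leaf(CLICKS, ITEM_CATEGORY)
```

The resulting counts are one possible representation of the correspondence between query words and industry leaf categories; a production system would aggregate over a full click log rather than an in-memory list.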

It should be noted that the correspondence between the query term and the industry category of the leaf node may be implemented by the implementer of the present invention, or may be provided by a third party, for example, generated by a third party generation tool, or implemented by accessing third party stored data.

Step S204: acquiring an input query word, and determining the industry category of at least one leaf node corresponding to the query word and meeting a set rule according to the corresponding relation between the query word and the industry category of the leaf node; and generating a leaf industry category distribution vector corresponding to the query word by using the determined industry categories of the leaf nodes.

The set rule may be defined by those skilled in the art as appropriate to the actual situation, so as to select one, several or all of the leaf-node industry categories corresponding to the query term.
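One conceivable set rule is to keep only the leaf categories that account for a minimum share of the clicks for a query word. The sketch below illustrates this under that assumption; the rule itself, the parameter `min_share`, the function name `leaf_distribution` and the click counts are all hypothetical, since the text leaves the rule open.

```python
def leaf_distribution(category_counts, min_share=0.05):
    """Keep leaf categories whose share of clicks reaches min_share.

    The returned list, ordered by descending click count, plays the role of
    the leaf industry category distribution vector.
    """
    total = sum(category_counts.values())
    if total == 0:
        return []
    ranked = sorted(category_counts.items(), key=lambda kv: kv[1], reverse=True)
    return [category for category, count in ranked if count / total >= min_share]


# Hypothetical click counts for one query word.
counts = {
    "one-piece dress": 50,
    "Japanese-Korean women's dress": 25,
    "princess dress": 15,
    "stock women's dress": 8,
    "women's coat": 2,
}
vector = leaf_distribution(counts)
```

With these counts, "women's coat" falls below the 5% share and is dropped, leaving four leaf categories in the vector.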

As can be seen, the leaf industry category distribution vector includes the industry category of at least one leaf node corresponding to the query word, and for all the level nodes and all the industry categories, each industry category of each level node corresponds to one category word vector, and each category word vector includes a plurality of category words. Preferably, each category word in each category word vector may be weighted.

For example, when the user inputs "Korean chiffon skirt", the system maps the query word, according to the established correspondence between query words and the industry categories of leaf nodes, to the industry categories of four leaf nodes: "one-piece dress", "stock women's dress", "princess dress" and "Japanese-Korean women's dress". Each industry category corresponds to a category word vector. For example, the category words "one-piece dress", "skirt", "chiffon skirt" and "Korean version" under the category "one-piece dress" form the category word vector of "one-piece dress"; the category words "stock women's dress", "skirt", "one-piece dress", "chiffon skirt", "women's coat" and "women's dress" form the category word vector of "stock women's dress"; the category words "princess dress", "one-piece dress", "skirt" and "chiffon skirt" form the category word vector of "princess dress"; and the category words "Japanese-Korean women's dress", "skirt", "one-piece dress" and "chiffon skirt" form the category word vector of "Japanese-Korean women's dress". The category word vectors of the industry categories are established by big-data statistical methods. From the above, the leaf industry category distribution vector corresponding to "Korean chiffon skirt" may be {one-piece dress, stock women's dress, princess dress, Japanese-Korean women's dress}.

Preferably, each category word in a category word vector may also be weighted, for example {one-piece dress, 0.5; skirt, 0.3; chiffon skirt, 0.6; Korean version, 0.2}. Setting weights makes the industry categories more targeted and better aligned with users' common habits, so that the industry category corresponding to a query word can be determined more quickly.

In addition, the query term may be a term expressing the user's intention that is filled in a search box when the user uses a search engine. However, it should be noted that input is not limited to the search box; other input methods, such as clicking corresponding items or links, also fall within the scope of "input" in the present invention.

Step S206: determining the industry category of the leaf node to which the query word belongs according to the category word vectors of the industry categories of the leaf nodes corresponding to the query word.

One way is to preprocess the query word into at least one query keyword, match the query keywords against the category words in the category word vector of each industry category, and determine the industry category of the leaf node to which the query word belongs according to the matching degree. For example, "Korean chiffon skirt" is preprocessed into two query keywords, "Korean version" and "chiffon skirt". After matching, the category word vector of "one-piece dress" has the highest matching degree with these keywords, so "one-piece dress" is determined as the industry category of the leaf node to which "Korean chiffon skirt" belongs.

Preferably, when the category words have weights, one possible way is: preprocessing the query word to generate at least one query keyword; matching the at least one query keyword against the category word vector of the industry category of each leaf node; determining the weight of the industry category of each leaf node with respect to the query word according to the weights of the successfully matched category words in that category's category word vector; and determining the industry category of the leaf node with the highest weight as the industry category of the leaf node to which the query word belongs. Still taking "Korean chiffon skirt" as an example, assume that the category word weights of the category word vector of "one-piece dress" are {one-piece dress, 0.8; skirt, 0.3; chiffon skirt, 0.6; Korean version, 0.2}; those of "stock women's dress" are {stock women's dress, 0.9; women's dress, 0.8; skirt, 0.6; one-piece dress, 0.6; chiffon skirt, 0.3; frock, 0.4; panty, 0.2; women's suit, 0.3}; those of "princess dress" are {princess dress, 0.9; one-piece dress, 0.7; skirt, 0.5; chiffon skirt, 0.4}; and those of "Japanese-Korean women's dress" are {Japanese-Korean women's dress, 0.9; skirt, 0.8; one-piece dress, 0.8; chiffon skirt, 0.6}.

Taking the simplest weight addition as an example, after "Korean chiffon skirt" is preprocessed into "Korean version" and "chiffon skirt", the keywords are matched against the category word vector of the industry category of each leaf node, with the following result: "one-piece dress" {chiffon skirt, 0.6; Korean version, 0.2}, "stock women's dress" {chiffon skirt, 0.3}, "princess dress" {chiffon skirt, 0.4}, and "Japanese-Korean women's dress" {chiffon skirt, 0.6}. After the matched weights are added, the weight of "one-piece dress" with respect to "Korean chiffon skirt" is 0.8, that of "stock women's dress" is 0.3, that of "princess dress" is 0.4, and that of "Japanese-Korean women's dress" is 0.6. By comparing the weights, the industry category with the highest weight, namely "one-piece dress", is determined as the industry category of the leaf node to which "Korean chiffon skirt" belongs. It should be noted that this treatment of the weights is only a simple illustration; in practical applications, a person skilled in the art may process the obtained category word weights in various ways according to actual needs, and the present invention is not limited in this respect.
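The weight addition in this example can be reproduced with a short sketch. The weights are those assumed in the text (the "stock women's dress" vector is abridged); the preprocessing step is taken as given, and the names `CATEGORY_WORD_VECTORS`, `score_leaf_categories` and `best_leaf_category` are hypothetical.

```python
# Category word vectors with the weights assumed in the example.
CATEGORY_WORD_VECTORS = {
    "one-piece dress": {
        "one-piece dress": 0.8, "skirt": 0.3,
        "chiffon skirt": 0.6, "Korean version": 0.2,
    },
    "stock women's dress": {  # abridged
        "stock women's dress": 0.9, "women's dress": 0.8, "skirt": 0.6,
        "one-piece dress": 0.6, "chiffon skirt": 0.3,
    },
    "princess dress": {
        "princess dress": 0.9, "one-piece dress": 0.7,
        "skirt": 0.5, "chiffon skirt": 0.4,
    },
    "Japanese-Korean women's dress": {
        "Japanese-Korean women's dress": 0.9, "skirt": 0.8,
        "one-piece dress": 0.8, "chiffon skirt": 0.6,
    },
}


def score_leaf_categories(keywords, vectors):
    """Add up the weights of the category words matched by the query keywords."""
    return {
        category: sum(words.get(keyword, 0.0) for keyword in keywords)
        for category, words in vectors.items()
    }


def best_leaf_category(keywords, vectors):
    """Return the leaf category with the highest summed weight."""
    scores = score_leaf_categories(keywords, vectors)
    return max(scores, key=scores.get)


# Assumed output of preprocessing the query word "Korean chiffon skirt".
keywords = ["Korean version", "chiffon skirt"]
scores = score_leaf_categories(keywords, CATEGORY_WORD_VECTORS)
leaf = best_leaf_category(keywords, CATEGORY_WORD_VECTORS)
```

Plain addition is only the simplest choice; any other aggregation of the matched weights could replace the `sum` call without changing the rest of the sketch.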

Step S208: determining the industry category where the query word is located according to the determined industry category of the leaf node and the preset mapping relation of the industry categories at all levels.

One of the simplest ways is to move up level by level from the industry category of the leaf node until an industry category that meets the set condition is reached.

A preferable mode is as follows: determining the upper-level industry category corresponding to the industry category of the leaf node according to the preset mapping relation of the industry categories at all levels; acquiring the weight of that upper-level industry category with respect to the query word; judging whether this weight exceeds a second set threshold; if so, determining the upper-level industry category whose weight exceeds the second set threshold as the industry category where the query word is located; if not, continuing to search upward until an industry category whose weight exceeds the second set threshold is found, or until the industry category of the root node is reached. The weight of an upper-level industry category with respect to the query word may be obtained in the same way as the weight of a leaf-node industry category is determined in step S206. The second set threshold is set by those skilled in the art according to actual needs, and the present invention is not limited in this respect.
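The upward search with a threshold can be sketched as below. The sketch assumes the mapping relation is a child-to-parent dictionary and that the weights of the upper-level categories with respect to the query word have already been computed; the names `PARENT`, `UPPER_WEIGHTS` and `climb_to_category` and all numeric values are hypothetical.

```python
# Hypothetical child -> parent mapping; the root maps to None.
PARENT = {
    "one-piece dress": "women's dress",
    "women's dress": "clothing",
    "clothing": None,
}

# Hypothetical weights of the upper-level categories for one query word.
UPPER_WEIGHTS = {"women's dress": 0.5, "clothing": 0.9}


def climb_to_category(leaf, parent, weights, threshold):
    """Search upward from the leaf's superior category.

    Stops at the first category whose weight exceeds the threshold,
    or at the root node if no category does.
    """
    current = parent.get(leaf)
    if current is None:
        return leaf  # the leaf is itself the root
    while weights.get(current, 0.0) <= threshold and parent.get(current) is not None:
        current = parent[current]
    return current
```

For instance, with a threshold of 0.4 the search stops at "women's dress", while a threshold of 0.7 carries it up to "clothing".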

In another mode, absolute value weighting is carried out according to the matching degree of the query words and the category names of various industry categories; and determining the industry category where the query word is located according to the absolute value weighting result and the preset mapping relation of all levels of industry categories.

For example, if the query word input by the user is "industrial alcohol", the relevant categories "industrial alcohol", "medical alcohol" and "edible alcohol" may be returned. The original classification relies mainly on two features: the statistical mapping relation between query words and categories, and the text relevance between query words and categories. (To obtain the text relevance, Chinese word segmentation and part-of-speech tagging are first performed on the titles and attributes filled in by users for all commodities under a category, and the mapping from the category to a vector of segmented terms is obtained through statistics.) For example, the query word "industrial alcohol" completely matches the category "industrial alcohol", whereas "medical alcohol" and "edible alcohol" match only partially; more weight can therefore be added to the category "industrial alcohol", and a suitably smaller weight to the partially matched "medical alcohol" and "edible alcohol".

One formula for increasing the weight is as follows:

ContextWeight = FactorContext * 2 * SameTW / (SumQW + SumCW)

where SameTW represents the sum of the TermWeight values of the segmented terms shared by the query word and the category name; SumQW represents the sum of the TermWeight values of all segmented terms of the query word; SumCW represents the sum of the TermWeight values of all segmented terms of the category name; and FactorContext represents the weight factor for the absolute-value match between the query word and the category name. This parameter can be adjusted through a parameter-tuning system while the effect of the algorithm is being optimized, according to the matching degree between the query word and the category.
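A small numeric sketch may help here. It reads the formula as FactorContext * 2 * SameTW / (SumQW + SumCW) and assumes that a shared term contributes its query-side TermWeight to SameTW; that assumption, the function name `context_weight` and the term weights below are hypothetical, not values from the invention.

```python
def context_weight(query_terms, category_terms, factor_context=1.0):
    """Compute FactorContext * 2 * SameTW / (SumQW + SumCW).

    query_terms and category_terms map each segmented term to its TermWeight.
    """
    sum_qw = sum(query_terms.values())
    sum_cw = sum(category_terms.values())
    if sum_qw + sum_cw == 0:
        return 0.0
    # Assumption: a shared term contributes its query-side weight to SameTW.
    same_tw = sum(w for term, w in query_terms.items() if term in category_terms)
    return factor_context * 2 * same_tw / (sum_qw + sum_cw)


# Hypothetical TermWeight values for the segmented terms.
query = {"industrial": 60, "alcohol": 150}
exact = context_weight(query, {"industrial": 60, "alcohol": 150})
partial = context_weight(query, {"medical": 60, "alcohol": 150})
```

Under these assumed weights, the completely matched category "industrial alcohol" receives the full factor, while the partially matched "medical alcohol" receives a smaller share, which is the behaviour described in the text.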

In yet another mode, Chinese word segmentation and tagging information of the query word is obtained, and the industry category of the query word is determined according to the preset mapping relation of the industry categories at all levels, the Chinese word segmentation and tagging information of the query word, and a set rule.

For example, the system in this embodiment uses AliWS, a word segmentation tool based on an e-commerce Chinese thesaurus, and TermWeight, a part-of-speech tagging tool. AliWS can segment query words and commodity titles; for example, the query word "chiffon one-piece dress" is segmented into "chiffon / one-piece dress". TermWeight can tag the part of speech of each segment; for the above example, "chiffon (modifier) / one-piece dress (product word)". Through Chinese word segmentation and part-of-speech tagging, the user's main intention can be analyzed; for example, the product words, main modifiers, region words, product models and other information carried by the query word can be identified. An advantage is that, by also segmenting commodity information such as commodity titles and attributes, the algorithm model can be optimized and the accuracy of matching the commodities the user intends can be improved. Meanwhile, TermWeight can assign a weight reflecting the importance of each segment. For example, in "chiffon one-piece dress", "one-piece dress" is the core product word and indicates the main purpose of the user's search, so it is more important; its weight may be set to 150 (an assumed value; the actual weight changes according to the effect required by the project), while the weight of "chiffon" is set to 60.

In yet another mode, an entropy value of the industry categories is calculated, and the industry category where the query word is located is determined according to the entropy value and the preset mapping relation of the industry categories at each level.

The entropy value is the information entropy of the industry categories at each level. Different query words cover different ranges of categories. For example, the entropy of the industry categories corresponding to "clothing" is greater than the entropy of the industry categories corresponding to "men's jeans". Therefore, the entropy values of the industry categories at each level can be used as a feature for identifying the level at which the query word should be classified.
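The entropy feature described above can be illustrated with a minimal sketch. It assumes that the category distribution of a query word is available as a mapping from category name to click weight; the function name `category_entropy` and the sample weights are hypothetical and serve only as an illustration.

```python
import math


def category_entropy(weights):
    """Shannon entropy (in bits) of a category-weight distribution."""
    total = sum(weights.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for w in weights.values():
        if w > 0:
            p = w / total          # normalise the weight to a probability
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical click weights over leaf categories for two query words.
clothing = {"one-piece dress": 30, "men's jeans": 25,
            "children's skirt": 25, "sweater": 20}
mens_jeans = {"men's jeans": 95, "men's shirt": 5}

# A broad query word spreads over many categories and has the higher entropy.
print(category_entropy(clothing) > category_entropy(mens_jeans))
```

A high entropy suggests that the query word should be classified at an upper level of the category tree, and a low entropy suggests that a leaf-level category can be trusted.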

In addition, it should be noted that the mapping relationship of each level of industry category may be implemented by the implementer of the present invention, or may be provided by a third party, such as by a third party generation tool, or by accessing a third party stored data.

When the industry category determined for the query word through the above steps S202 to S208 meets the requirement, the query word may be directly classified into the determined industry category. If the requirement cannot be met, for example because the weight of the industry category is smaller than the second set threshold, the following supplementary scheme can be adopted to continue determining the industry category where the query word is located.

Step S210: judging whether the industry category where the determined query word is located meets a set standard, if so, classifying the query word into the determined industry category, and executing the step S214; if not, go to step S212.

The setting criteria are set by those skilled in the art according to actual conditions, and the present invention is not limited thereto.

Preferably, whether the determined industry category of the query word meets the set standard or not can be judged by judging whether the recall number of the industry category of the query word meets a third set threshold or not and/or judging whether the click number of the industry category of the query word meets a fourth set threshold or not. The third set threshold and the fourth set threshold may be set by those skilled in the art according to practical situations, and the present invention is not limited thereto.

Step S212: determining the industry category where the query word is located by using the first identification rule and/or the second identification rule, and executing step S214.

The first identification rule is used for determining the industry category where the query word is located according to the industry categories of other leaf nodes except the industry category of the leaf node in the leaf industry category distribution vector; the second identification rule is used for determining the industry category where the query word is located according to the weight of the query word in each industry category.

Preferably, when the first identification rule is used to determine the industry category where the query word is located, the industry category where the query word is located may be determined according to the query word and the category word vectors of the industry categories of other leaf nodes except the industry category of the leaf node in the leaf industry category distribution vector.

Still taking "Korean chiffon skirt" as an example, assume that, besides {one-piece dress, stock dress, princess dress, Japanese dress}, the terms "Korean version" and "chiffon skirt" also carry heavy weights in the category "foreign trade skirt", but that for some reason this category was not given priority by the system and does not appear in the leaf industry category distribution vector generated earlier. When the finally determined industry category cannot meet the requirement, the first identification rule may be used, for example by setting an identification threshold, to recall the "foreign trade skirt" category into the leaf industry category distribution vector of the query word "Korean chiffon skirt". Then, with "foreign trade skirt" as the determined industry category of the leaf node, the industry category where the query word is located is finally determined according to the preset mapping relation of the industry categories at each level.

Preferably, when the second identification rule is used to determine the industry category where the query word is located, whether the weight of the query word in each industry category is greater than or equal to a first set threshold value or not may be determined step by step from the industry category of the root node to the industry category of the leaf node according to the category word vector corresponding to each industry category, and the industry category corresponding to the weight greater than or equal to the set threshold value is determined as the industry category where the query word is located.

Still taking "Korean chiffon skirt" as an example, when the second identification rule is used, the relationship between terms and industry categories is established. For example, the terms "Korean version" and "chiffon skirt" carry heavier weights under the second-level industry category "women's clothing" than under the second-level industry categories "men's clothing" and "children's clothing". A Bayes classifier is used to determine the probabilities that the query word "Korean chiffon skirt" is identified as "women's clothing", "men's clothing" and "children's clothing" respectively. An identification threshold is set; when the probability between the query word and an industry category is greater than the threshold, a bridge is built between them and the query word is considered to identify that industry category. This avoids identification errors caused by click bias in the bottom-up mapping process. For example, the industry category of the root node is "clothing", the industry categories of the second-level nodes below the root node are "women's clothing", "men's clothing" and "children's clothing", and the Bayes classifier determines that, for the query word "Korean chiffon skirt", the probabilities of identifying "clothing", "women's clothing", "men's clothing" and "children's clothing" are 0.6, 0.9, 0.2 and 0.3 respectively. With the identification threshold set to 0.8, when the user inputs "Korean chiffon skirt", the search proceeds downwards from the root node; "clothing" (0.6) does not meet the threshold, whereas "women's clothing" (0.9) does, so the industry category where "Korean chiffon skirt" is located is "women's clothing".

As another example, because a website carries many women's clothing products, its transaction and click data are concentrated mainly on women's clothing. Under steps S202 to S208, the query word "short-sleeve T-shirt" would first be identified to the leaf-node industry category "women's short-sleeve T-shirt" and only then be mapped to "clothing" through the mapping relation of the industry categories at each level. With the bridge built from query words to industry categories, "short-sleeve T-shirt" can skip the leaf-node industry category and be identified directly to the first-level industry category "clothing", which avoids the category bias problem.
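The step-by-step search from the root node under the second identification rule can be sketched as follows. This is an illustration under stated assumptions: the probabilities are taken as given (in the embodiment they come from a Bayes classifier, which is not reproduced here), the traversal order is assumed to be breadth-first, and the names `find_category_top_down`, `tree` and `prob` are hypothetical.

```python
from collections import deque


def find_category_top_down(tree, root, prob, threshold):
    """Search from the root downwards, level by level, and return the first
    category whose probability reaches the threshold (None if there is none)."""
    queue = deque([root])
    while queue:
        category = queue.popleft()
        if prob.get(category, 0.0) >= threshold:
            return category
        queue.extend(tree.get(category, []))   # descend to the child categories
    return None


# Category tree and probabilities from the "Korean chiffon skirt" example.
tree = {"clothing": ["women's clothing", "men's clothing", "children's clothing"]}
prob = {"clothing": 0.6, "women's clothing": 0.9,
        "men's clothing": 0.2, "children's clothing": 0.3}

print(find_category_top_down(tree, "clothing", prob, 0.8))
```

With the threshold 0.8 the root "clothing" (0.6) is passed over and "women's clothing" (0.9) is returned, matching the example above.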

In addition, when the category words are provided with the weights, the weights of the category words in each category word vector can be adjusted according to the matching results of the query words and the category words in each category word vector and by referring to artificially labeled training data, so that the weights of the category words can reflect the actual classification results, and further the classification of the follow-up query words is more accurate.

Step S214: displaying the query result under the industry category where the query word is located by using the set industrialized search heading or the set industrialized search result display template.

This step is an optional step and in some applications it is not necessary to display the query results.

However, displaying the query result through the set industrialized search heading or the set industrialized search result display template provides the user with accurate in-industry queries and at the same time promotes the development of industry market pages.

According to this embodiment, the industry that the user needs to query is determined from the query word, so that industry queries are effectively realized and the query efficiency and search experience of the user are improved; meanwhile, the industry category of the query word is located accurately starting from the leaf nodes, so that the industry query result is more accurate and the query efficiency is higher.

EXAMPLE III

Referring to fig. 3, a flowchart illustrating steps of a method for classifying query terms according to a third embodiment of the present invention is shown.

Step S302: generating and storing the corresponding relation between the query words and the industry categories of the leaf nodes.

One way is to generate and store the corresponding relation between the query word and the industry category of the leaf node from the search click information in the user search click log. In this way, the front end of the search page needs to record click behavior; for example, a parameter can be added directly to the URLs of commodity information and category navigation links to indicate that a click on the URL is a search click by the user. The front-end search page can obtain the user information, the query word information, the commodity information and the category information of the commodities on the page. Finally, this information is combined into data with a standard format and stored in a data warehouse as the user search click log. The format specification can be set by those skilled in the art according to the storage settings of the database used, so that the information can be stored normally in the data warehouse. For user search click logs with massive data, a distributed computing system can be used to store the large-scale log data, so as to ensure that computing tasks are completed on time.

Then, using the stored user search click log, all Offer Ids clicked after a user query (an Offer Id is a commodity identifier, one kind of commodity information) are joined with the Offer (commodity supply) information table of the website, which contains the display leaf category and the release leaf category of each Offer. The category to which the commodity indicated by each commodity identifier belongs is obtained from the Offer information table, and the mapping relation from the query word to the industry categories of the leaf nodes is then obtained by counting. The mapping relationship generated in this manner is also referred to as a mapping relationship generated from click features.
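The join and counting described above can be sketched in a few lines. The sketch assumes that the click log has already been reduced to (query word, Offer Id) pairs and that the Offer information table is available as a mapping from Offer Id to leaf category; the function name `build_query_to_leaf_distribution` and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_query_to_leaf_distribution(click_log, offer_table):
    """Join clicked Offer Ids with the Offer table and count, for each query
    word, how often each leaf category was clicked."""
    distribution = defaultdict(Counter)
    for query, offer_id in click_log:
        leaf_category = offer_table.get(offer_id)
        if leaf_category is None:
            continue                      # Offer not found in the table: skip
        distribution[query][leaf_category] += 1
    return distribution


# Hypothetical click log entries and Offer information table.
click_log = [
    ("korean chiffon skirt", "offer-1"),
    ("korean chiffon skirt", "offer-2"),
    ("korean chiffon skirt", "offer-3"),
    ("men's jeans", "offer-4"),
]
offer_table = {
    "offer-1": "one-piece dress",
    "offer-2": "one-piece dress",
    "offer-3": "chiffon skirt",
    "offer-4": "men's jeans",
}

distribution = build_query_to_leaf_distribution(click_log, offer_table)
print(dict(distribution["korean chiffon skirt"]))
```

In a production setting the same join would run on the distributed computing system mentioned above; the in-memory dictionaries here only show the shape of the computation.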

In the other way, the corresponding relation between the query word and the industry category of the leaf node is generated and stored from the user's search click information and the commodity information titles. For example, taking the Alibaba website as an example, important terms such as product words and attribute words in Offer titles, together with their weights, can be mined by the Alibaba Chinese word segmentation system; a mapping relation from industry categories to Terms (products, attributes and the like) is established, and a mapping relation from query words to industry categories is then established according to the search click information. The mapping relationship generated in this manner is also referred to as a mapping relationship generated from text relevance features.

A process of generating an industry category correspondence between a query term and a leaf node in the above two ways is shown in fig. 4, and includes: after receiving search or click input of a plurality of users, a front page of the website generates search click data; then, storing the search click data to a data warehouse; after acquiring search click data of a user from a data warehouse, the distributed computing system uses a plurality of computing nodes (3 in the example in the figure) for processing to generate query term classification data, namely data of the corresponding relation between the query term and the industry category of the leaf node; further, a dictionary index online service is established for subsequent use and lookup.

Step S304: generating and storing the mapping relation of the industry categories at each level.

For the construction of e-commerce industry categories, a tree structure can be used to generate and store the mapping relations of the industry categories at each level; this tree structure is also called an industry category tree. The bottommost categories are called the industry categories of the leaf nodes. Levels are counted from the top, and the first level is the top level, namely the industry category of the root node. An industry category tree for e-commerce is shown in fig. 5. As can be seen from fig. 5, the industry category tree has three layers: the top-level category "clothing"; the second-level categories "women's clothing", "men's clothing" and "children's clothing"; and the bottom-level industry categories, which are "one-piece dress", "sweater" and "chiffon skirt" under "women's clothing", "men's jeans" and "men's shirt" under "men's clothing", and "children's suspender trousers" and "children's skirt" under "children's clothing". That is, "one-piece dress", "sweater" and "chiffon skirt" map up to "women's clothing"; "men's jeans" and "men's shirt" map up to "men's clothing"; "children's suspender trousers" and "children's skirt" map up to "children's clothing"; and "women's clothing", "men's clothing" and "children's clothing" all map up to "clothing".
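The industry category tree of fig. 5 can be represented, for illustration, as a child-to-parent mapping. This is a minimal sketch: the names `PARENT`, `path_to_root` and `roll_up` are hypothetical, and summing leaf weights when mapping upward is an assumption made for the example.

```python
# Child -> parent mapping for the three-layer tree described for fig. 5.
PARENT = {
    "women's clothing": "clothing",
    "men's clothing": "clothing",
    "children's clothing": "clothing",
    "one-piece dress": "women's clothing",
    "sweater": "women's clothing",
    "chiffon skirt": "women's clothing",
    "men's jeans": "men's clothing",
    "men's shirt": "men's clothing",
    "children's suspender trousers": "children's clothing",
    "children's skirt": "children's clothing",
}


def path_to_root(category, parent=PARENT):
    """Return the chain of categories from the given one up to the root."""
    path = [category]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def roll_up(weights, parent=PARENT):
    """Map a weighted category distribution one level up, summing the weights
    of children that share a parent (summing is an assumption)."""
    rolled = {}
    for category, weight in weights.items():
        upper = parent.get(category)
        if upper is not None:
            rolled[upper] = rolled.get(upper, 0) + weight
    return rolled


print(path_to_root("chiffon skirt"))
print(roll_up({"one-piece dress": 5, "chiffon skirt": 3, "men's jeans": 2}))
```

Storing only the parent of each node keeps the upward mapping a simple lookup, which is the direction used when classifying a query word from the leaf nodes.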

The industry category to which the commodity belongs is multi-layered, and the data can be maintained by professional website operators. The industry category tree is stable, and adding, deleting and modifying category information requires unified operation management.

Through the industry category tree, the industry to which the industry category of a specific commodity belongs can be determined. For industry categories whose attribution is fuzzy or disputed, the industry to which the category belongs can be determined statistically from data on the commodity information that users click when browsing commodities in that category.

In addition, each category in the industry category tree has a category word vector, and the category words in each category word vector can be set with weights.

Step S306: the query terms are classified.

In this embodiment, the query term classification is implemented as a process of obtaining a category distribution with a weight through a plurality of features and a classification algorithm.

An implementation flow of classifying the query term is shown in fig. 6, and includes:

step S3062: a query term is input.

In this embodiment, query words may be input through a WEB page at the front end of the website, and the corresponding function module on the background server is called over HTTP to perform the subsequent query word classification operations. For example, the industry recognition algorithm module deployed on Alibaba's Query Process platform is called over HTTP to perform the subsequent query word classification.

Step S3064: determining the category words in the category word vector matched with the query words.

Step S3066: generating a leaf industry category distribution vector corresponding to the query word.

In this embodiment, the click statistics result takes the form "query word -> leaf industry category distribution", and the weight of each leaf-node industry category in a leaf industry category distribution vector is the sum of the weights of the matched category words.
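The weight calculation of steps S3064 and S3066 can be sketched as follows, assuming that each leaf category has a category word vector stored as a mapping from category word to weight and that the query word has already been segmented. The function name `leaf_distribution` and all weights are hypothetical.

```python
def leaf_distribution(query_terms, category_word_vectors):
    """For each leaf category, sum the weights of the category words that
    match a participle of the query word; unmatched categories are omitted."""
    distribution = {}
    for category, words in category_word_vectors.items():
        score = sum(words[term] for term in query_terms if term in words)
        if score > 0:
            distribution[category] = score
    return distribution


# Hypothetical category word vectors with integer weights.
category_word_vectors = {
    "one-piece dress": {"korean version": 60, "chiffon skirt": 150},
    "princess dress": {"chiffon skirt": 40},
    "men's jeans": {"jeans": 150},
}

print(leaf_distribution(["korean version", "chiffon skirt"], category_word_vectors))
```

For the sample data the result contains "one-piece dress" with weight 210 and "princess dress" with weight 40, while "men's jeans" is absent because none of its category words match.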

Step S3068: determining the industry category of the leaf node to which the query word belongs, and mapping upwards according to the industry category tree.

Step S30610: determining the industry category of the query word.

To finally determine at which level of industry category the query word is classified, two basic modes can be used. In mode one, the level is determined by set rules on the basis of the Chinese word segmentation and labeling results of the query word. In mode two, the industry category weight vectors of all layers are input into the parameter adjustment module, and the confidence that the query word belongs at the current layer is determined.

For example, the weights of categories and industries are adjusted according to the result of Chinese part-of-speech tagging. The query word "female T-shirt" includes the gender word "female", so the final weights of the categories and industries related to "men's clothing", which are incompatible with the query word, may be lowered.

Furthermore, the following modes may also be used:

Mode three: determining according to the click entropy value from the query word to the industry categories, namely: calculating the entropy value of the leaf industry category distribution vector corresponding to the query word as a feature for identifying the industry category level; and determining the industry category of the query word according to the entropy value.

Mode four: determining according to the text relevance weight between the query word and the industry category names, namely: applying absolute-value weighting to the leaf industry category distribution vector corresponding to the query word, so as to increase the weight of literal matches between the query word and the industry category names; and determining the industry category of the query word according to the weighting result.

Mode five: determining the industry category of a low-frequency query word, as follows.

A low-frequency query word is a query word that has been clicked fewer than a set number of times, for example 20 times, in a past period of time (for example, 180 days).

When using a search engine, users often enter query words that do not appear in the user search click log, so the basic classification feature of user clicks is unavailable for them. Moreover, for a query word whose number of search clicks is less than the set number (e.g., 20), a simple click-based classification cannot be fully trusted. For example, if a user searches for "Korean Dongdaemun long dress" and the search click log records that the user clicked Offers for "Bohemian long dress" 3 times, then using only this click data as the classification feature, without expanding the query word, would lose many correct category recalls. Therefore, the original query word needs to be transformed.

The following are two example algorithms for transforming query terms:

Algorithm one: KNN Query, i.e. a KNN (K-Nearest Neighbor) query word algorithm. K-nearest-neighbor classification uses the K nearest already-classified neighbors of an object in the training data set to guide the classification of that object. An inverted index between query words and Terms can be built from the users' search click logs for Offers, and the cosine distance between the participle vector of the original query word and the vectors of other query words is used to form the neighbor classification data set. Then the classification results of a set number of the nearest query words, such as the first ten, are combined and used as the classification result of the original query word. Finally, mapping is performed in the industry category tree according to the classification result to determine the industry category where the query word is located.

Here, Term denotes a Chinese participle; after the query word is segmented, a vector of several Terms with weights and parts of speech is obtained. The neighbor classification data set is the correspondence between Chinese participles (Terms) and the vectors of related query words that have user search click behavior. The vectors in this data set are also weighted.
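A minimal sketch of the KNN query word transformation follows. It assumes that each known query word is stored as a pair of a weighted Term vector and a category distribution, and it combines the neighbors' distributions weighted by cosine similarity; this combination rule, the function names `cosine` and `knn_classify`, and the sample vectors are assumptions made for illustration. A linear scan replaces the inverted index for brevity.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse Term-weight vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_classify(query_vec, known_queries, k=10):
    """Combine the category distributions of the k known query words whose
    Term vectors are closest to the original query word."""
    ranked = sorted(known_queries,
                    key=lambda item: cosine(query_vec, item[0]),
                    reverse=True)
    merged = Counter()
    for term_vec, category_dist in ranked[:k]:
        similarity = cosine(query_vec, term_vec)
        if similarity <= 0:
            continue                      # no shared Term: not a neighbour
        for category, weight in category_dist.items():
            merged[category] += similarity * weight
    return dict(merged)


# Hypothetical Term vectors (weights as produced by a TermWeight-like tool).
known_queries = [
    ({"bohemian": 60, "long": 40, "dress": 150}, {"one-piece dress": 1.0}),
    ({"men": 60, "jeans": 150}, {"men's jeans": 1.0}),
]
low_frequency_query = {"korean": 60, "long": 40, "dress": 150}

print(knn_classify(low_frequency_query, known_queries, k=10))
```

The low-frequency query word shares "long" and "dress" with the first known query word only, so only "one-piece dress" appears in the combined result.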

Algorithm two: word reduction transformation. For example, the natural language processing Chinese word segmentation library TermWeight is used to give each participle a weight; the participle with the minimum weight is removed, and it is judged whether the transformed query word is a high-frequency word. If yes, the classification result of the transformed query word is taken as the classification result of the original query word; otherwise, the participle with the next smallest weight is removed, and this step is repeated until only one participle remains. Finally, mapping is performed in the industry category tree according to the classification result to determine the industry category where the query word is located.

The Chinese word segmentation vectors of the query words are weighted according to the importance degree of the word segmentation, and when one query word input by a user does not have a search click behavior in the past, the word segmentation with the minimum weight is removed one by one to be combined into a new query word. This de-wording process continues until the new query term that is combined appears in the user search click (i.e., there is a corresponding category vector). And then using the category vector corresponding to the new query word as the category vector of the user input query word.
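The word reduction transformation can be sketched as a loop that repeatedly drops the lightest participle. The sketch assumes that the query word is given as an ordered list of (participle, weight) pairs and that known query words are keyed by their participle tuples; the name `reduce_query` and the sample weights are hypothetical.

```python
def reduce_query(terms, known_queries):
    """Remove the participle with the smallest weight, one at a time, until the
    remaining participles form a query word that has a category vector.

    terms: list of (participle, weight) pairs in their original order.
    known_queries: mapping from a tuple of participles to a category vector.
    Returns (reduced participles, category vector), or (None, None).
    """
    remaining = list(terms)
    while remaining:
        key = tuple(term for term, _ in remaining)
        if key in known_queries:
            return key, known_queries[key]
        if len(remaining) == 1:
            break                          # nothing left to remove
        lightest = min(range(len(remaining)), key=lambda i: remaining[i][1])
        del remaining[lightest]
    return None, None


# Hypothetical weighted participles and known (clicked) query words.
terms = [("korean", 30), ("chiffon", 60), ("long", 40), ("dress", 150)]
known_queries = {("chiffon", "dress"): {"one-piece dress": 1.0}}

print(reduce_query(terms, known_queries))
```

Here "korean" (30) is removed first and "long" (40) second, after which the remaining query word "chiffon dress" is found and its category vector is reused for the original query word.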

It should be noted that the above two transformation algorithms are only exemplary, and in practical applications, those skilled in the art may also adopt other suitable transformation algorithms according to practical situations.

Mode six: natural language processing is adopted, for example word segmentation with AliWS and attribute labeling and weighting with TermWeight, both of which are Alibaba Chinese word segmentation tools. For example, after the query word is segmented, it can be transformed by methods such as KNN and word reduction transformation into a query word with dense data, provided the algorithm effect is maintained, which improves the accuracy of query word classification. The industry category where the data-dense query word is located is then determined.

In this embodiment, the industry category where the query word is located is determined by a concrete form of mode one. That is, the word segmentation and labeling result of the query word is embodied as the one or more levels of industry categories to which the industry categories of the leaf nodes corresponding to the query word are mapped; and the set rule is embodied as judging whether, among the mapped industry categories at each level, there is an industry category whose weight exceeds a set threshold (also called a peak value). If so, that industry category is determined as the industry category where the query word is located; otherwise, the industry category is determined by other appropriate rules, such as determining whether there is an industry category whose entropy is smaller than a set entropy threshold, or whether the weight of the category with the maximum weight exceeds that of the second category by more than a set weight difference threshold, or mapping all the way up to the root industry category. The above thresholds can be set by those skilled in the art as appropriate according to actual circumstances.

The category vector corresponding to a query word is sorted by weight. The second category is the category with the second largest weight; the difference between the weight of the first category and the weight of the second category can be used to determine whether the first category is a peak category (that is, whether the difference is large enough).

With this concrete form of mode one, the step in this embodiment determines whether, among the mapped industry categories at each level, the query word has a category whose weight exceeds the peak weight; if so, that category is determined as the industry category where the query word is located; if not, the other set rules are applied, such as the rule based on the entropy threshold or the weight difference threshold, or the rule of always mapping to the root industry category, so as to determine the industry category where the query word is located.
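The peak rule and the weight-difference fallback can be sketched as follows for a single level. The function name `pick_peak_category` and the threshold values are hypothetical, and the entropy rule and the fallback to the root category are omitted for brevity.

```python
def pick_peak_category(weights, peak_threshold, gap_threshold):
    """Return the top category if its weight exceeds the peak threshold, or if
    it leads the second category by more than the gap threshold; else None."""
    if not weights:
        return None
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    top_category, top_weight = ranked[0]
    if top_weight > peak_threshold:
        return top_category
    second_weight = ranked[1][1] if len(ranked) > 1 else 0.0
    if top_weight - second_weight > gap_threshold:
        return top_category
    return None                # caller falls back to other rules or a higher level


# Hypothetical leaf-level distribution for one query word.
leaf_weights = {"one-piece dress": 0.5, "chiffon dress": 0.2, "lace dress": 0.1}
print(pick_peak_category(leaf_weights, peak_threshold=0.4, gap_threshold=0.2))
```

When the function returns None at one level, the distribution would be mapped up one level in the industry category tree and the same check repeated, ending at the root industry category.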

Step S30612: determining a multi-layer category structure according to the industry category where the query word is located, and outputting a query result.

The output of the query result may include the distribution of the query word over the multi-layer industry categories and an integer indicating down to which layer the identified industry categories can be trusted. For example, four layers of categories are returned: the industry category of the leaf node, the industry category of the second-level node, the industry category of the first-level node, and the BU (business unit).

On the basis of the query term classification, parameter adjustment can be used, for example, a parameter adjustment module is used for adjusting the influence of each characteristic of the query term classification on a classification result, and the effect of an algorithm is optimized.

The implementation of parameter adjustment is specifically described below.

Parameter adjustment requires 3 preconditions: (1) manually labeling query term classification data which is distributed similarly to a real search engine; (2) an automation program that can automatically modify the classification parameter configuration items; (3) a machine learning algorithm module that can handle linear regression problems.

Wherein,

with respect to manually labeling query term classification data, operators may provide professional classification data to make the sampling of query terms conform to the distribution of the prediction target, i.e., the true space. For example, 1000 pieces of sample data can be randomly selected by searching the historical traffic log by taking the number of searches as a weight, so as to meet the premise that the sampling of machine learning is required to accord with the real space. The category of each level to which each query term belongs is labeled and classified manually, for example, the query term "princess dress" is classified as "consumer product-clothing-dress-one-piece dress" from top to bottom.

According to the service requirements of query word classification, the parameter adjustment problem can be divided into two types. The first parameter adjustment problem is that, for the category distribution of each layer (for example, in the click log "princess dress" statistically corresponds to a weighted category distribution vector over "one-piece dress", "chiffon dress", "lace dress", etc.), the proportional weights of a plurality of linear features need to be adjusted to achieve the optimal algorithm effect (linear features are features that are pairwise uncorrelated). Such mathematical problems can be solved using linear regression. The second parameter adjustment problem is to adjust the recognition threshold CateThre of each category layer, which determines at which layer the query word is recognized as having a peak category.

With respect to the first kind of parameter adjustment problem,

the automation program can calculate a score for the current parameter setting from the query word classification results, with reference to the manually labeled training data. The score is an F mean value weighted toward accuracy (that is, the γ value in the formula below is suitably larger than 1). γ can be adjusted according to business requirements, that is, by human judgment of whether the business favors recall or accuracy; γ does not change during parameter learning and is therefore not part of the machine learning process.

Wherein, the formula of the F mean value is as follows:

$$F_{\gamma} = (1 + \gamma) \cdot \frac{\mathrm{precision}}{(\gamma^{2} \cdot \mathrm{precision}) + \mathrm{recall}}$$

the meaning of each parameter in the above F-means formula is as follows:

precision: the accuracy rate of the system for distributing the query words to correct categories and levels is represented, namely the number of classified correct samples/the total number of samples;

recall: the recall rate of sample classification by the system, namely the number of samples with a classification result (identified to a category or an industry) / the total number of samples;

γ: when gamma is 1, the accuracy and recall have the same weight on the F-means. The greater gamma, the greater the weight that accuracy impacts F-means. The system focuses on the accuracy of the classification, so that γ is a value greater than 1.

In the calculation process, for a set of feature weights output by parameter adjustment, the accuracy and recall rate of the set of feature weights under a training sample can be obtained, and finally, the set of parameters with the maximum F-mean value is reserved.

The solution of the first parameter adjustment problem using linear regression is described below.

The linear regression is defined as:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_p X_{ip} + \varepsilon_i, \quad i = 1, \dots, n$$

Given a random sample $(Y_i, X_{i1}, \dots, X_{ip})$, $i = 1, \dots, n$, a linear regression model assumes that the relationship between the regressand $Y_i$ and the regressors $X_{i1}, \dots, X_{ip}$ may be imperfect. An error term $\varepsilon_i$ (also a random variable) is added to capture any influence on $Y_i$ other than that of $X_{i1}, \dots, X_{ip}$. The multivariate linear regression model is therefore written in the form above. It is important to distinguish between the random variables and the observed values of these variables. Typically, the observed values or data (written in lowercase) consist of n values $(y_i, x_{i1}, \dots, x_{ip})$, $i = 1, \dots, n$.

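The model is determined by the p + 1 parameters $\beta_0, \dots, \beta_p$. For estimating these parameters it is useful to use the matrix notation $Y = X\beta + \varepsilon$,

where $Y$ is a column vector containing the observed values $Y_1, \dots, Y_n$, $\varepsilon$ contains the unobserved random components $\varepsilon_1, \dots, \varepsilon_n$, and $X$ is the matrix of observed values of the regressors:

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1p} \\ 1 & x_{21} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \cdots & x_{np} \end{pmatrix}$$

$X$ typically includes a constant term.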

If the columns of $X$ are linearly dependent, the parameter vector $\beta$ cannot be estimated by the least squares method unless $\beta$ is constrained, for example by requiring its elements to sum to 0.

The initial goal of regression analysis is to estimate the parameters of the model so as to achieve a best fit to the data. Among the different criteria for determining a best fit, the least squares method is very advantageous. This estimate can be expressed as:

$$\hat{\beta} = (X^{T} X)^{-1} X^{T} y$$

An automatic learning program is implemented in code. Here $X_i$ is the i-th feature, $\beta_i$ is the weight of the i-th feature, and $Y$ is the vector of F mean values (computed by the formula above) obtained when evaluating each combination of parameters.
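The least squares estimate can be illustrated with a small, dependency-free sketch that forms the normal equations and solves them by Gaussian elimination. It is a generic illustration of ordinary least squares, not the learning program of the embodiment; the function names `solve_linear_system` and `least_squares` and the sample data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]     # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("columns of X are linearly dependent")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):                        # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def least_squares(rows, y):
    """Return [beta_0, ..., beta_p] minimising the squared error, with a
    constant term prepended to every feature row."""
    X = [[1.0] + list(row) for row in rows]
    p = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 without noise.
rows = [(0, 0), (1, 0), (0, 1), (1, 1)]
y = [1, 3, 4, 6]
print(least_squares(rows, y))
```

The explicit check for a vanishing pivot corresponds to the remark above that the parameters cannot be estimated when the columns of X are linearly dependent.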

For the second kind of parameter adjustment problem,

the second parameter adjustment problem is to adjust the recognition threshold CateThre of each hierarchy category to determine that the query word specifically recognizes the peak category of the hierarchy. The identification sequence is from bottom to top, the leaf category is judged firstly, then the second level, the first level and the BU are carried out finally. For example, the query word "princess dress" obtains a leaf industry category distribution vector and a weight of [ "one-piece dress": 0.5, "chiffon dress": 0.2, "lace dress": 0.1], and a given industry category threshold value leaftree =0.4 for identifying peak leaf nodes, so that the industry category "one-piece dress" is identified, the industry category tree is found in the previous stage, and a given peak secondary category secondcatetree =0.5 is identified for identifying women's dress.

Therefore, the peak category threshold CateThre of each level has a great influence on the final effect of the algorithm. Enumeration can be used to approach the optimal effect. Empirically, CateThre lies in the range [0.3, 0.7]. Within this range, an enumeration step of 0.01 yields 40 different CateThre values. In the query term classification system of this embodiment there are 4 category levels: LeafCat, SecondCate, FirstCat and BUCate. Combining them yields 40 × 40 × 40 × 40 = 2,560,000 peak category parameter configurations. A distributed computing platform is used to evaluate all parameter configurations quickly; the combination with the highest evaluation F-mean value is selected and kept for online use.
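The enumeration can be sketched as an exhaustive search over per-level thresholds. In this sketch `toy_f` is a hypothetical stand-in for the real evaluation, the range is treated as half-open so that a step of 0.01 gives exactly 40 values, and a coarse grid is searched so the example runs quickly on one machine rather than on a distributed platform.

```python
from itertools import product

LEVELS = ("LeafCat", "SecondCate", "FirstCat", "BUCate")


def threshold_grid(low, high, step):
    """Enumerate thresholds in [low, high) using integer steps to avoid float drift."""
    count = round((high - low) / step)
    return [round(low + i * step, 10) for i in range(count)]


def best_configuration(grid, evaluate_f):
    """Score every per-level threshold combination and keep the best one."""
    best_score, best_config = float("-inf"), None
    for combo in product(grid, repeat=len(LEVELS)):
        config = dict(zip(LEVELS, combo))
        score = evaluate_f(config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score


def toy_f(config):
    """Hypothetical stand-in for the evaluation F value; peaks at 0.5 per level."""
    return 1.0 - sum(abs(t - 0.5) for t in config.values())


full_grid = threshold_grid(0.3, 0.7, 0.01)    # 40 candidate values per level
coarse_grid = threshold_grid(0.3, 0.7, 0.1)   # 4 values, cheap enough to run locally
config, score = best_configuration(coarse_grid, toy_f)
```

Each configuration is scored independently, which is why the full 2,560,000-point search distributes naturally across workers.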

According to the parameter adjustment described above, a set of algorithm parameter configurations optimized in the sample space can be obtained through multiple iterations.

Thus, when the classification model has multiple parameters, adjusting their values brings the model closer to the actual distribution of the data and achieves the best effect of the algorithm.

Step S308: displaying the classification result of the query term by using a set industrialized search heading or an industrialized search result display template.

When the classification result of the query word is displayed by using the industrialized search heading, the search page, after obtaining the industry category information of the query word, calls the search heading template of the corresponding industry and precisely selects the navigation category on the menu bar. For example, after a user searches for a dress, the search page jumps to the "dress" market search page, "women's clothes" is selected, the leaf categories under "women's clothes" are expanded, and the search result shows only the commodity information under the "clothes-women's clothes" category.

Because the heading appears in a key position on the page and filters the search results, the industrialized identification must be highly accurate to avoid harming the user experience.

When the classification result of the query word is displayed by using the industrialized search result display template, the display mode of the search result list can be controlled, because users generally focus on different things in the search results of different industries. For example, for search results in clothing, users are more inclined to browse pictures, so displaying the search results in large-picture mode gives a better search experience. Users in the industrial products industry pay more attention to attributes, price and place of origin, so the relevant attributes, price and place of origin of each product are displayed.

Through this embodiment, search industrialization is realized by using query word classification together with the industry category tree and the category industry mapping in electronic commerce; among the multiple category levels corresponding to a query word, Chinese word segmentation labeling and the parameter adjusting module are used to determine which category level the query word specifically belongs to; and the industry information is applied to the industrialized search heading, the industrialized search result display mode, and the like. This embodiment also provides a standard that specifies the format of the category tree and the industry mapping, so that it can be conveniently extended to other category trees. To meet the different query word classification requirements of various category trees and different services, a parameter adjusting module is provided that linearly allocates the weights of multiple competing features, so that the machine automatically learns a set of parameters suited to each requirement and the algorithm code is highly reusable.

Example four

Referring to fig. 7, a block diagram of a query term classification apparatus according to a fourth embodiment of the present invention is shown.

The query term classification apparatus of this embodiment includes: a first generating module 402, configured to obtain an input query term, and generate a leaf industry category distribution vector corresponding to the query term, where the leaf industry category distribution vector includes industry categories of at least one leaf node corresponding to the query term, each industry category of each level of nodes corresponds to a category word vector, and each category word vector includes multiple category words; a determining module 404, configured to determine, according to the category word vector of the industry category of each leaf node corresponding to the query word, an industry category of the leaf node to which the query word belongs; and the first classification module 406 is configured to determine, according to the industry category of the determined leaf node, the industry category where the query word is located according to a preset mapping relationship between industry categories at different levels.

Preferably, each category word included in each category word vector is provided with a weight. The query term classification apparatus of this embodiment further includes a second classification module 408, configured to judge, after the first classification module 406 has determined the industry category where the query term is located according to the determined industry category of the leaf node and the preset mapping relationship between industry categories at different levels, whether that industry category meets a set standard; if yes, the query word is classified into the determined industry category; if not, the industry category where the query word is located is determined by using a first identification rule and/or a second identification rule, wherein the first identification rule is used for determining the industry category where the query word is located according to the industry categories of the leaf nodes in the leaf industry category distribution vector other than the determined industry category of the leaf node, and the second identification rule is used for determining the industry category where the query word is located according to the weight of the query word in each industry category.

Preferably, when determining the industry category where the query term is located by using the first identification rule and/or the second identification rule, the second classification module 408: determines the industry category where the query word is located according to the query word and the category word vectors of the industry categories of the leaf nodes in the leaf industry category distribution vector other than the determined industry category of the leaf node; and/or determines, step by step from the industry category of the root node to the industry category of the leaf node and according to the category word vector corresponding to each industry category, whether the weight of the query word in each industry category is greater than or equal to a first set threshold, and determines the industry category whose weight is greater than or equal to the first set threshold as the industry category where the query word is located.

Preferably, the determining module 404 includes: a generating module 4042, configured to pre-process the query term to generate at least one query keyword; a matching module 4044, configured to match category word vectors of industry categories for each leaf node using at least one query keyword; the weight module 4046 is configured to determine, according to the weight of the category word vector of the industry category of each leaf node successfully matched, a weight of the query word corresponding to the industry category of each leaf node; the leaf node determining module 4048 is configured to determine the industry category of the leaf node with the highest weight as the industry category of the leaf node to which the query term belongs.
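The four sub-modules of the determining module can be sketched as a short pipeline. Whitespace splitting stands in for the Chinese word segmentation used in practice, summing matched category word weights is one plausible reading of how the per-category weight is formed, and the `VECTORS` table is made up for illustration.

```python
def preprocess(query):
    """Split the query into lowercase keywords (whitespace split is an assumption)."""
    return query.lower().split()


def leaf_weights(keywords, category_word_vectors):
    """Sum the weights of matched category words for each leaf category."""
    weights = {}
    for category, word_weights in category_word_vectors.items():
        matched = [word_weights[k] for k in keywords if k in word_weights]
        if matched:
            weights[category] = sum(matched)
    return weights


def best_leaf(query, category_word_vectors):
    """Return the leaf category with the highest weight, or None if nothing matched."""
    weights = leaf_weights(preprocess(query), category_word_vectors)
    return max(weights, key=weights.get) if weights else None


# Hypothetical category word vectors: leaf category -> {category word: weight}.
VECTORS = {
    "one-piece dress": {"dress": 0.6, "princess": 0.3},
    "chiffon dress": {"dress": 0.4, "chiffon": 0.5},
    "mobile phone": {"phone": 0.9},
}
```

For instance, `best_leaf("Princess dress", VECTORS)` matches both "princess" and "dress" in the "one-piece dress" vector, so that leaf category wins over "chiffon dress", which matches only "dress".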

Preferably, the first classification module 406 includes: an upper-level category module 4062, configured to determine, according to the preset mapping relationship between industry categories at different levels, the upper-level industry category corresponding to the industry category of the leaf node; an upper-level weight module 4064, configured to acquire the weight of the query word for the upper-level industry category corresponding to the industry category of the leaf node; a judging module 4066, configured to judge whether the weight of the upper-level industry category corresponding to the query term exceeds a second set threshold; and an executing module 4068, configured to, if the judgment result of the judging module 4066 is yes, determine the upper-level industry category whose weight exceeds the second set threshold as the industry category where the query word is located, and, if the judgment result of the judging module 4066 is no, continue searching upward until an industry category whose weight exceeds the second set threshold is found or until the industry category of the root node is reached.
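The upward search performed by modules 4062 to 4068 can be sketched as a loop over a child-to-parent mapping. The tree fragment, the weights and the function name are hypothetical; the upper-level weights are taken as given rather than computed.

```python
def classify_upward(leaf, parent, weights, threshold):
    """Walk from the leaf's parent toward the root until a weight exceeds the threshold."""
    current = parent.get(leaf)
    if current is None:
        return leaf  # the leaf has no upper-level category in this fragment
    while True:
        if weights.get(current, 0.0) > threshold:
            return current
        upper = parent.get(current)
        if upper is None:
            return current  # reached the root without passing the threshold
        current = upper


# Hypothetical tree fragment and query weights per upper-level category.
PARENT = {
    "one-piece dress": "women's dress",
    "women's dress": "clothing",
    "clothing": "consumer goods BU",
}
WEIGHTS = {"women's dress": 0.45, "clothing": 0.7, "consumer goods BU": 0.9}
```

With a second set threshold of 0.5, "women's dress" (0.45) is skipped and "clothing" (0.7) is returned; if no level exceeds the threshold, the search stops at the root.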

Preferably, the query term classification device of this embodiment further includes: the correspondence generating module 410 is configured to generate a correspondence between the query term and the industry category of the leaf node, and includes: acquiring search click information of a user, wherein the search click information comprises: query word information input by a user and commodity information clicked by the user according to a query result; and acquiring the corresponding relation between the query word and the industry class of the leaf node according to the search click information and the corresponding relation between the commodity and the class of the commodity.

Preferably, the correspondence generating module 410, when obtaining the search click information of the user: obtaining a user search click log, wherein the user search click log comprises search click information of a plurality of users, and the search click information comprises: query word information input by a user, commodity information clicked by the user according to a query result, and category information to which commodities belong; the corresponding relationship generating module 410, when acquiring the corresponding relationship between the query word and the industry category of the leaf node according to the search click information and the corresponding relationship between the commodity and the category to which the commodity belongs: carrying out statistical analysis on search click information in a user search click log to obtain a corresponding relation between query word information and commodity information and a corresponding relation between the commodity information and the category information to which the commodity belongs; and generating the corresponding relation between the query word and the industry category of the leaf node according to the corresponding relation between the query word information and the commodity information and the corresponding relation between the commodity information and the category information to which the commodity belongs.
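Deriving the query-to-leaf-category correspondence from a click log can be sketched as a count followed by a join. The log records, the item-to-category table and the normalisation of click counts into weights are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def query_leaf_distribution(click_log, item_category):
    """Turn (query, clicked item) records into per-query leaf category weights."""
    counts = defaultdict(Counter)
    for query, item in click_log:
        category = item_category.get(item)
        if category is not None:  # skip items with no known category
            counts[query][category] += 1
    distribution = {}
    for query, per_category in counts.items():
        total = sum(per_category.values())
        # Normalising click counts into weights is an illustrative assumption.
        distribution[query] = {c: n / total for c, n in per_category.items()}
    return distribution


# Hypothetical click log and item -> leaf category table.
CLICK_LOG = [
    ("princess dress", "item1"),
    ("princess dress", "item2"),
    ("princess dress", "item3"),
    ("princess dress", "item4"),
    ("phone case", "item5"),
]
ITEM_CATEGORY = {
    "item1": "one-piece dress",
    "item2": "one-piece dress",
    "item3": "one-piece dress",
    "item4": "chiffon dress",
    "item5": "phone accessories",
}
DISTRIBUTION = query_leaf_distribution(CLICK_LOG, ITEM_CATEGORY)
```

Here three of the four clicks for "princess dress" land on items in "one-piece dress", so that category receives weight 0.75 and "chiffon dress" receives 0.25.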

Preferably, when the corresponding relationship generating module 410 acquires the corresponding relationship between the query word and the industry category of the leaf node according to the search click information and the corresponding relationship between the commodity and the category to which the commodity belongs: acquiring commodity information and the category information to which the commodity belongs from the commodity information title, and determining the corresponding relation between the commodity information and the category information to which the commodity belongs; and acquiring the corresponding relation between the query word and the industry category of the leaf node according to the search click information and the corresponding relation between the commodity information and the category information to which the commodity belongs.

Preferably, the first generating module 402 is configured to obtain an input query term, and determine, according to a corresponding relationship between the query term and an industry category of a leaf node, an industry category of at least one leaf node that corresponds to the query term and meets a set rule; and generating a leaf industry category distribution vector corresponding to the query word by using the determined industry categories of the leaf nodes.
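One possible form of the "set rule" is a minimum weight combined with a cap on the number of categories. Both the rule and its default values are hypothetical; this passage of the specification does not fix them.

```python
def leaf_distribution_vector(category_weights, min_weight=0.05, top_k=3):
    """Keep at most top_k categories whose weight is at least min_weight."""
    kept = [(c, w) for c, w in category_weights.items() if w >= min_weight]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Hypothetical leaf category weights for one query word.
CANDIDATES = {
    "one-piece dress": 0.5,
    "chiffon dress": 0.2,
    "lace dress": 0.1,
    "shirt": 0.08,
    "skirt": 0.03,
}
VECTOR = leaf_distribution_vector(CANDIDATES)
```

The result is an ordered list of (category, weight) pairs, with the strongest leaf category first.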

Preferably, the query term classification device of this embodiment further includes: an adjusting module 412, configured to compare the matching results between the query word and the category words in each category word vector with manually labeled training data, and to adjust the weights of the category words in each category word vector accordingly.

Preferably, the second classification module 408, when determining whether the industry category where the query word is located meets the set standard: and judging whether the recall number of the industry category where the query word is located meets a third set threshold value and/or judging whether the click number of the industry category where the query word is located meets a fourth set threshold value.
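The "and/or" check against the set standard can be sketched as below. Reading "meets" as "is at least" is an assumption, and the parameter names are hypothetical.

```python
def meets_set_standard(recall_count, click_count, min_recall=None, min_clicks=None):
    """Check whichever of the recall and click criteria are configured."""
    checks = []
    if min_recall is not None:
        checks.append(recall_count >= min_recall)
    if min_clicks is not None:
        checks.append(click_count >= min_clicks)
    # With no criterion configured there is nothing to fail, so the result is True.
    return all(checks)
```

Passing only `min_recall` applies the recall criterion alone, while passing both requires the category to satisfy the recall and the click criteria together.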

The query term classification device of this embodiment is used to implement the corresponding query term classification method in the foregoing multiple method embodiments, and has the beneficial effects of the corresponding method embodiments, which are not described herein again.

Example five

Referring to fig. 8, a schematic diagram of a system architecture for performing industrial recognition by using a query term classification device according to a fifth embodiment of the present invention is shown.

The system architecture for performing industrial recognition by using the query term classification device in this embodiment mainly includes: a web page search front end SearchWeb, a query word analysis platform QueryProcess, and an industrialized web page module.

The query word analysis platform QueryProcess is provided with an industrialized identification module (having all the functions of the query term classification device in the fourth embodiment), so that industrialized identification is available as a callable service module on the platform. The industrialized web page provided by the industrialized web page module (comprising an industrialized search heading and/or a set industrialized search result display template) differs from a traditional electronic commerce web page: the search page is displayed in a more specialized way and can respond quickly to industry demands.

For a single user search, SearchWeb first obtains the query word information and then calls the industrialized identification module of the QueryProcess platform. The industrialized identification module calculates, from the query word classification data, the industry categories and industries related to the query word, that is, the industry category to which the query word belongs, and returns it to SearchWeb. SearchWeb then takes the industry data of the query word as search parameters and calls the search engine and the industrialized web page module. The search engine and the industrialized web page module can use these industry parameters to complete a series of applications according to downstream requirements.

For example, the industrialized web page may provide an industrialized search heading page, as shown in fig. 9. In this case, after SearchWeb obtains the industry information, it calls the search heading template of the corresponding industry and precisely selects the navigation category on the menu bar. For example, in fig. 9, after the user searches for a dress, SearchWeb jumps to the "dress" market search page. In addition, "women's clothes" is selected, the leaf categories under it are expanded, and only the commodity information under the "clothes-women" category appears in the search result. Because the heading appears in a key position on the page and filters the search results, the industrialized identification must be highly accurate to avoid harming the user experience.

For another example, the industrialized web page may control the way in which the search result list is displayed. Users generally focus on different things in the search results of different industries. For example, for search results of clothing goods, users are more inclined to browse pictures, so displaying the search results in large-picture mode gives a better search experience. Users in the industrial products industry pay more attention to attributes, price and place of origin, so displaying the relevant attribute, price and origin information of the commodities gives a better search experience.

By the embodiment, the industrial identification of the query word input by the user is realized, the industry which the user needs to query can be determined according to the query word, and the information corresponding to the query word in the industry is provided for the user to perform accurate in-industry query, so that the search experience of the user is improved; meanwhile, the method can also effectively promote the development of industry market pages.

Through the scheme of the invention, the identified industry information can be provided to the web page front end and to the search engine ranking end for each user search, so as to determine the display mode of the commodity information list page and the top navigation of the search page, and to serve as a feature for search engine ranking. By classifying the query words that the user enters in the search box, the user is guided to the corresponding industry market. The scheme provided by the embodiments of the invention designs a structure for hierarchically classifying query terms based on the click graph data of the search log, together with an algorithm that computes the corresponding leaf industry category distribution vector for every query term from the click data and then assigns the query term to the correct industry market by using the industry category structure data of the e-commerce website. In addition, the invention provides a standard that specifies the format of the category tree and the industry mapping, so that it can be conveniently extended to other category trees. To meet the different query word classification requirements of various category trees and different services, a parameter adjusting module is provided that linearly allocates the weights of multiple competing features, so that the machine automatically learns a set of parameters suited to each requirement and the algorithm code is highly reusable.

By the above method, search industrialization is realized by using query word classification together with the category tree and the category industry mapping in electronic commerce; among the multiple category levels corresponding to a query word, Chinese word segmentation labeling and the parameter adjusting module are used to determine which category level the query word specifically belongs to; and the industry information is applied to the industrialized search heading, the industrialized search result display mode, and the like. The invention is not limited to this: those skilled in the art may also implement the solution of the invention in a corresponding equivalent or alternative manner according to the description in the embodiments of the invention, for example by using other query term classification methods with lower accuracy; by determining the display mode of each leaf industry category in applications such as industry pages without using the industry category tree data and the category industry mapping; by using the categories returned by search category navigation; and the like.

Query term classification (Query Classification), the algorithmic basis of the system, is a computing term that denotes the process of mapping a query term to the categories it is identified with (in this system, e-commerce categories). The assignment to each category also carries a weight, which can be understood as a probability.

The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the apparatus and system embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference may be made to some descriptions of the method embodiments for relevant points.

Embodiments of the invention are operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

Embodiments of the invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Embodiments of the invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

In a typical configuration, the computer device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory. The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

The query term classification method and device provided by the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method of the invention and its core idea. Meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims

1. A query term classification method is characterized by comprising the following steps:

acquiring an input query word, and generating a leaf industry category distribution vector corresponding to the query word, wherein the leaf industry category distribution vector comprises industry categories of at least one leaf node corresponding to the query word, each industry category of each level of nodes corresponds to a category word vector, and each category word vector comprises a plurality of category words;

determining the industry category of the leaf node to which the query word belongs according to the industry category word vector of each leaf node corresponding to the query word;

and determining the industry category where the query word is located according to the determined industry category of the leaf node and the preset mapping relation of various levels of industry categories.

2. The method according to claim 1, wherein each category word included in each category word vector is provided with a weight;

after the step of determining the industry category where the query word is located according to the determined industry category of the leaf node and according to the preset mapping relationship of various levels of industry categories, the method further comprises the following steps:

judging whether the industry category where the query word is located meets a set standard or not;

if yes, classifying the query word into the determined industry category; if not, determining the industry category where the query word is located by using a first identification rule and/or a second identification rule, wherein the first identification rule is used for determining the industry category where the query word is located according to the industry categories of the leaf nodes in the leaf industry category distribution vector other than the determined industry category of the leaf node; and the second identification rule is used for determining the industry category where the query word is located according to the weight of the query word in each industry category.

3. The method according to claim 2, wherein the step of determining the industry category where the query word is located by using the first identification rule and/or the second identification rule comprises:

determining the industry category where the query word is located according to the query word and the category word vectors of the industry categories of other leaf nodes except the industry category of the leaf node in the leaf industry category distribution vector;

and/or,

determining, step by step from the industry category of the root node to the industry category of the leaf node and according to the category word vector corresponding to each industry category, whether the weight of the query word in each industry category is greater than or equal to a first set threshold value, and determining the industry category whose weight is greater than or equal to the first set threshold value as the industry category where the query word is located.

4. The method according to claim 2 or 3, wherein the step of determining the industry category of the leaf node to which the query term belongs according to the industry category term vector of each leaf node corresponding to the query term comprises:

preprocessing the query word to generate at least one query keyword;

matching category word vectors for industry categories for the respective leaf nodes using the at least one query keyword;

determining the weight of the industry category of each leaf node corresponding to the query word according to the successfully matched weight of the category word vector of the industry category of each leaf node;

and determining the industry category of the leaf node with the highest weight as the industry category of the leaf node to which the query word belongs.

5. The method according to claim 4, wherein the step of determining the industry category where the query term is located according to the determined industry category of the leaf node and according to a preset mapping relationship of various levels of industry categories comprises:

determining the upper-level industry category corresponding to the industry category of the leaf node according to the preset mapping relation of the industry categories of all levels;

acquiring the weight of the query word corresponding to the upper-level industry category corresponding to the industry category of the leaf node;

judging whether the weight of the upper-level industry category corresponding to the query word exceeds a second set threshold value;

if yes, determining the upper-level industry category with the weight exceeding the second set threshold as the industry category where the query word is located; if not, continuing to search upwards until the industry category with the weight exceeding the second set threshold value is searched, or until the industry category of the root node is searched.

6. The method of claim 1, wherein the correspondence between the query term and the industry class of the leaf node is generated by:

acquiring search click information of a user, wherein the search click information comprises: query word information input by a user and commodity information clicked by the user according to a query result;

and acquiring the corresponding relation between the query word and the industry category of the leaf node according to the search click information and the corresponding relation between the commodity and the category to which the commodity belongs.

7. The method of claim 6,

the step of acquiring the search click information of the user includes: obtaining a user search click log, wherein the user search click log comprises search click information of a plurality of users, and the search click information comprises: query word information input by a user, commodity information clicked by the user according to a query result, and category information to which commodities belong;

the step of obtaining the corresponding relation between the query word and the industry category of the leaf node according to the search click information and the corresponding relation between the commodity and the category to which the commodity belongs comprises the following steps: performing statistical analysis on search click information in the user search click log to obtain a corresponding relation between the query word information and the commodity information and a corresponding relation between the commodity information and the category information to which the commodity belongs; and generating the corresponding relation between the query word and the industry category of the leaf node according to the corresponding relation between the query word information and the commodity information and the corresponding relation between the commodity information and the category information to which the commodity belongs.

8. The method according to claim 6, wherein the step of obtaining the correspondence between the query term and the industry category of the leaf node according to the search click information and the correspondence between the commodity and the category to which the commodity belongs comprises:

acquiring commodity information and the category information to which the commodity belongs from a commodity information title, and determining the corresponding relation between the commodity information and the category information to which the commodity belongs; and acquiring the corresponding relation between the query word and the industry category of the leaf node according to the search click information and the corresponding relation between the commodity information and the category information to which the commodity belongs.

9. The method according to any one of claims 6 to 8, wherein the step of obtaining the input query term and generating the leaf industry category distribution vector corresponding to the query term comprises:

acquiring an input query word, and determining the industry category of at least one leaf node corresponding to the query word and meeting a set rule according to the corresponding relation between the query word and the industry category of the leaf node;

and generating a leaf industry category distribution vector corresponding to the query term by using the determined industry categories of the leaf nodes.
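As a hedged illustration of claim 9, the sketch below builds a leaf industry category distribution vector from per-category click counts. The claims do not specify the "set rule"; the minimum click share (`min_share`) and the cap on the number of categories (`top_k`) used here are assumptions, as is the function name.

```python
def leaf_category_distribution(category_clicks, min_share=0.1, top_k=5):
    """Return (leaf category, click share) pairs that satisfy the assumed rule.

    The assumed rule keeps categories whose share of all clicks is at least
    min_share, ordered by share, and keeps at most top_k of them.
    """
    total = sum(category_clicks.values())
    if total == 0:
        return []
    shares = [(cat, n / total) for cat, n in category_clicks.items()]
    kept = [(cat, share) for cat, share in shares if share >= min_share]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]

vector = leaf_category_distribution({"MP3": 80, "Earphones": 15, "Toys": 5})
```

With these sample counts, "Toys" falls below the assumed 10% share and is left out of the vector.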

10. The method of claim 2, further comprising:

comparing the result of matching the query word against the category words in each category word vector with manually labeled training data, and adjusting the weights of the category words in each category word vector accordingly.

11. The method according to claim 2, wherein the step of determining whether the industry category where the query word is located meets a set standard comprises:

judging whether the recall number of the industry category where the query word is located meets a third set threshold, and/or judging whether the click number of the industry category where the query word is located meets a fourth set threshold.
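The check in claim 11 can be sketched as follows. The threshold values, the reading of "meets" as "greater than or equal to", and the `require_both` switch that models the "and/or" wording are all assumptions for illustration.

```python
def meets_set_standard(recall_count, click_count,
                       recall_threshold=100, click_threshold=10,
                       require_both=True):
    """Check the recall number and click number against their thresholds.

    recall_threshold plays the role of the third set threshold and
    click_threshold the role of the fourth; both values are hypothetical.
    """
    recall_ok = recall_count >= recall_threshold
    click_ok = click_count >= click_threshold
    if require_both:
        return recall_ok and click_ok
    return recall_ok or click_ok
```

For example, `meets_set_standard(150, 3)` fails under the "and" reading but passes when `require_both=False`.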

12. A query term classification device, comprising:

a first generation module, configured to acquire an input query word and generate a leaf industry category distribution vector corresponding to the query word, wherein the leaf industry category distribution vector comprises the industry categories of at least one leaf node corresponding to the query word, each industry category of the nodes at each level corresponds to a category word vector, and each category word vector comprises a plurality of category words;

a determining module, configured to determine the industry category of the leaf node to which the query word belongs according to the category word vectors of the industry categories of the leaf nodes corresponding to the query word;

and a first classification module, configured to determine the industry category where the query word is located according to the determined industry category of the leaf node and a preset mapping relation among the industry categories of all levels.

13. The apparatus according to claim 12, wherein each category word included in each category word vector is provided with a weight;

the apparatus further comprises: a second classification module, configured to judge whether the industry category where the query word is located meets a set standard after the first classification module determines the industry category where the query word is located according to the determined industry category of the leaf node and the preset mapping relation among the industry categories of all levels; if so, classify the query word into the determined industry category; if not, determine the industry category where the query word is located by using a first identification rule and/or a second identification rule, wherein the first identification rule determines the industry category where the query word is located according to the industry categories of the leaf nodes in the leaf industry category distribution vector other than the determined leaf node; and the second identification rule determines the industry category where the query word is located according to the weight of the query word in each industry category.

14. The apparatus of claim 13, wherein the second classification module, when determining the industry category where the query word is located using the first identification rule and/or the second identification rule:

and/or,

15. The apparatus of claim 13 or 14, wherein the determining module comprises:

the generating module is used for preprocessing the query word to generate at least one query keyword;

a matching module for matching category word vectors of the industry categories of each leaf node using the at least one query keyword;

the weight module is used for determining the weight of the industry category of each leaf node corresponding to the query word according to the weights of the successfully matched category words in the category word vector of the industry category of each leaf node;

and the leaf node determining module is used for determining the industry category of the leaf node with the highest weight as the industry category of the leaf node to which the query word belongs.
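The four modules of claim 15 can be read as a small pipeline, sketched below. The whitespace tokenizer, the sample category word vectors, and the choice to sum the weights of matched category words are assumptions; the claims only require that the category weight be derived from the weights of the successfully matched category words.

```python
def preprocess(query):
    """Assumed preprocessing: lowercase the query and split on whitespace."""
    return query.lower().split()

# Hypothetical category word vectors: leaf category -> {category word: weight}.
category_word_vectors = {
    "MP3": {"mp3": 0.9, "player": 0.4},
    "Earphones": {"earphone": 0.9, "player": 0.1},
}

def score_leaf_categories(query, candidates, word_vectors):
    """Sum the weights of the category words matched by the query keywords."""
    keywords = preprocess(query)
    scores = {}
    for category in candidates:
        words = word_vectors.get(category, {})
        scores[category] = sum(words[k] for k in keywords if k in words)
    return scores

def pick_leaf_category(query, candidates, word_vectors):
    """Return the candidate leaf category with the highest weight, if any."""
    scores = score_leaf_categories(query, candidates, word_vectors)
    return max(scores, key=scores.get) if scores else None

best = pick_leaf_category("MP3 player", ["MP3", "Earphones"], category_word_vectors)
```

In this sample, both keywords match the "MP3" vector while only "player" matches "Earphones", so "MP3" receives the higher weight.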

16. The apparatus of claim 15, wherein the first classification module comprises:

the upper-level category module is used for determining the upper-level industry categories corresponding to the industry categories of the leaf nodes according to the preset mapping relation of the industry categories of each level;

the upper-level weight module is used for acquiring the weight of the query word corresponding to the upper-level industry category corresponding to the industry category of the leaf node;

the judging module is used for judging whether the weight of the query word in the upper-level industry category exceeds a second set threshold;

the execution module is used for determining the upper-level industry category with the weight exceeding the second set threshold as the industry category where the query word is located if the judgment result of the judgment module is yes; if the judgment result of the judgment module is negative, continuing to search upwards until the industry category with the weight exceeding the second set threshold value is searched, or until the industry category of the root node is searched.
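The upward search described in claim 16 can be sketched as a walk over a parent mapping. The sample hierarchy, the weights, the `"ROOT"` sentinel, and the function name are hypothetical; the sketch also assumes the category hierarchy is a tree with no cycles.

```python
# Hypothetical preset mapping: industry category -> upper-level industry category.
parent_of = {
    "MP3": "Digital Audio",
    "Digital Audio": "Consumer Electronics",
    "Consumer Electronics": "ROOT",
}

def resolve_industry_category(leaf, weight_of, parent_of, threshold, root="ROOT"):
    """Walk upward from the leaf until a category's weight exceeds the threshold.

    weight_of maps an industry category to the query word's weight in it.
    Returns the root category if no upper-level category qualifies.
    """
    current = parent_of.get(leaf, root)
    while current != root:
        if weight_of.get(current, 0.0) > threshold:
            return current
        current = parent_of.get(current, root)
    return root

weights = {"Digital Audio": 0.3, "Consumer Electronics": 0.7}
category = resolve_industry_category("MP3", weights, parent_of, threshold=0.5)
```

With a threshold of 0.5, "Digital Audio" is skipped and the search stops at "Consumer Electronics"; raising the threshold above 0.7 makes the search end at the root.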

17. The apparatus of claim 12, further comprising:

a correspondence generating module, configured to generate a correspondence between the query word and the industry category of the leaf node, wherein the correspondence generating module is configured to: acquire search click information of a user, wherein the search click information comprises: query word information input by the user and commodity information clicked by the user according to a query result; and acquire the correspondence between the query word and the industry category of the leaf node according to the search click information and the correspondence between the commodity and the category to which the commodity belongs.

18. The apparatus of claim 17, wherein:

the correspondence generating module, when acquiring the search click information of the user, is configured to: obtain a user search click log, wherein the user search click log comprises search click information of a plurality of users, and the search click information comprises: query word information input by a user, commodity information clicked by the user according to a query result, and category information to which commodities belong; and the correspondence generating module, when acquiring the correspondence between the query word and the industry category of the leaf node according to the search click information and the correspondence between the commodity and the category to which the commodity belongs, is configured to: perform statistical analysis on the search click information in the user search click log to obtain a correspondence between the query word information and the commodity information and a correspondence between the commodity information and the category information to which the commodity belongs; and generate the correspondence between the query word and the industry category of the leaf node according to the correspondence between the query word information and the commodity information and the correspondence between the commodity information and the category information to which the commodity belongs.

19. The apparatus according to claim 17, wherein the correspondence generation module, when acquiring the correspondence between the query term and the industry category of the leaf node according to the search click information and the correspondence between the commodity and the category to which the commodity belongs:

20. The apparatus of claim 13, wherein the second classification module, when judging whether the industry category where the query word is located meets the set standard, is configured to: judge whether the recall number of the industry category where the query word is located meets a third set threshold, and/or judge whether the click number of the industry category where the query word is located meets a fourth set threshold.