CN104050240A - Method and device for determining categorical attribute of search query word - Google Patents

Method and device for determining categorical attribute of search query word

Info

Publication number
CN104050240A
CN104050240A CN201410225991.3A
Authority
CN
China
Prior art keywords
query word
search query
classification
data
disaggregated model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410225991.3A
Other languages
Chinese (zh)
Inventor
刘鎏
苏晓东
常富洋
王安滨
秦吉胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201410225991.3A priority Critical patent/CN104050240A/en
Publication of CN104050240A publication Critical patent/CN104050240A/en
Priority to PCT/CN2015/079800 priority patent/WO2015180622A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for determining the category attribute of a search query word. The method comprises the following steps: performing feature extraction on an input search query word to obtain a corresponding feature vector; obtaining the category preference probability of the search query word from a query word classifier according to the feature vector; and analyzing the category preference probability of the search query word to determine its category attribute. According to the technical scheme of the invention, the category preference probability of the search query word is obtained by extracting the feature vector of the search query word and inputting the feature vector into the query word classifier, and the category attribute of the search query word is determined by analyzing the category preference probability, which provides a basis for search and helps ensure search accuracy. Moreover, the result can serve as a basic feature for subsequent tasks such as search result ranking.

Description

Method and apparatus for determining the category attribute of a search query word
Technical field
The present invention relates to the technical field of computer networks, and in particular to a method and apparatus for determining the category attribute of a search query word.
Background art
In a complete search, after receiving the search query word entered by a user, a search engine typically goes through processes such as preprocessing the query word, understanding the query word, retrieving documents, ranking, and presenting results, and the whole process must complete within milliseconds. The category of the search query word is very important for understanding the query word: it not only reflects the user's current interests and intent and provides a basis for the current retrieval, but can also serve as a basic feature for downstream search engine ranking models, advertisement CTR prediction models, and natural language models.
The support vector machine (SVM), a supervised learning model in the field of machine learning, was proposed by Vapnik et al. in 1995. The most basic SVM is a binary classification model whose learning strategy is margin maximization. For linearly separable data, a hard-margin SVM is learned by hard margin maximization; for approximately linearly separable data, a soft-margin SVM is learned by soft margin maximization; for data that are not linearly separable at all, the data are mapped to a higher-dimensional space and a soft-margin SVM is learned in that space. In this process the "kernel method" is used to compute the higher-dimensional inner products of the input space implicitly, which is equivalent to learning a soft-margin SVM in the higher-dimensional space.
Liblinear is a linear SVM software package developed by the research team of Professor Chih-Jen Lin at National Taiwan University; it mainly implements linear multiclass classification and linear regression. Liblinear is designed for large-scale machine learning applications: it does not introduce the kernel method but assumes that the data are linearly or approximately linearly separable and trains a linear classifier directly. After years of development, liblinear is widely used in industry for solving large-scale classification and regression problems; it is far superior to kernel SVMs in training and prediction performance, and its accuracy is also satisfactory. From a probabilistic point of view, practical text-processing projects in industry usually adopt a Boolean vector model whose feature count ranges from at least several hundred thousand to over a hundred million, while the available training data cover only a small fraction of the feature space, so the probability that the data are linearly inseparable is small.
It can be seen that the accuracy of existing search query word classifiers still leaves much room for improvement.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a search query word classification method and device that overcome, or at least partially solve, the problems described above.
According to one aspect of the present invention, a method for determining the category attribute of a search query word is provided, the method comprising:
performing feature extraction on an input search query word to obtain a corresponding feature vector;
obtaining the category preference probability of the search query word from a query word classifier according to the feature vector;
analyzing the category preference probability of the search query word to determine the category attribute of the search query word.
Optionally, the method further comprises the following steps for obtaining the classification model of the query word classifier:
obtaining labeled data annotated with categories;
for each category, sampling from its labeled data an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
training on the sampled data of all categories to obtain the classification model.
Optionally, training on the sampled data of all categories to obtain the classification model comprises:
during training, setting different penalty factors for different categories, where a larger penalty factor is set for a category with fewer data items after sampling.
Optionally, the classification model is an m*n matrix, where m is the number of categories and n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i;
training on the sampled data of all categories to obtain the classification model further comprises:
reducing the storage space occupied by the classification model by reducing the precision of the category weights in the model; and/or training the model with L1 regularization to reduce the storage space occupied by the classification model.
Optionally, obtaining labeled data annotated with categories comprises:
obtaining manually labeled data;
and/or,
labeling website links with categories in advance; according to the website links that users click after searching for specified query words, establishing a correspondence between each specified search query word and the categories of the clicked website links to obtain labeled data.
Optionally, training on the sampled data of all categories to obtain the classification model comprises: training on the sampled data of all categories with Liblinear to obtain the classification model;
performing feature extraction on the input search query word to obtain a corresponding feature vector comprises: performing word segmentation on the input search query word and constructing a feature vector in libsvm format from the segmentation result.
Optionally, before performing feature extraction on the input search query word, the method further comprises: preprocessing the input search query word;
the preprocessing comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
Optionally, the method further comprises: presetting a cache in which a number of search query words and their corresponding category preference probabilities are stored;
before performing feature extraction on the input search query word, the method further comprises: querying the cache with the input search query word; if the cache is hit, directly outputting the category preference probability of the input search query word; if the cache is not hit, executing the step of performing feature extraction on the input search query word and the subsequent steps.
Optionally, presetting a cache in which a number of search query words and their corresponding category preference probabilities are stored comprises:
from the search query words whose category preference probabilities have been determined, finding the predetermined number of search query words queried most frequently, and saving those search query words together with their corresponding category preference probabilities into the cache;
or,
presetting a separate cache for each content delivery network (CDN); for the cache of each CDN, from the search query words whose category preference probabilities have been determined and that were submitted through that CDN, finding the predetermined number of search query words queried most frequently, and saving those search query words together with their corresponding category preference probabilities into the cache of that CDN.
Optionally, the method further comprises:
ranking search results according to the determined category attribute of the search query word.
According to another aspect of the present invention, a device for determining the category attribute of a search query word is provided, the device comprising:
a feature extraction unit, adapted to perform feature extraction on an input search query word to obtain a corresponding feature vector;
a classifier, adapted to obtain the category preference probability of the search query word according to the feature vector and send it to an output unit;
an output unit, adapted to analyze the category preference probability of the search query word, determine the category attribute of the search query word, and output it.
Optionally, the device further comprises:
a labeled data acquisition unit, adapted to obtain labeled data annotated with categories;
a sampling unit, adapted to sample from the labeled data of each category an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
a training unit, adapted to train on the sampled data of all categories to obtain the classification model of the classifier.
Optionally, the training unit is adapted to set different penalty factors for different categories during training, where a larger penalty factor is set for a category with fewer data items after sampling.
Optionally, the classification model obtained by the training unit is an m*n matrix, where m is the number of categories and n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i;
the training unit is further adapted to reduce the storage space occupied by the classification model by reducing the precision of the category weights in the model; and/or is further adapted to train the model with L1 regularization to reduce the storage space occupied by the classification model.
Optionally, the labeled data acquisition unit is adapted to obtain manually labeled data; and/or is adapted to label website links with categories in advance and, according to the website links that users click after searching for specified query words, establish a correspondence between each specified search query word and the categories of the clicked website links to obtain labeled data.
Optionally, the training unit is adapted to train on the sampled data of all categories with Liblinear to obtain the classification model;
the feature extraction unit is adapted to perform word segmentation on the input search query word and construct a feature vector in libsvm format from the segmentation result.
Optionally, the device further comprises: a preprocessing unit, adapted to preprocess the input search query word;
the preprocessing comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
Optionally, the device further comprises:
a cache unit, adapted to store a number of search query words and their corresponding category preference probabilities;
a cache query unit, adapted to query the cache unit with the input search query word; if the cache is hit, to send the category preference probability of the input search query word directly to the output unit; if the cache is not hit, to send the input search query word to the feature extraction unit.
Optionally, the device further comprises: a cache data setting unit;
the cache data setting unit is adapted to find, from the search query words whose category preference probabilities have been determined, the predetermined number of search query words queried most frequently, and to save those search query words together with their corresponding category preference probabilities into the cache unit;
or,
a separate cache unit is provided for each content delivery network (CDN);
the cache data setting unit is adapted, for the cache unit of each CDN, to find, from the search query words whose category preference probabilities have been determined and that were submitted through that CDN, the predetermined number of search query words queried most frequently, and to save those search query words together with their corresponding category preference probabilities into the cache unit of that CDN.
Optionally, the device further comprises:
a ranking unit, adapted to rank search results according to the determined category attribute of the search query word.
In the technical scheme of the present invention, feature extraction is performed on an input search query word to obtain a corresponding feature vector, the category preference probability of the search query word is obtained from the query word classifier according to the feature vector, and the category preference probability is analyzed to determine the category attribute of the search query word. By extracting the feature vector of the search query word, feeding it into the query word classifier to obtain the category preference probability, and analyzing that probability to determine the category attribute of the query word, the scheme provides a basic foundation for search, helps ensure search accuracy, and can also provide basic features for subsequent tasks such as search result ranking.
The above description is only an overview of the technical solution of the present invention. In order to understand the technical means of the present invention more clearly so that it can be implemented according to the contents of the specification, and to make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
Brief description of the drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 shows a flowchart of a method for determining the category attribute of a search query word according to an embodiment of the present invention;
Fig. 2 shows a flowchart of a method for determining the classification model of the classifier and using the classifier to determine the category attribute of a search query word according to an embodiment of the present invention;
Fig. 3 shows a structural diagram of a device for determining the category attribute of a search query word according to an embodiment of the present invention;
Fig. 4 shows a structural diagram of a device for determining the category attribute of a search query word according to another embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described below in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Fig. 1 shows a flowchart of a method for determining the category attribute of a search query word according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S110, performing feature extraction on the input search query word to obtain a corresponding feature vector.
Step S120, obtaining the category preference probability of the search query word from the query word classifier according to the feature vector.
Step S130, analyzing the category preference probability of the search query word to determine the category attribute of the search query word.
In the technical scheme shown in Fig. 1, the feature vector of the search query word is extracted and fed into the query word classifier to obtain the category preference probability, and the category preference probability is analyzed to determine the category attribute of the query word, which provides a basic foundation for search, helps ensure search accuracy, and can also provide basic features for subsequent tasks such as search result ranking. Search results can then be ranked according to the category attribute of the search query word determined in step S130.
In one embodiment of the invention, the process of obtaining the classification model of the query word classifier is improved to overcome the fact that purely manual labeling cannot meet the demand for training data, and to address problems such as the classification model becoming biased due to imbalanced training data. In one embodiment of the invention, operations such as preprocessing and cache lookup are also performed before feature extraction on the input search query word, to improve efficiency. The above technical solutions are described below taking the flow shown in Fig. 2 as an example.
Fig. 2 shows a flowchart of a method for determining the classification model of the classifier and using the classifier to determine the category attribute of a search query word according to an embodiment of the present invention. As shown in Fig. 2, the method comprises steps S220 to S224 for determining the classification model of the classifier, i.e., the offline training and learning process, and steps S230 to S238 for determining the category attribute of a search query word with the classifier, i.e., the online prediction process.
Step S220, obtaining labeled data annotated with categories.
In this step, manually labeled data can be obtained. Website links can also be labeled with categories in advance; then, according to the website links that users click after searching for specified query words, a correspondence between each specified search query word and the categories of the clicked website links is established, yielding labeled data. The two approaches can also be combined.
SVM learning is supervised learning, and its training process depends on a large amount of labeled data. Purely manual labeling cannot meet the demand for the large-scale labeled data needed to reach a classifier of a given accuracy, so this embodiment provides a semi-automatic method for labeling training data: a feedback method that uses search engine clicks to label training data indirectly. Specifically, a large number of manually labeled hosts are collected first; the open ODP data can be used, or hosts can be labeled manually, and this labeling process establishes a correspondence from host to category. Then, when a user clicks certain hosts after searching for a word, the correspondence from search word to category is established indirectly through the hosts, using the host-to-category correspondence established in the first step. Here, a host refers to the host of a website; since a host corresponds one-to-one to a website's link, this is in effect a labeling of website links.
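A minimal sketch of this semi-automatic labeling step is given below in Python. The host-to-category map, the click-log format, and the function names are illustrative assumptions and are not taken from the patent.

```python
from collections import defaultdict

def build_labeled_queries(host_categories, click_log):
    """Derive (query -> categories) training labels from click feedback.

    host_categories: dict mapping a host (e.g. "book.example.com") to a
        category label, built from ODP data or manual annotation.
    click_log: iterable of (query, clicked_host) pairs from search logs.
    """
    labeled = defaultdict(set)
    for query, host in click_log:
        category = host_categories.get(host)
        if category is not None:
            # The query inherits the category of the host the user clicked.
            labeled[query].add(category)
    return labeled

# Illustrative usage with made-up hosts and queries.
host_categories = {"book.example.com": "books", "tv.example.com": "tv_series"}
click_log = [("三国演义", "book.example.com"), ("三国演义", "tv.example.com")]
print(build_labeled_queries(host_categories, click_log))
```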
Step S222, sampling a certain amount of data from the labeled data of each category to obtain the sampled data of that category.
The labeled data obtained through the semi-automatic procedure above suffer from significant data imbalance, and manually labeled data may be imbalanced as well. Imbalanced training data for the classification model mean that the decision boundary will be biased toward the side of the categories with few data, so the model tends to assign input examples to the categories with more data, causing classification errors; the situation is even more complicated in multiclass classification. To reduce or even avoid the data imbalance problem in the multiclass model, this embodiment mainly uses random sampling and the adjustment of per-category penalty weights.
Random sampling: from the labeled data of each category, an amount of data above a first preset value and below a second preset value is sampled to obtain the sampled data of that category. From the original training data, each category is sampled with equal probability to a minimum of m items and a maximum of n items, which reduces the extreme data imbalance to some extent.
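One reading of the sampling rule is sketched below; the bounds m and n, the data layout, and the handling of categories that are already smaller than m are assumptions the patent does not spell out.

```python
import random

def sample_per_category(labeled_by_category, m, n, seed=0):
    """Sample between m and n items per category with equal probability.

    labeled_by_category: dict mapping category -> list of labeled queries.
    Categories with at most m items are kept as-is here; this corner case
    is an assumption.
    """
    rng = random.Random(seed)
    sampled = {}
    for category, items in labeled_by_category.items():
        if len(items) <= m:
            sampled[category] = list(items)          # too small to down-sample
        else:
            k = min(len(items), n)                    # cap at the upper bound n
            sampled[category] = rng.sample(items, k)  # equal-probability sample
    return sampled
```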
Category penalty: during training, different penalty factors are set for different categories, and a larger penalty factor is set for a category with fewer data items after sampling. For the sampled data, applying per-category penalty factor weights in the training process, with larger penalties for the categories that have fewer data after sampling, prevents the decision boundary from being biased toward the categories with few data.
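One simple way to realize the category-penalty idea is to make each category's penalty factor inversely proportional to its post-sampling count; the exact formula below is an assumption, since the patent only requires that smaller categories receive larger penalties.

```python
def category_penalty_weights(sampled):
    """Larger penalty for categories with fewer sampled items."""
    max_count = max(len(items) for items in sampled.values())
    # Inverse-frequency weighting; any monotone decreasing function of the
    # per-category count would express the same idea.
    return {cat: max_count / len(items) for cat, items in sampled.items()}
```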
Step S224, training on the sampled data of all categories to obtain the classification model.
Feature extraction and generation in the training process are consistent with the prediction process, and include word segmentation, generating feature vectors, and generating data in libsvm format.
Training: liblinear can be used to train the multiclass classification model; the sampled data of all categories are trained with Liblinear to obtain the classification model.
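The patent trains with Liblinear itself (and, as described below, a version rewritten with OpenMP for parallel multiclass training). As a hedged approximation, the sketch below uses scikit-learn's LogisticRegression with the liblinear solver, which wraps the same library; the feature hashing and all parameter values are illustrative assumptions.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_query_classifier(sampled, penalty_weights):
    """Train a linear multiclass model on the per-category sampled queries.

    sampled: dict category -> list of queries, assumed already segmented
        into space-separated words.
    penalty_weights: dict category -> penalty factor (see the sketch above).
    """
    texts, labels = [], []
    for category, queries in sampled.items():
        texts.extend(queries)
        labels.extend([category] * len(queries))

    # Sparse Boolean-style features over the segmented words; this roughly
    # mirrors the libsvm-format vectors the patent builds.
    vectorizer = HashingVectorizer(n_features=2 ** 20, binary=True,
                                   alternate_sign=False)
    X = vectorizer.fit_transform(texts)

    clf = LogisticRegression(solver="liblinear",       # liblinear under the hood
                             class_weight=penalty_weights,
                             C=1.0)
    clf.fit(X, labels)
    return vectorizer, clf
```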
On top of the original liblinear implementation, the training process is rewritten with OpenMP as parallel training over the classes, which greatly improves training efficiency. The classification model of the multiclass classifier is a matrix M of size m*n, where m is the number of categories and n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i; these weights are floating-point numbers.
Since the number of features ranges from at least hundreds of thousands up to around a million, and the actual number of categories is about 600, the resulting model matrix contains at least hundreds of millions of elements, and the classification model output by the original liblinear training takes nearly 4 GB of disk space. To improve the loading efficiency of the classification model during offline/online prediction, one embodiment of the invention reduces the model size in two ways:
First, reducing the precision of the category weights in the classification model reduces the storage space it occupies. For example, truncating the fractional part of the weights to 6 digits halves the disk storage.
Second, training the model with L1 regularization. The L1 penalty has a feature-selection effect: a large number of feature weights in the resulting model are exactly 0, which likewise reduces disk storage.
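The two size-reduction techniques can be sketched as follows. The 6-digit truncation mirrors the example in the text, while the L1 variant is the standard liblinear option as exposed through scikit-learn; the storage-format details are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def truncate_weights(model_matrix, digits=6):
    """Keep only `digits` fractional digits of each category weight
    (truncation, as in the example in the text)."""
    scale = 10 ** digits
    return np.trunc(model_matrix * scale) / scale

def train_l1_model(X, labels, penalty_weights):
    """L1 regularization drives many feature weights to exactly zero,
    so the model can be stored sparsely."""
    clf = LogisticRegression(solver="liblinear", penalty="l1",
                             class_weight=penalty_weights, C=1.0)
    clf.fit(X, labels)
    nonzero = np.count_nonzero(clf.coef_)
    print(f"non-zero weights: {nonzero} of {clf.coef_.size}")
    return clf
```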
Step S226, outputting the classification model to the classifier.
The classification model of the classifier is obtained through the above process. The online prediction process is described next.
Step S230, receiving an input search query word.
Step S232, preprocessing the input search query word.
Because the words typed into a search engine's search box are of all kinds, noisy information inevitably disturbs the classification result, so the search word needs to be preprocessed. This is in effect a word-cleaning process and comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
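A minimal word-cleaning sketch covering the three operations named above; the length limit, the character whitelist, and the stop-word list are illustrative assumptions.

```python
import re

STOP_WORDS = {"的", "了", "吗", "the", "a"}        # illustrative only
MAX_QUERY_CHARS = 30                                # illustrative long-word limit

def preprocess_query(query):
    """Clean a raw query: long-word filter, special characters, stop words."""
    if len(query) > MAX_QUERY_CHARS:
        return None                                  # filtered out as a long word
    # Keep word characters and CJK characters; drop everything else.
    query = re.sub(r"[^\w\u4e00-\u9fff]+", " ", query)
    tokens = [t for t in query.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```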
Step S234, querying the cache with the search query word; if the cache is hit, directly outputting the category preference probability of the search query word; if the cache is not hit, executing step S236.
Here a cache needs to be preset, storing a number of search query words and their corresponding category preference probabilities. The data in the cache are updated at regular intervals.
Storing a number of search query words and their corresponding category preference probabilities in the preset cache can be done as follows:
from the search query words whose category preference probabilities have been determined, find the predetermined number of search query words queried most frequently, and save those search query words together with their corresponding category preference probabilities into the cache;
or,
preset a separate cache for each content delivery network (CDN); for the cache of each CDN, from the search query words whose category preference probabilities have been determined and that were submitted through that CDN, find the predetermined number of search query words queried most frequently, and save those search query words together with their corresponding category preference probabilities into the cache of that CDN. This approach takes the geographic differences of query word access into account in order to improve the cache hit rate.
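A sketch of the per-CDN cache population and lookup; the log format, the cache size, and the refresh policy are assumptions not fixed by the patent.

```python
from collections import Counter

def build_cdn_caches(query_log, predictions, top_n=100_000):
    """query_log: iterable of (cdn_id, query) pairs.
    predictions: dict query -> category preference probabilities already
    computed offline."""
    counters = {}
    for cdn_id, query in query_log:
        counters.setdefault(cdn_id, Counter())[query] += 1

    caches = {}
    for cdn_id, counter in counters.items():
        # Keep only the top_n most frequently queried words for this CDN.
        caches[cdn_id] = {q: predictions[q]
                          for q, _ in counter.most_common(top_n)
                          if q in predictions}
    return caches

def lookup(caches, cdn_id, query):
    """Return cached probabilities, or None to fall through to step S236."""
    return caches.get(cdn_id, {}).get(query)
```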
Step S236, performing feature extraction on the search query word to obtain a corresponding feature vector.
This step specifically comprises: performing word segmentation on the input search query word and constructing a feature vector in libsvm format from the segmentation result.
Because the input data of the classifier are features of words, this step converts the search query word into a feature vector that conforms to the classifier's input format, which mainly involves word segmentation and feature vector construction. A feature vector in libsvm format is built from the segmentation result and serves as the classifier's input. The libsvm feature vector format uses a sparsely represented vector space model, and the feature space of a single query word after conversion has 600,000 to 1,000,000 dimensions.
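A sketch of building a sparse libsvm-format line from the segmented query. The jieba segmenter and the hashing scheme used to assign feature indices are assumptions; the patent only states that the segmentation result is turned into a libsvm-format sparse vector.

```python
import jieba                      # assumed segmenter; any tokenizer would do

FEATURE_SPACE = 1_000_000         # upper end of the 0.6M-1M dimensions mentioned

def query_to_libsvm(query, label=0):
    """Return a libsvm-format line: '<label> idx:val idx:val ...'."""
    words = [w for w in jieba.cut(query) if w.strip()]
    # Boolean (presence) features; indices are 1-based and sorted, as libsvm
    # requires. Python's built-in hash is process-salted, so a real system
    # would use a fixed vocabulary or a stable hash instead.
    indices = sorted({hash(w) % FEATURE_SPACE + 1 for w in words})
    return str(label) + " " + " ".join(f"{i}:1" for i in indices)

print(query_to_libsvm("三国演义 电视剧"))
```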
Step S238, inputting the feature vector into the classifier; the classifier makes a prediction based on the classification model, obtains the category preference probability of the search query word, and outputs it.
In an embodiment of the present invention, the search query word is converted into data in libsvm format; assuming the feature vector is a column vector X and the classification model matrix is M, the probability of the word being predicted as class i is p_i = X' * M_i. The prediction output is the probability value of the word being recognized under each category. In addition, some other numerical information can be included, such as the variance of the probabilities predicted for the different categories. Experience shows that these outputs are very useful in later confidence calculations and conditional filtering.
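The per-category score p_i = X' * M_i can be computed as a single matrix-vector product. The sketch below assumes a dense NumPy model matrix M of shape (m, n) and a sparse feature vector, which is an implementation assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix

def predict_category_scores(x_sparse, M):
    """x_sparse: 1 x n sparse feature vector; M: m x n model matrix.
    Returns the m per-category scores p_i = X' * M_i and their variance,
    which the text mentions as additional numerical output."""
    scores = np.asarray(x_sparse.dot(M.T)).ravel()   # shape (m,)
    return scores, scores.var()

# Toy example: 3 categories, 5 features, query hits features 1 and 3.
M = np.random.rand(3, 5)
x = csr_matrix(([1.0, 1.0], ([0, 0], [1, 3])), shape=(1, 5))
scores, var = predict_category_scores(x, M)
```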
Polysemy is very common in Chinese, and this characteristic is retained in the classification model: the same query word may have multiple category outputs, and the output criterion is computed from numerical features such as the per-category probabilities and variance predicted by the model for the word. For example, the model may predict that the probabilities of "The Romance of the Three Kingdoms" belonging to books, TV series, and industry & commerce are 0.9, 0.8, and 0.2 respectively. One can observe that the probabilities of the word belonging to books and TV series are significantly higher than the probability of industry & commerce, so the word is considered to belong to both books and TV series. This recognition can be implemented by successively computing the sample mean of the top n probabilities after sorting; it is derived directly from the relationship among the per-category probabilities given by the classifier.
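One possible realization of the "sample mean of the top-n probabilities" rule is sketched below; the acceptance criterion (keep a category while it stays close to the running mean of the categories already accepted) is an assumption about a rule the patent describes only loosely.

```python
def select_categories(prob_by_category, ratio=0.5):
    """Keep categories whose probability is not far below the running mean
    of the already-accepted ones."""
    ranked = sorted(prob_by_category.items(), key=lambda kv: kv[1], reverse=True)
    accepted = [ranked[0]]
    for name, p in ranked[1:]:
        mean_so_far = sum(v for _, v in accepted) / len(accepted)
        if p >= ratio * mean_so_far:
            accepted.append((name, p))
        else:
            break
    return [name for name, _ in accepted]

# The example from the text: books 0.9, TV series 0.8, industry & commerce 0.2.
print(select_categories({"books": 0.9, "tv_series": 0.8, "industry": 0.2}))
# -> ['books', 'tv_series']
```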
The technical scheme described above can analyze the category of a user's query word in real time; the amount of training data is large, the accuracy is high, and the category precision of the finally trained model is high. The categories cover a wide range and can meet the needs of most query word classification services and machine learning models, making this a basic Internet component.
Fig. 3 shows a structural diagram of a device for determining the category attribute of a search query word according to an embodiment of the present invention. As shown in Fig. 3, the device 300 for determining the category attribute of a search query word comprises:
a feature extraction unit 301, adapted to perform feature extraction on an input search query word to obtain a corresponding feature vector;
a classifier 302, adapted to obtain the category preference probability of the search query word according to the feature vector and send it to an output unit;
an output unit 303, adapted to analyze the category preference probability of the search query word, determine the category attribute of the search query word, and output it.
Fig. 4 shows a structural diagram of a device for determining the category attribute of a search query word according to another embodiment of the present invention. As shown in Fig. 4, the device 400 for determining the category attribute of a search query word comprises:
a feature extraction unit 401, adapted to perform feature extraction on an input search query word to obtain a corresponding feature vector;
a classifier 402, adapted to obtain the category preference probability of the search query word according to the feature vector and send it to an output unit;
an output unit 403, adapted to analyze the category preference probability of the search query word, determine the category attribute of the search query word, and output it.
In one embodiment of the invention, the device 400 for determining the category attribute of a search query word shown in Fig. 4 further comprises:
a labeled data acquisition unit 404, adapted to obtain labeled data annotated with categories;
a sampling unit 405, adapted to sample from the labeled data of each category an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
a training unit 406, adapted to train on the sampled data of all categories to obtain the classification model of the classifier.
In one embodiment of the invention, the training unit 406 is adapted to set different penalty factors for different categories during training, where a larger penalty factor is set for a category with fewer data items after sampling.
In one embodiment of the invention, the classification model obtained by the training unit 406 is an m*n matrix, where m is the number of categories and n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i;
the training unit 406 is further adapted to reduce the storage space occupied by the classification model by reducing the precision of the category weights in the model; and/or is further adapted to train the model with L1 regularization to reduce the storage space occupied by the classification model.
In one embodiment of the invention, the labeled data acquisition unit 404 is adapted to obtain manually labeled data; and/or is adapted to label website links with categories in advance and, according to the website links that users click after searching for specified query words, establish a correspondence between each specified search query word and the categories of the clicked website links to obtain labeled data.
In one embodiment of the invention, the training unit 406 is adapted to train on the sampled data of all categories with Liblinear to obtain the classification model;
the feature extraction unit 401 is adapted to perform word segmentation on the input search query word and construct a feature vector in libsvm format from the segmentation result.
In one embodiment of the invention, the device 400 shown in Fig. 4 further comprises: a preprocessing unit 407, adapted to preprocess the input search query word; the preprocessing comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
In one embodiment of the invention, the device 400 shown in Fig. 4 further comprises:
a cache unit 408, adapted to store a number of search query words and their corresponding category preference probabilities;
a cache query unit 409, adapted to query the cache unit 408 with the input search query word; if the cache is hit, to send the category preference probability of the input search query word directly to the output unit 403; if the cache is not hit, to send the input search query word to the feature extraction unit 401.
In one embodiment of the invention, the device 400 shown in Fig. 4 further comprises: a cache data setting unit 410;
the cache data setting unit 410 is adapted to find, from the search query words whose category preference probabilities have been determined, the predetermined number of search query words queried most frequently, and to save those search query words together with their corresponding category preference probabilities into the cache unit;
or,
a separate cache unit is provided for each content delivery network (CDN);
the cache data setting unit 410 is adapted, for the cache unit of each CDN, to find, from the search query words whose category preference probabilities have been determined and that were submitted through that CDN, the predetermined number of search query words queried most frequently, and to save those search query words together with their corresponding category preference probabilities into the cache unit of that CDN.
In one embodiment of the invention, the device 400 shown in Fig. 4 further comprises: a ranking unit 411, adapted to rank search results according to the determined category attribute of the search query word.
To sum up, in the technical scheme of the present invention, feature extraction is performed on an input search query word to obtain a corresponding feature vector, the category preference probability of the search query word is obtained from the query word classifier according to the feature vector, and the category preference probability is analyzed to determine the category attribute of the search query word. By extracting the feature vector of the search query word, feeding it into the query word classifier to obtain the category preference probability, and analyzing that probability to determine the category attribute of the query word, the scheme provides a basic foundation for search, helps ensure search accuracy, and can also provide basic features for subsequent tasks such as search result ranking.
It should be noted that:
The algorithms and displays provided here are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such a system is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages can be used to implement the contents of the present invention described here, and the above description of a specific language is intended to disclose the best mode of the invention.
Numerous specific details are described in the specification provided here. However, it can be understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components in an embodiment can be combined into one module or unit or component, and they can furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art can understand that, although some embodiments described herein include certain features included in other embodiments rather than other features, combinations of features of different embodiments fall within the scope of the present invention and form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination of them. Those skilled in the art will appreciate that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining the category attribute of a search query word according to embodiments of the present invention. The present invention may also be implemented as apparatus or device programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may take the form of one or more signals. Such signals may be downloaded from Internet websites, provided on carrier signals, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A method for determining the category attribute of a search query word, wherein the method comprises:
performing feature extraction on an input search query word to obtain a corresponding feature vector;
obtaining the category preference probability of the search query word from a query word classifier according to the feature vector;
analyzing the category preference probability of the search query word to determine the category attribute of the search query word.
2. The method of claim 1, wherein the method further comprises the following steps for obtaining the classification model of the query word classifier:
obtaining labeled data annotated with categories;
for each category, sampling from its labeled data an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
training on the sampled data of all categories to obtain the classification model.
3. The method of any one of claims 1-2, wherein training on the sampled data of all categories to obtain the classification model comprises:
during training, setting different penalty factors for different categories, wherein a larger penalty factor is set for a category with fewer data items after sampling.
4. The method of any one of claims 1-3, wherein the classification model is an m*n matrix, m is the number of categories, n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i;
training on the sampled data of all categories to obtain the classification model further comprises:
reducing the storage space occupied by the classification model by reducing the precision of the category weights in the model; and/or training the model with L1 regularization to reduce the storage space occupied by the classification model.
5. The method of any one of claims 1-4, wherein obtaining labeled data annotated with categories comprises:
obtaining manually labeled data;
and/or,
labeling website links with categories in advance; according to the website links that users click after searching for specified query words, establishing a correspondence between each specified search query word and the categories of the clicked website links to obtain labeled data.
6. The method of any one of claims 1-5, wherein
training on the sampled data of all categories to obtain the classification model comprises: training on the sampled data of all categories with Liblinear to obtain the classification model;
performing feature extraction on the input search query word to obtain a corresponding feature vector comprises: performing word segmentation on the input search query word and constructing a feature vector in libsvm format from the segmentation result.
7. The method of any one of claims 1-6, wherein, before performing feature extraction on the input search query word, the method further comprises: preprocessing the input search query word;
the preprocessing comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
8. The method of any one of claims 1-7, wherein the method further comprises: presetting a cache in which a number of search query words and their corresponding category preference probabilities are stored;
before performing feature extraction on the input search query word, the method further comprises: querying the cache with the input search query word; if the cache is hit, directly outputting the category preference probability of the input search query word; if the cache is not hit, executing the step of performing feature extraction on the input search query word and the subsequent steps.
9. A device for determining the category attribute of a search query word, wherein the device comprises:
a feature extraction unit, adapted to perform feature extraction on an input search query word to obtain a corresponding feature vector;
a classifier, adapted to obtain the category preference probability of the search query word according to the feature vector and send it to an output unit;
an output unit, adapted to analyze the category preference probability of the search query word, determine the category attribute of the search query word, and output it.
10. The device of claim 9, wherein the device further comprises:
a labeled data acquisition unit, adapted to obtain labeled data annotated with categories;
a sampling unit, adapted to sample from the labeled data of each category an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
a training unit, adapted to train on the sampled data of all categories to obtain the classification model of the classifier.
CN201410225991.3A 2014-05-26 2014-05-26 Method and device for determining categorical attribute of search query word Pending CN104050240A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410225991.3A CN104050240A (en) 2014-05-26 2014-05-26 Method and device for determining categorical attribute of search query word
PCT/CN2015/079800 WO2015180622A1 (en) 2014-05-26 2015-05-26 Method and apparatus for determining categorical attribute of queried word in search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410225991.3A CN104050240A (en) 2014-05-26 2014-05-26 Method and device for determining categorical attribute of search query word

Publications (1)

Publication Number Publication Date
CN104050240A true CN104050240A (en) 2014-09-17

Family

ID=51503073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410225991.3A Pending CN104050240A (en) 2014-05-26 2014-05-26 Method and device for determining categorical attribute of search query word

Country Status (2)

Country Link
CN (1) CN104050240A (en)
WO (1) WO2015180622A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device
CN105101124A (en) * 2015-08-07 2015-11-25 北京奇虎科技有限公司 Method and device for marking category of short messages
WO2015180622A1 (en) * 2014-05-26 2015-12-03 北京奇虎科技有限公司 Method and apparatus for determining categorical attribute of queried word in search
CN105893533A (en) * 2016-03-31 2016-08-24 北京奇艺世纪科技有限公司 Text matching method and device
CN107291775A (en) * 2016-04-11 2017-10-24 北京京东尚科信息技术有限公司 The reparation language material generation method and device of error sample
CN107315759A (en) * 2016-04-26 2017-11-03 百度(美国)有限责任公司 Sort out method, device and processing system, the method for generating classification model of keyword
WO2017201907A1 (en) * 2016-05-24 2017-11-30 百度在线网络技术(北京)有限公司 Search term classification method and device
CN107621892A (en) * 2017-10-18 2018-01-23 北京百度网讯科技有限公司 For obtaining the method and device of information
CN108241650A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 The training method and device of training criteria for classification
CN108763200A (en) * 2018-05-15 2018-11-06 达而观信息科技(上海)有限公司 Chinese word cutting method and device
WO2019180515A1 (en) * 2018-03-23 2019-09-26 International Business Machines Corporation Query recognition resiliency determination in virtual agent systems
CN110674372A (en) * 2019-09-29 2020-01-10 北京百度网讯科技有限公司 Classification method and device
CN113343101A (en) * 2021-06-28 2021-09-03 支付宝(杭州)信息技术有限公司 Object sorting method and system
WO2023071122A1 (en) * 2021-10-29 2023-05-04 广东坚美铝型材厂(集团)有限公司 Semantic feature self-learning method based on nonuniform intervals, and device and storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145245A (en) * 2018-07-26 2019-01-04 腾讯科技(深圳)有限公司 Predict method, apparatus, computer equipment and the storage medium of clicking rate
CN111061835B (en) * 2019-12-17 2023-09-22 医渡云(北京)技术有限公司 Query method and device, electronic equipment and computer readable storage medium
CN112861956A (en) * 2021-02-01 2021-05-28 浪潮云信息技术股份公司 Water pollution model construction method based on data analysis
CN114861057B (en) * 2022-05-17 2023-05-30 北京百度网讯科技有限公司 Resource sending method, training of recommendation model and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020164A (en) * 2012-11-26 2013-04-03 华北电力大学 Semantic search method based on multi-semantic analysis and personalized sequencing
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning
CN103164454A (en) * 2011-12-15 2013-06-19 百度在线网络技术(北京)有限公司 Keyword grouping method and keyword grouping system
CN103425677A (en) * 2012-05-18 2013-12-04 阿里巴巴集团控股有限公司 Method for determining classified models of keywords and method and device for classifying keywords

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123636B (en) * 2011-11-21 2016-04-27 北京百度网讯科技有限公司 Set up the method and apparatus of the method for entry disaggregated model, entry automatic classification
CN103106262B (en) * 2013-01-28 2016-05-11 新浪网技术(中国)有限公司 The method and apparatus that document classification, supporting vector machine model generate
CN103810264B (en) * 2014-01-27 2017-06-06 西安理工大学 The web page text sorting technique of feature based selection
CN104050240A (en) * 2014-05-26 2014-09-17 北京奇虎科技有限公司 Method and device for determining categorical attribute of search query word

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164454A (en) * 2011-12-15 2013-06-19 百度在线网络技术(北京)有限公司 Keyword grouping method and keyword grouping system
CN103425677A (en) * 2012-05-18 2013-12-04 阿里巴巴集团控股有限公司 Method for determining classified models of keywords and method and device for classifying keywords
CN103020164A (en) * 2012-11-26 2013-04-03 华北电力大学 Semantic search method based on multi-semantic analysis and personalized sequencing
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015180622A1 (en) * 2014-05-26 2015-12-03 北京奇虎科技有限公司 Method and apparatus for determining categorical attribute of queried word in search
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device
CN105101124A (en) * 2015-08-07 2015-11-25 北京奇虎科技有限公司 Method and device for marking category of short messages
CN105893533A (en) * 2016-03-31 2016-08-24 北京奇艺世纪科技有限公司 Text matching method and device
CN107291775B (en) * 2016-04-11 2020-07-31 北京京东尚科信息技术有限公司 Method and device for generating repairing linguistic data of error sample
CN107291775A (en) * 2016-04-11 2017-10-24 北京京东尚科信息技术有限公司 The reparation language material generation method and device of error sample
CN107315759A (en) * 2016-04-26 2017-11-03 百度(美国)有限责任公司 Sort out method, device and processing system, the method for generating classification model of keyword
WO2017201907A1 (en) * 2016-05-24 2017-11-30 百度在线网络技术(北京)有限公司 Search term classification method and device
CN108241650A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 The training method and device of training criteria for classification
CN108241650B (en) * 2016-12-23 2020-08-11 北京国双科技有限公司 Training method and device for training classification standard
CN107621892A (en) * 2017-10-18 2018-01-23 北京百度网讯科技有限公司 For obtaining the method and device of information
CN107621892B (en) * 2017-10-18 2021-03-09 北京百度网讯科技有限公司 Method and device for acquiring information
US10831797B2 (en) 2018-03-23 2020-11-10 International Business Machines Corporation Query recognition resiliency determination in virtual agent systems
WO2019180515A1 (en) * 2018-03-23 2019-09-26 International Business Machines Corporation Query recognition resiliency determination in virtual agent systems
CN108763200A (en) * 2018-05-15 2018-11-06 达而观信息科技(上海)有限公司 Chinese word cutting method and device
CN110674372A (en) * 2019-09-29 2020-01-10 北京百度网讯科技有限公司 Classification method and device
CN110674372B (en) * 2019-09-29 2022-07-26 北京百度网讯科技有限公司 Classification method and device
CN113343101A (en) * 2021-06-28 2021-09-03 支付宝(杭州)信息技术有限公司 Object sorting method and system
WO2023071122A1 (en) * 2021-10-29 2023-05-04 广东坚美铝型材厂(集团)有限公司 Semantic feature self-learning method based on nonuniform intervals, and device and storage medium

Also Published As

Publication number Publication date
WO2015180622A1 (en) 2015-12-03

Similar Documents

Publication Publication Date Title
CN104050240A (en) Method and device for determining categorical attribute of search query word
US11238310B2 (en) Training data acquisition method and device, server and storage medium
CN112084327B (en) Classification of sparsely labeled text documents while preserving semantics
CN107168992A (en) Article sorting technique and device, equipment and computer-readable recording medium based on artificial intelligence
CN105677844A (en) Mobile advertisement big data directional pushing and user cross-screen recognition method
CN109948121A (en) Article similarity method for digging, system, equipment and storage medium
CN106202514A (en) Accident based on Agent is across the search method of media information and system
CN103646070A (en) Data processing method and device for search engine
JP6976910B2 (en) Data classification system, data classification method, and data classification device
CN108241867B (en) Classification method and device
CN109690581A (en) User guided system and method
CN110263151A (en) A kind of enigmatic language justice learning method towards multi-angle of view multi-tag data
CN115392237B (en) Emotion analysis model training method, device, equipment and storage medium
CN105574213A (en) Microblog recommendation method and device based on data mining technology
US20220383036A1 (en) Clustering data using neural networks based on normalized cuts
CN108959550B (en) User focus mining method, device, equipment and computer readable medium
CN111078881B (en) Fine-grained sentiment analysis method and system, electronic equipment and storage medium
CN105164672A (en) Content classification
CN110909768B (en) Method and device for acquiring marked data
CN110516164A (en) A kind of information recommendation method, device, equipment and storage medium
CN110555448A (en) Method and system for subdividing dispatch area
CN107368464B (en) Method and device for acquiring bidding product information
CN111753151A (en) Service recommendation method based on internet user behaviors
CN111445280A (en) Model generation method, restaurant ranking method, system, device and medium
CN109684467A (en) A kind of classification method and device of text

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140917