CN104050240A - Method and device for determining categorical attribute of search query word - Google Patents

Method and device for determining categorical attribute of search query word

Info

Publication number
CN104050240A
CN104050240A CN201410225991.3A
Authority
CN
China
Prior art keywords
query word
search query
classification
data
disaggregated model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410225991.3A
Other languages
Chinese (zh)
Inventor
刘鎏
苏晓东
常富洋
王安滨
秦吉胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201410225991.3A priority Critical patent/CN104050240A/en
Publication of CN104050240A publication Critical patent/CN104050240A/en
Priority to PCT/CN2015/079800 priority patent/WO2015180622A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 Details of database functions independent of the retrieved data types
    • G06F 16/95 Retrieval from the web
    • G06F 16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for determining the category attribute of a search query word. The method comprises the following steps: performing feature extraction on an input search query word to obtain a corresponding feature vector; obtaining the category preference probability of the search query word from a query word classifier according to the feature vector; and analyzing the category preference probability of the search query word to determine its category attribute. According to the technical scheme of the invention, the category preference probability of the search query word is obtained by extracting the feature vector of the search query word and inputting the feature vector into the query word classifier, and the category attribute of the search query word is determined by analyzing the category preference probability, which provides a basis for search and helps ensure search accuracy. Moreover, the result can serve as a basic feature for subsequent tasks such as search result ranking.

Description

Method and apparatus for determining the category attribute of a search query word
Technical field
The present invention relates to the technical field of computer networks, and in particular to a method and apparatus for determining the category attribute of a search query word.
Background art
In a complete search, after receiving the search query word entered by a user, a search engine typically goes through processes such as preprocessing the query word, understanding the query word, retrieving documents, ranking, and presenting results, and the whole process must complete within milliseconds. The category of the search query word is very important for understanding the query word: it not only reflects the user's current interests and intent and provides a basis for the current retrieval, but can also serve as a basic feature for downstream search engine ranking models, advertisement CTR prediction models, and natural language models.
The support vector machine (SVM), a supervised learning model in the field of machine learning, was proposed by Vapnik et al. in 1995. The most basic SVM is a binary classification model whose learning strategy is margin maximization. For linearly separable data, a hard-margin SVM is learned by hard margin maximization; for approximately linearly separable data, a soft-margin SVM is learned by soft margin maximization; for data that are not linearly separable at all, the data are mapped to a higher-dimensional space and a soft-margin SVM is learned in that space. In this process the "kernel method" is used to compute the higher-dimensional inner products of the input space implicitly, which is equivalent to learning a soft-margin SVM in the higher-dimensional space.
Liblinear is a linear SVM software package developed by the research team of Professor Chih-Jen Lin at National Taiwan University; it mainly implements linear multiclass classification and linear regression. Liblinear is designed for large-scale machine learning applications: it does not introduce the kernel method but assumes that the data are linearly or approximately linearly separable and trains a linear classifier directly. After years of development, liblinear is widely used in industry for solving large-scale classification and regression problems; it is far superior to kernel SVMs in training and prediction performance, and its accuracy is also satisfactory. From a probabilistic point of view, practical text-processing projects in industry usually adopt a Boolean vector model whose feature count ranges from at least several hundred thousand to over a hundred million, while the available training data cover only a small fraction of the feature space, so the probability that the data are linearly inseparable is small.
It can be seen that the accuracy of existing search query word classifiers still leaves much room for improvement.
Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a search query word classification method and device that overcome, or at least partially solve, the problems described above.
According to one aspect of the present invention, a method for determining the category attribute of a search query word is provided, the method comprising:
performing feature extraction on an input search query word to obtain a corresponding feature vector;
obtaining the category preference probability of the search query word from a query word classifier according to the feature vector;
analyzing the category preference probability of the search query word to determine the category attribute of the search query word.
Optionally, the method further comprises the following steps for obtaining the classification model of the query word classifier:
obtaining labeled data annotated with categories;
for each category, sampling from its labeled data an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
training on the sampled data of all categories to obtain the classification model.
Optionally, training on the sampled data of all categories to obtain the classification model comprises:
during training, setting different penalty factors for different categories, where a larger penalty factor is set for a category with fewer data items after sampling.
Optionally, the classification model is an m*n matrix, where m is the number of categories and n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i;
training on the sampled data of all categories to obtain the classification model further comprises:
reducing the storage space occupied by the classification model by reducing the precision of the category weights in the model; and/or training the model with L1 regularization to reduce the storage space occupied by the classification model.
Optionally, obtaining labeled data annotated with categories comprises:
obtaining manually labeled data;
and/or,
labeling website links with categories in advance; according to the website links that users click after searching for specified query words, establishing a correspondence between each specified search query word and the categories of the clicked website links to obtain labeled data.
Optionally, training on the sampled data of all categories to obtain the classification model comprises: training on the sampled data of all categories with Liblinear to obtain the classification model;
performing feature extraction on the input search query word to obtain a corresponding feature vector comprises: performing word segmentation on the input search query word and constructing a feature vector in libsvm format from the segmentation result.
Optionally, before performing feature extraction on the input search query word, the method further comprises: preprocessing the input search query word;
the preprocessing comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
Optionally, the method further comprises: presetting a cache in which a number of search query words and their corresponding category preference probabilities are stored;
before performing feature extraction on the input search query word, the method further comprises: querying the cache with the input search query word; if the cache is hit, directly outputting the category preference probability of the input search query word; if the cache is not hit, executing the step of performing feature extraction on the input search query word and the subsequent steps.
Optionally, presetting a cache in which a number of search query words and their corresponding category preference probabilities are stored comprises:
from the search query words whose category preference probabilities have been determined, finding the predetermined number of search query words queried most frequently, and saving those search query words together with their corresponding category preference probabilities into the cache;
or,
presetting a separate cache for each content delivery network (CDN); for the cache of each CDN, from the search query words whose category preference probabilities have been determined and that were submitted through that CDN, finding the predetermined number of search query words queried most frequently, and saving those search query words together with their corresponding category preference probabilities into the cache of that CDN.
Optionally, the method further comprises:
ranking search results according to the determined category attribute of the search query word.
According to another aspect of the present invention, a device for determining the category attribute of a search query word is provided, the device comprising:
a feature extraction unit, adapted to perform feature extraction on an input search query word to obtain a corresponding feature vector;
a classifier, adapted to obtain the category preference probability of the search query word according to the feature vector and send it to an output unit;
an output unit, adapted to analyze the category preference probability of the search query word, determine the category attribute of the search query word, and output it.
Optionally, the device further comprises:
a labeled data acquisition unit, adapted to obtain labeled data annotated with categories;
a sampling unit, adapted to sample from the labeled data of each category an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
a training unit, adapted to train on the sampled data of all categories to obtain the classification model of the classifier.
Optionally, the training unit is adapted to set different penalty factors for different categories during training, where a larger penalty factor is set for a category with fewer data items after sampling.
Optionally, the classification model obtained by the training unit is an m*n matrix, where m is the number of categories and n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i;
the training unit is further adapted to reduce the storage space occupied by the classification model by reducing the precision of the category weights in the model; and/or is further adapted to train the model with L1 regularization to reduce the storage space occupied by the classification model.
Optionally, the labeled data acquisition unit is adapted to obtain manually labeled data; and/or is adapted to label website links with categories in advance and, according to the website links that users click after searching for specified query words, establish a correspondence between each specified search query word and the categories of the clicked website links to obtain labeled data.
Optionally, the training unit is adapted to train on the sampled data of all categories with Liblinear to obtain the classification model;
the feature extraction unit is adapted to perform word segmentation on the input search query word and construct a feature vector in libsvm format from the segmentation result.
Optionally, the device further comprises: a preprocessing unit, adapted to preprocess the input search query word;
the preprocessing comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
Optionally, the device further comprises:
a cache unit, adapted to store a number of search query words and their corresponding category preference probabilities;
a cache query unit, adapted to query the cache unit with the input search query word; if the cache is hit, to send the category preference probability of the input search query word directly to the output unit; if the cache is not hit, to send the input search query word to the feature extraction unit.
Optionally, the device further comprises: a cache data setting unit;
the cache data setting unit is adapted to find, from the search query words whose category preference probabilities have been determined, the predetermined number of search query words queried most frequently, and to save those search query words together with their corresponding category preference probabilities into the cache unit;
or,
a separate cache unit is provided for each content delivery network (CDN);
the cache data setting unit is adapted, for the cache unit of each CDN, to find, from the search query words whose category preference probabilities have been determined and that were submitted through that CDN, the predetermined number of search query words queried most frequently, and to save those search query words together with their corresponding category preference probabilities into the cache unit of that CDN.
Optionally, the device further comprises:
a ranking unit, adapted to rank search results according to the determined category attribute of the search query word.
In the technical scheme of the present invention, feature extraction is performed on an input search query word to obtain a corresponding feature vector, the category preference probability of the search query word is obtained from the query word classifier according to the feature vector, and the category preference probability is analyzed to determine the category attribute of the search query word. By extracting the feature vector of the search query word, feeding it into the query word classifier to obtain the category preference probability, and analyzing that probability to determine the category attribute of the query word, the scheme provides a basic foundation for search, helps ensure search accuracy, and can also provide basic features for subsequent tasks such as search result ranking.
The above description is only an overview of the technical solution of the present invention. In order to understand the technical means of the present invention more clearly so that it can be implemented according to the contents of the specification, and to make the above and other objects, features, and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
Brief description of the drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art from reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
Fig. 1 shows a flowchart of a method for determining the category attribute of a search query word according to an embodiment of the present invention;
Fig. 2 shows a flowchart of a method for determining the classification model of the classifier and using the classifier to determine the category attribute of a search query word according to an embodiment of the present invention;
Fig. 3 shows a structural diagram of a device for determining the category attribute of a search query word according to an embodiment of the present invention;
Fig. 4 shows a structural diagram of a device for determining the category attribute of a search query word according to another embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described below in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Fig. 1 shows a flowchart of a method for determining the category attribute of a search query word according to an embodiment of the present invention. As shown in Fig. 1, the method comprises:
Step S110, performing feature extraction on the input search query word to obtain a corresponding feature vector.
Step S120, obtaining the category preference probability of the search query word from the query word classifier according to the feature vector.
Step S130, analyzing the category preference probability of the search query word to determine the category attribute of the search query word.
In the technical scheme shown in Fig. 1, the feature vector of the search query word is extracted and fed into the query word classifier to obtain the category preference probability, and the category preference probability is analyzed to determine the category attribute of the query word, which provides a basic foundation for search, helps ensure search accuracy, and can also provide basic features for subsequent tasks such as search result ranking. Search results can then be ranked according to the category attribute of the search query word determined in step S130.
In one embodiment of the invention, the process of obtaining the classification model of the query word classifier is improved to overcome the fact that purely manual labeling cannot meet the demand for training data, and to address problems such as the classification model becoming biased due to imbalanced training data. In one embodiment of the invention, operations such as preprocessing and cache lookup are also performed before feature extraction on the input search query word, to improve efficiency. The above technical solutions are described below taking the flow shown in Fig. 2 as an example.
Fig. 2 shows a flowchart of a method for determining the classification model of the classifier and using the classifier to determine the category attribute of a search query word according to an embodiment of the present invention. As shown in Fig. 2, the method comprises steps S220 to S224 for determining the classification model of the classifier, i.e., the offline training and learning process, and steps S230 to S238 for determining the category attribute of a search query word with the classifier, i.e., the online prediction process.
Step S220, obtaining labeled data annotated with categories.
In this step, manually labeled data can be obtained. Website links can also be labeled with categories in advance; then, according to the website links that users click after searching for specified query words, a correspondence between each specified search query word and the categories of the clicked website links is established, yielding labeled data. The two approaches can also be combined.
SVM learning is supervised learning, and its training process depends on a large amount of labeled data. Purely manual labeling cannot meet the demand for the large-scale labeled data needed to reach a classifier of a given accuracy, so this embodiment provides a semi-automatic method for labeling training data: a feedback method that uses search engine clicks to label training data indirectly. Specifically, a large number of manually labeled hosts are collected first; the open ODP data can be used, or hosts can be labeled manually, and this labeling process establishes a correspondence from host to category. Then, when a user clicks certain hosts after searching for a word, the correspondence from search word to category is established indirectly through the hosts, using the host-to-category correspondence established in the first step. Here, a host refers to the host of a website; since a host corresponds one-to-one to a website's link, this is in effect a labeling of website links.
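A minimal sketch of this semi-automatic labeling step is given below in Python. The host-to-category map, the click-log format, and the function names are illustrative assumptions and are not taken from the patent.

```python
from collections import defaultdict

def build_labeled_queries(host_categories, click_log):
    """Derive (query -> categories) training labels from click feedback.

    host_categories: dict mapping a host (e.g. "book.example.com") to a
        category label, built from ODP data or manual annotation.
    click_log: iterable of (query, clicked_host) pairs from search logs.
    """
    labeled = defaultdict(set)
    for query, host in click_log:
        category = host_categories.get(host)
        if category is not None:
            # The query inherits the category of the host the user clicked.
            labeled[query].add(category)
    return labeled

# Illustrative usage with made-up hosts and queries.
host_categories = {"book.example.com": "books", "tv.example.com": "tv_series"}
click_log = [("三国演义", "book.example.com"), ("三国演义", "tv.example.com")]
print(build_labeled_queries(host_categories, click_log))
```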
Step S222, sampling a certain amount of data from the labeled data of each category to obtain the sampled data of that category.
The labeled data obtained through the semi-automatic procedure above suffer from significant data imbalance, and manually labeled data may be imbalanced as well. Imbalanced training data for the classification model mean that the decision boundary will be biased toward the side of the categories with few data, so the model tends to assign input examples to the categories with more data, causing classification errors; the situation is even more complicated in multiclass classification. To reduce or even avoid the data imbalance problem in the multiclass model, this embodiment mainly uses random sampling and the adjustment of per-category penalty weights.
Random sampling: from the labeled data of each category, an amount of data above a first preset value and below a second preset value is sampled to obtain the sampled data of that category. From the original training data, each category is sampled with equal probability to a minimum of m items and a maximum of n items, which reduces the extreme data imbalance to some extent.
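One reading of the sampling rule is sketched below; the bounds m and n, the data layout, and the handling of categories that are already smaller than m are assumptions the patent does not spell out.

```python
import random

def sample_per_category(labeled_by_category, m, n, seed=0):
    """Sample between m and n items per category with equal probability.

    labeled_by_category: dict mapping category -> list of labeled queries.
    Categories with at most m items are kept as-is here; this corner case
    is an assumption.
    """
    rng = random.Random(seed)
    sampled = {}
    for category, items in labeled_by_category.items():
        if len(items) <= m:
            sampled[category] = list(items)          # too small to down-sample
        else:
            k = min(len(items), n)                    # cap at the upper bound n
            sampled[category] = rng.sample(items, k)  # equal-probability sample
    return sampled
```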
Category penalty: during training, different penalty factors are set for different categories, and a larger penalty factor is set for a category with fewer data items after sampling. For the sampled data, applying per-category penalty factor weights in the training process, with larger penalties for the categories that have fewer data after sampling, prevents the decision boundary from being biased toward the categories with few data.
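One simple way to realize the category-penalty idea is to make each category's penalty factor inversely proportional to its post-sampling count; the exact formula below is an assumption, since the patent only requires that smaller categories receive larger penalties.

```python
def category_penalty_weights(sampled):
    """Larger penalty for categories with fewer sampled items."""
    max_count = max(len(items) for items in sampled.values())
    # Inverse-frequency weighting; any monotone decreasing function of the
    # per-category count would express the same idea.
    return {cat: max_count / len(items) for cat, items in sampled.items()}
```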
Step S224, training on the sampled data of all categories to obtain the classification model.
Feature extraction and generation in the training process are consistent with the prediction process, and include word segmentation, generating feature vectors, and generating data in libsvm format.
Training: liblinear can be used to train the multiclass classification model; the sampled data of all categories are trained with Liblinear to obtain the classification model.
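The patent trains with Liblinear itself (and, as described below, a version rewritten with OpenMP for parallel multiclass training). As a hedged approximation, the sketch below uses scikit-learn's LogisticRegression with the liblinear solver, which wraps the same library; the feature hashing and all parameter values are illustrative assumptions.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

def train_query_classifier(sampled, penalty_weights):
    """Train a linear multiclass model on the per-category sampled queries.

    sampled: dict category -> list of queries, assumed already segmented
        into space-separated words.
    penalty_weights: dict category -> penalty factor (see the sketch above).
    """
    texts, labels = [], []
    for category, queries in sampled.items():
        texts.extend(queries)
        labels.extend([category] * len(queries))

    # Sparse Boolean-style features over the segmented words; this roughly
    # mirrors the libsvm-format vectors the patent builds.
    vectorizer = HashingVectorizer(n_features=2 ** 20, binary=True,
                                   alternate_sign=False)
    X = vectorizer.fit_transform(texts)

    clf = LogisticRegression(solver="liblinear",       # liblinear under the hood
                             class_weight=penalty_weights,
                             C=1.0)
    clf.fit(X, labels)
    return vectorizer, clf
```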
On top of the original liblinear implementation, the training process is rewritten with OpenMP as parallel training over the classes, which greatly improves training efficiency. The classification model of the multiclass classifier is a matrix M of size m*n, where m is the number of categories and n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i; these weights are floating-point numbers.
Since the number of features ranges from at least hundreds of thousands up to around a million, and the actual number of categories is about 600, the resulting model matrix contains at least hundreds of millions of elements, and the classification model output by the original liblinear training takes nearly 4 GB of disk space. To improve the loading efficiency of the classification model during offline/online prediction, one embodiment of the invention reduces the model size in two ways:
First, reducing the precision of the category weights in the classification model reduces the storage space it occupies. For example, truncating the fractional part of the weights to 6 digits halves the disk storage.
Second, training the model with L1 regularization. The L1 penalty has a feature-selection effect: a large number of feature weights in the resulting model are exactly 0, which likewise reduces disk storage.
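The two size-reduction techniques can be sketched as follows. The 6-digit truncation mirrors the example in the text, while the L1 variant is the standard liblinear option as exposed through scikit-learn; the storage-format details are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def truncate_weights(model_matrix, digits=6):
    """Keep only `digits` fractional digits of each category weight
    (truncation, as in the example in the text)."""
    scale = 10 ** digits
    return np.trunc(model_matrix * scale) / scale

def train_l1_model(X, labels, penalty_weights):
    """L1 regularization drives many feature weights to exactly zero,
    so the model can be stored sparsely."""
    clf = LogisticRegression(solver="liblinear", penalty="l1",
                             class_weight=penalty_weights, C=1.0)
    clf.fit(X, labels)
    nonzero = np.count_nonzero(clf.coef_)
    print(f"non-zero weights: {nonzero} of {clf.coef_.size}")
    return clf
```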
Step S226, outputting the classification model to the classifier.
The classification model of the classifier is obtained through the above process. The online prediction process is described next.
Step S230, receiving an input search query word.
Step S232, preprocessing the input search query word.
Because the words typed into a search engine's search box are of all kinds, noisy information inevitably disturbs the classification result, so the search word needs to be preprocessed. This is in effect a word-cleaning process and comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
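A minimal word-cleaning sketch covering the three operations named above; the length limit, the character whitelist, and the stop-word list are illustrative assumptions.

```python
import re

STOP_WORDS = {"的", "了", "吗", "the", "a"}        # illustrative only
MAX_QUERY_CHARS = 30                                # illustrative long-word limit

def preprocess_query(query):
    """Clean a raw query: long-word filter, special characters, stop words."""
    if len(query) > MAX_QUERY_CHARS:
        return None                                  # filtered out as a long word
    # Keep word characters and CJK characters; drop everything else.
    query = re.sub(r"[^\w\u4e00-\u9fff]+", " ", query)
    tokens = [t for t in query.split() if t not in STOP_WORDS]
    return " ".join(tokens)
```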
Step S234, querying the cache with the search query word; if the cache is hit, directly outputting the category preference probability of the search query word; if the cache is not hit, executing step S236.
Here a cache needs to be preset, storing a number of search query words and their corresponding category preference probabilities. The data in the cache are updated at regular intervals.
Storing a number of search query words and their corresponding category preference probabilities in the preset cache can be done as follows:
from the search query words whose category preference probabilities have been determined, find the predetermined number of search query words queried most frequently, and save those search query words together with their corresponding category preference probabilities into the cache;
or,
preset a separate cache for each content delivery network (CDN); for the cache of each CDN, from the search query words whose category preference probabilities have been determined and that were submitted through that CDN, find the predetermined number of search query words queried most frequently, and save those search query words together with their corresponding category preference probabilities into the cache of that CDN. This approach takes the geographic differences of query word access into account in order to improve the cache hit rate.
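A sketch of the per-CDN cache population and lookup; the log format, the cache size, and the refresh policy are assumptions not fixed by the patent.

```python
from collections import Counter

def build_cdn_caches(query_log, predictions, top_n=100_000):
    """query_log: iterable of (cdn_id, query) pairs.
    predictions: dict query -> category preference probabilities already
    computed offline."""
    counters = {}
    for cdn_id, query in query_log:
        counters.setdefault(cdn_id, Counter())[query] += 1

    caches = {}
    for cdn_id, counter in counters.items():
        # Keep only the top_n most frequently queried words for this CDN.
        caches[cdn_id] = {q: predictions[q]
                          for q, _ in counter.most_common(top_n)
                          if q in predictions}
    return caches

def lookup(caches, cdn_id, query):
    """Return cached probabilities, or None to fall through to step S236."""
    return caches.get(cdn_id, {}).get(query)
```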
Step S236, performing feature extraction on the search query word to obtain a corresponding feature vector.
This step specifically comprises: performing word segmentation on the input search query word and constructing a feature vector in libsvm format from the segmentation result.
Because the input data of the classifier are features of words, this step converts the search query word into a feature vector that conforms to the classifier's input format, which mainly involves word segmentation and feature vector construction. A feature vector in libsvm format is built from the segmentation result and serves as the classifier's input. The libsvm feature vector format uses a sparsely represented vector space model, and the feature space of a single query word after conversion has 600,000 to 1,000,000 dimensions.
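A sketch of building a sparse libsvm-format line from the segmented query. The jieba segmenter and the hashing scheme used to assign feature indices are assumptions; the patent only states that the segmentation result is turned into a libsvm-format sparse vector.

```python
import jieba                      # assumed segmenter; any tokenizer would do

FEATURE_SPACE = 1_000_000         # upper end of the 0.6M-1M dimensions mentioned

def query_to_libsvm(query, label=0):
    """Return a libsvm-format line: '<label> idx:val idx:val ...'."""
    words = [w for w in jieba.cut(query) if w.strip()]
    # Boolean (presence) features; indices are 1-based and sorted, as libsvm
    # requires. Python's built-in hash is process-salted, so a real system
    # would use a fixed vocabulary or a stable hash instead.
    indices = sorted({hash(w) % FEATURE_SPACE + 1 for w in words})
    return str(label) + " " + " ".join(f"{i}:1" for i in indices)

print(query_to_libsvm("三国演义 电视剧"))
```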
Step S238, inputting the feature vector into the classifier; the classifier makes a prediction based on the classification model, obtains the category preference probability of the search query word, and outputs it.
In an embodiment of the present invention, the search query word is converted into data in libsvm format; assuming the feature vector is a column vector X and the classification model matrix is M, the probability of the word being predicted as class i is p_i = X' * M_i. The prediction output is the probability value of the word being recognized under each category. In addition, some other numerical information can be included, such as the variance of the probabilities predicted for the different categories. Experience shows that these outputs are very useful in later confidence calculations and conditional filtering.
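The per-category score p_i = X' * M_i can be computed as a single matrix-vector product. The sketch below assumes a dense NumPy model matrix M of shape (m, n) and a sparse feature vector, which is an implementation assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix

def predict_category_scores(x_sparse, M):
    """x_sparse: 1 x n sparse feature vector; M: m x n model matrix.
    Returns the m per-category scores p_i = X' * M_i and their variance,
    which the text mentions as additional numerical output."""
    scores = np.asarray(x_sparse.dot(M.T)).ravel()   # shape (m,)
    return scores, scores.var()

# Toy example: 3 categories, 5 features, query hits features 1 and 3.
M = np.random.rand(3, 5)
x = csr_matrix(([1.0, 1.0], ([0, 0], [1, 3])), shape=(1, 5))
scores, var = predict_category_scores(x, M)
```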
Polysemy is very common in Chinese, and this characteristic is retained in the classification model: the same query word may have multiple category outputs, and the output criterion is computed from numerical features such as the per-category probabilities and variance predicted by the model for the word. For example, the model may predict that the probabilities of "The Romance of the Three Kingdoms" belonging to books, TV series, and industry & commerce are 0.9, 0.8, and 0.2 respectively. One can observe that the probabilities of the word belonging to books and TV series are significantly higher than the probability of industry & commerce, so the word is considered to belong to both books and TV series. This recognition can be implemented by successively computing the sample mean of the top n probabilities after sorting; it is derived directly from the relationship among the per-category probabilities given by the classifier.
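One possible realization of the "sample mean of the top-n probabilities" rule is sketched below; the acceptance criterion (keep a category while it stays close to the running mean of the categories already accepted) is an assumption about a rule the patent describes only loosely.

```python
def select_categories(prob_by_category, ratio=0.5):
    """Keep categories whose probability is not far below the running mean
    of the already-accepted ones."""
    ranked = sorted(prob_by_category.items(), key=lambda kv: kv[1], reverse=True)
    accepted = [ranked[0]]
    for name, p in ranked[1:]:
        mean_so_far = sum(v for _, v in accepted) / len(accepted)
        if p >= ratio * mean_so_far:
            accepted.append((name, p))
        else:
            break
    return [name for name, _ in accepted]

# The example from the text: books 0.9, TV series 0.8, industry & commerce 0.2.
print(select_categories({"books": 0.9, "tv_series": 0.8, "industry": 0.2}))
# -> ['books', 'tv_series']
```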
The technical scheme described above can analyze the category of a user's query word in real time; the amount of training data is large, the accuracy is high, and the category precision of the finally trained model is high. The categories cover a wide range and can meet the needs of most query word classification services and machine learning models, making this a basic Internet component.
Fig. 3 shows a structural diagram of a device for determining the category attribute of a search query word according to an embodiment of the present invention. As shown in Fig. 3, the device 300 for determining the category attribute of a search query word comprises:
a feature extraction unit 301, adapted to perform feature extraction on an input search query word to obtain a corresponding feature vector;
a classifier 302, adapted to obtain the category preference probability of the search query word according to the feature vector and send it to an output unit;
an output unit 303, adapted to analyze the category preference probability of the search query word, determine the category attribute of the search query word, and output it.
Fig. 4 shows a structural diagram of a device for determining the category attribute of a search query word according to another embodiment of the present invention. As shown in Fig. 4, the device 400 for determining the category attribute of a search query word comprises:
a feature extraction unit 401, adapted to perform feature extraction on an input search query word to obtain a corresponding feature vector;
a classifier 402, adapted to obtain the category preference probability of the search query word according to the feature vector and send it to an output unit;
an output unit 403, adapted to analyze the category preference probability of the search query word, determine the category attribute of the search query word, and output it.
In one embodiment of the invention, the device 400 for determining the category attribute of a search query word shown in Fig. 4 further comprises:
a labeled data acquisition unit 404, adapted to obtain labeled data annotated with categories;
a sampling unit 405, adapted to sample from the labeled data of each category an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
a training unit 406, adapted to train on the sampled data of all categories to obtain the classification model of the classifier.
In one embodiment of the invention, the training unit 406 is adapted to set different penalty factors for different categories during training, where a larger penalty factor is set for a category with fewer data items after sampling.
In one embodiment of the invention, the classification model obtained by the training unit 406 is an m*n matrix, where m is the number of categories and n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i;
the training unit 406 is further adapted to reduce the storage space occupied by the classification model by reducing the precision of the category weights in the model; and/or is further adapted to train the model with L1 regularization to reduce the storage space occupied by the classification model.
In one embodiment of the invention, the labeled data acquisition unit 404 is adapted to obtain manually labeled data; and/or is adapted to label website links with categories in advance and, according to the website links that users click after searching for specified query words, establish a correspondence between each specified search query word and the categories of the clicked website links to obtain labeled data.
In one embodiment of the invention, the training unit 406 is adapted to train on the sampled data of all categories with Liblinear to obtain the classification model;
the feature extraction unit 401 is adapted to perform word segmentation on the input search query word and construct a feature vector in libsvm format from the segmentation result.
In one embodiment of the invention, the device 400 shown in Fig. 4 further comprises: a preprocessing unit 407, adapted to preprocess the input search query word; the preprocessing comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
In one embodiment of the invention, the device 400 shown in Fig. 4 further comprises:
a cache unit 408, adapted to store a number of search query words and their corresponding category preference probabilities;
a cache query unit 409, adapted to query the cache unit 408 with the input search query word; if the cache is hit, to send the category preference probability of the input search query word directly to the output unit 403; if the cache is not hit, to send the input search query word to the feature extraction unit 401.
In one embodiment of the invention, the device 400 shown in Fig. 4 further comprises: a cache data setting unit 410;
the cache data setting unit 410 is adapted to find, from the search query words whose category preference probabilities have been determined, the predetermined number of search query words queried most frequently, and to save those search query words together with their corresponding category preference probabilities into the cache unit;
or,
a separate cache unit is provided for each content delivery network (CDN);
the cache data setting unit 410 is adapted, for the cache unit of each CDN, to find, from the search query words whose category preference probabilities have been determined and that were submitted through that CDN, the predetermined number of search query words queried most frequently, and to save those search query words together with their corresponding category preference probabilities into the cache unit of that CDN.
In one embodiment of the invention, the device 400 shown in Fig. 4 further comprises: a ranking unit 411, adapted to rank search results according to the determined category attribute of the search query word.
To sum up, in the technical scheme of the present invention, feature extraction is performed on an input search query word to obtain a corresponding feature vector, the category preference probability of the search query word is obtained from the query word classifier according to the feature vector, and the category preference probability is analyzed to determine the category attribute of the search query word. By extracting the feature vector of the search query word, feeding it into the query word classifier to obtain the category preference probability, and analyzing that probability to determine the category attribute of the query word, the scheme provides a basic foundation for search, helps ensure search accuracy, and can also provide basic features for subsequent tasks such as search result ranking.
It should be noted that:
The algorithms and displays provided here are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein, and the structure required to construct such a system is apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages can be used to implement the contents of the present invention described here, and the above description of a specific language is intended to disclose the best mode of the invention.
Numerous specific details are described in the specification provided here. However, it can be understood that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, in the above description of exemplary embodiments of the invention the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from the embodiment. The modules or units or components in an embodiment can be combined into one module or unit or component, and they can furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art can understand that, although some embodiments described herein include certain features included in other embodiments rather than other features, combinations of features of different embodiments fall within the scope of the present invention and form different embodiments. For example, in the following claims, any one of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination of them. Those skilled in the art will appreciate that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining the category attribute of a search query word according to embodiments of the present invention. The present invention may also be implemented as apparatus or device programs (for example, computer programs and computer program products) for performing part or all of the methods described herein. Such programs implementing the present invention may be stored on computer-readable media or may take the form of one or more signals. Such signals may be downloaded from Internet websites, provided on carrier signals, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A method for determining the category attribute of a search query word, wherein the method comprises:
performing feature extraction on an input search query word to obtain a corresponding feature vector;
obtaining the category preference probability of the search query word from a query word classifier according to the feature vector;
analyzing the category preference probability of the search query word to determine the category attribute of the search query word.
2. The method of claim 1, wherein the method further comprises the following steps for obtaining the classification model of the query word classifier:
obtaining labeled data annotated with categories;
for each category, sampling from its labeled data an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
training on the sampled data of all categories to obtain the classification model.
3. The method of any one of claims 1-2, wherein training on the sampled data of all categories to obtain the classification model comprises:
during training, setting different penalty factors for different categories, wherein a larger penalty factor is set for a category with fewer data items after sampling.
4. The method of any one of claims 1-3, wherein the classification model is an m*n matrix, m is the number of categories, n is the number of features, and each element a(i, j) of the matrix represents the weight of feature j for category i;
training on the sampled data of all categories to obtain the classification model further comprises:
reducing the storage space occupied by the classification model by reducing the precision of the category weights in the model; and/or training the model with L1 regularization to reduce the storage space occupied by the classification model.
5. The method of any one of claims 1-4, wherein obtaining labeled data annotated with categories comprises:
obtaining manually labeled data;
and/or,
labeling website links with categories in advance; according to the website links that users click after searching for specified query words, establishing a correspondence between each specified search query word and the categories of the clicked website links to obtain labeled data.
6. The method of any one of claims 1-5, wherein
training on the sampled data of all categories to obtain the classification model comprises: training on the sampled data of all categories with Liblinear to obtain the classification model;
performing feature extraction on the input search query word to obtain a corresponding feature vector comprises: performing word segmentation on the input search query word and constructing a feature vector in libsvm format from the segmentation result.
7. The method of any one of claims 1-6, wherein, before performing feature extraction on the input search query word, the method further comprises: preprocessing the input search query word;
the preprocessing comprises one or more of the following: filtering long words, deleting special characters, and deleting stop words.
8. The method of any one of claims 1-7, wherein the method further comprises: presetting a cache in which a number of search query words and their corresponding category preference probabilities are stored;
before performing feature extraction on the input search query word, the method further comprises: querying the cache with the input search query word; if the cache is hit, directly outputting the category preference probability of the input search query word; if the cache is not hit, executing the step of performing feature extraction on the input search query word and the subsequent steps.
9. A device for determining the category attribute of a search query word, wherein the device comprises:
a feature extraction unit, adapted to perform feature extraction on an input search query word to obtain a corresponding feature vector;
a classifier, adapted to obtain the category preference probability of the search query word according to the feature vector and send it to an output unit;
an output unit, adapted to analyze the category preference probability of the search query word, determine the category attribute of the search query word, and output it.
10. The device of claim 9, wherein the device further comprises:
a labeled data acquisition unit, adapted to obtain labeled data annotated with categories;
a sampling unit, adapted to sample from the labeled data of each category an amount of data above a first preset value and below a second preset value, to obtain the sampled data of that category;
a training unit, adapted to train on the sampled data of all categories to obtain the classification model of the classifier.
CN201410225991.3A 2014-05-26 2014-05-26 Method and device for determining categorical attribute of search query word Pending CN104050240A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201410225991.3A CN104050240A (en) 2014-05-26 2014-05-26 Method and device for determining categorical attribute of search query word
PCT/CN2015/079800 WO2015180622A1 (en) 2014-05-26 2015-05-26 Method and apparatus for determining categorical attribute of queried word in search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410225991.3A CN104050240A (en) 2014-05-26 2014-05-26 Method and device for determining categorical attribute of search query word

Publications (1)

Publication Number Publication Date
CN104050240A true CN104050240A (en) 2014-09-17

Family

ID=51503073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410225991.3A Pending CN104050240A (en) 2014-05-26 2014-05-26 Method and device for determining categorical attribute of search query word

Country Status (2)

Country Link
CN (1) CN104050240A (en)
WO (1) WO2015180622A1 (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device
CN105101124A (en) * 2015-08-07 2015-11-25 北京奇虎科技有限公司 Method and device for marking category of short messages
WO2015180622A1 (en) * 2014-05-26 2015-12-03 北京奇虎科技有限公司 Method and apparatus for determining categorical attribute of queried word in search
CN105893533A (en) * 2016-03-31 2016-08-24 北京奇艺世纪科技有限公司 Text matching method and device
CN107291775A (en) * 2016-04-11 2017-10-24 北京京东尚科信息技术有限公司 The reparation language material generation method and device of error sample
CN107315759A (en) * 2016-04-26 2017-11-03 百度(美国)有限责任公司 Sort out method, device and processing system, the method for generating classification model of keyword
WO2017201907A1 (en) * 2016-05-24 2017-11-30 百度在线网络技术(北京)有限公司 Search term classification method and device
CN107621892A (en) * 2017-10-18 2018-01-23 北京百度网讯科技有限公司 For obtaining the method and device of information
CN108241650A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 The training method and device of training criteria for classification
CN108763200A (en) * 2018-05-15 2018-11-06 达而观信息科技(上海)有限公司 Chinese word cutting method and device
WO2019180515A1 (en) * 2018-03-23 2019-09-26 International Business Machines Corporation Query recognition resiliency determination in virtual agent systems
CN110674372A (en) * 2019-09-29 2020-01-10 北京百度网讯科技有限公司 Classification method and device
CN113343101A (en) * 2021-06-28 2021-09-03 支付宝(杭州)信息技术有限公司 Object sorting method and system
WO2023071122A1 (en) * 2021-10-29 2023-05-04 广东坚美铝型材厂(集团)有限公司 Semantic feature self-learning method based on nonuniform intervals, and device and storage medium

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145245A (en) * 2018-07-26 2019-01-04 腾讯科技(深圳)有限公司 Predict method, apparatus, computer equipment and the storage medium of clicking rate
CN111061835B (en) * 2019-12-17 2023-09-22 医渡云(北京)技术有限公司 Query method and device, electronic equipment and computer readable storage medium
CN112861956A (en) * 2021-02-01 2021-05-28 浪潮云信息技术股份公司 Water pollution model construction method based on data analysis
CN114861057B (en) * 2022-05-17 2023-05-30 北京百度网讯科技有限公司 Resource sending method, training of recommendation model and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020164A (en) * 2012-11-26 2013-04-03 华北电力大学 Semantic search method based on multi-semantic analysis and personalized sequencing
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning
CN103164454A (en) * 2011-12-15 2013-06-19 百度在线网络技术(北京)有限公司 Keyword grouping method and keyword grouping system
CN103425677A (en) * 2012-05-18 2013-12-04 阿里巴巴集团控股有限公司 Method for determining classified models of keywords and method and device for classifying keywords

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103123636B (en) * 2011-11-21 2016-04-27 北京百度网讯科技有限公司 Set up the method and apparatus of the method for entry disaggregated model, entry automatic classification
CN103106262B (en) * 2013-01-28 2016-05-11 新浪网技术(中国)有限公司 The method and apparatus that document classification, supporting vector machine model generate
CN103810264B (en) * 2014-01-27 2017-06-06 西安理工大学 The web page text sorting technique of feature based selection
CN104050240A (en) * 2014-05-26 2014-09-17 北京奇虎科技有限公司 Method and device for determining categorical attribute of search query word

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103164454A (en) * 2011-12-15 2013-06-19 百度在线网络技术(北京)有限公司 Keyword grouping method and keyword grouping system
CN103425677A (en) * 2012-05-18 2013-12-04 阿里巴巴集团控股有限公司 Method for determining classified models of keywords and method and device for classifying keywords
CN103020164A (en) * 2012-11-26 2013-04-03 华北电力大学 Semantic search method based on multi-semantic analysis and personalized sequencing
CN103123653A (en) * 2013-03-15 2013-05-29 山东浪潮齐鲁软件产业股份有限公司 Search engine retrieving ordering method based on Bayesian classification learning

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015180622A1 (en) * 2014-05-26 2015-12-03 北京奇虎科技有限公司 Method and apparatus for determining categorical attribute of queried word in search
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device
CN105101124A (en) * 2015-08-07 2015-11-25 北京奇虎科技有限公司 Method and device for marking category of short messages
CN105893533A (en) * 2016-03-31 2016-08-24 北京奇艺世纪科技有限公司 Text matching method and device
CN107291775B (en) * 2016-04-11 2020-07-31 北京京东尚科信息技术有限公司 Method and device for generating repairing linguistic data of error sample
CN107291775A (en) * 2016-04-11 2017-10-24 北京京东尚科信息技术有限公司 The reparation language material generation method and device of error sample
CN107315759A (en) * 2016-04-26 2017-11-03 百度(美国)有限责任公司 Sort out method, device and processing system, the method for generating classification model of keyword
WO2017201907A1 (en) * 2016-05-24 2017-11-30 百度在线网络技术(北京)有限公司 Search term classification method and device
CN108241650A (en) * 2016-12-23 2018-07-03 北京国双科技有限公司 The training method and device of training criteria for classification
CN108241650B (en) * 2016-12-23 2020-08-11 北京国双科技有限公司 Training method and device for training classification standard
CN107621892A (en) * 2017-10-18 2018-01-23 北京百度网讯科技有限公司 For obtaining the method and device of information
CN107621892B (en) * 2017-10-18 2021-03-09 北京百度网讯科技有限公司 Method and device for acquiring information
US10831797B2 (en) 2018-03-23 2020-11-10 International Business Machines Corporation Query recognition resiliency determination in virtual agent systems
WO2019180515A1 (en) * 2018-03-23 2019-09-26 International Business Machines Corporation Query recognition resiliency determination in virtual agent systems
CN108763200A (en) * 2018-05-15 2018-11-06 达而观信息科技(上海)有限公司 Chinese word cutting method and device
CN110674372A (en) * 2019-09-29 2020-01-10 北京百度网讯科技有限公司 Classification method and device
CN110674372B (en) * 2019-09-29 2022-07-26 北京百度网讯科技有限公司 Classification method and device
CN113343101A (en) * 2021-06-28 2021-09-03 支付宝(杭州)信息技术有限公司 Object sorting method and system
WO2023071122A1 (en) * 2021-10-29 2023-05-04 广东坚美铝型材厂(集团)有限公司 Semantic feature self-learning method based on nonuniform intervals, and device and storage medium

Also Published As

Publication number Publication date
WO2015180622A1 (en) 2015-12-03

Similar Documents

Publication Publication Date Title
CN104050240A (en) Method and device for determining categorical attribute of search query word
US11238310B2 (en) Training data acquisition method and device, server and storage medium
CN112084327B (en) Classification of sparsely labeled text documents while preserving semantics
CN107168992A (en) Article sorting technique and device, equipment and computer-readable recording medium based on artificial intelligence
CN105677844A (en) Mobile advertisement big data directional pushing and user cross-screen recognition method
CN109948121A (en) Article similarity method for digging, system, equipment and storage medium
CN106202514A (en) Accident based on Agent is across the search method of media information and system
CN103646070A (en) Data processing method and device for search engine
JP6976910B2 (en) Data classification system, data classification method, and data classification device
CN108241867B (en) Classification method and device
CN109690581A (en) User guided system and method
CN110263151A (en) A kind of enigmatic language justice learning method towards multi-angle of view multi-tag data
CN115392237B (en) Emotion analysis model training method, device, equipment and storage medium
CN105574213A (en) Microblog recommendation method and device based on data mining technology
US20220383036A1 (en) Clustering data using neural networks based on normalized cuts
CN108959550B (en) User focus mining method, device, equipment and computer readable medium
CN111078881B (en) Fine-grained sentiment analysis method and system, electronic equipment and storage medium
CN105164672A (en) Content classification
CN110909768B (en) Method and device for acquiring marked data
CN110516164A (en) A kind of information recommendation method, device, equipment and storage medium
CN110555448A (en) Method and system for subdividing dispatch area
CN107368464B (en) Method and device for acquiring bidding product information
CN111753151A (en) Service recommendation method based on internet user behaviors
CN111445280A (en) Model generation method, restaurant ranking method, system, device and medium
CN109684467A (en) A kind of classification method and device of text

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140917