CN104615605B

CN104615605B - Method and apparatus for predicting the category of a data object

Info

Publication number: CN104615605B
Application number: CN201310542419.5A
Authority: CN
Inventors: 陈明修; 董凡
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2013-11-05
Filing date: 2013-11-05
Publication date: 2018-07-24
Anticipated expiration: 2033-11-05
Also published as: CN104615605A

Abstract

The present application relates to a method and apparatus for predicting the category of a data object. The method includes: extracting at least one object feature from a data object to be predicted; obtaining, according to the object features, a feature set from a feature tree built in advance from existing data objects in a database and their corresponding data object categories, the feature set including object feature pairs that are related to each other and single object features that are not related to any other object feature; obtaining, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution compiled in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree; and determining a predicted category set for the data object to be predicted according to the category probability distributions. The scheme of the present application can improve the accuracy of category prediction for data objects.

Description

Method and apparatus for predicting the category of a data object

Technical field

The present application relates to the field of data processing, and more particularly to a method and apparatus for predicting the category of a data object.

Background

With the continuing development of online data interaction, many website servers, after obtaining the basic information of a data object (such as its title and attribute description), need to attach the data object to a back-end category. The category then serves as the basis for category navigation of data objects in search, statistics along various dimensions, product library construction, and so on. The category of a data object therefore needs to be predicted in order to determine the category associated with that data object.

One prior-art category prediction scheme is based on a category click dictionary, which records the category click distribution of each word, computed from users' historical query words and the category clicks associated with those query words. More specifically, when the category of a data object needs to be predicted, the title of the data object is first segmented into at least one word. The category click distribution of each word is then looked up in the category click dictionary, and the category that occurs most often across all words is chosen as the predicted category of the data object.

However, users' category clicks are relatively sparse and cannot cover massive amounts of data. In addition, some query words are affected by malicious query inflation (certain users repeatedly issue particular query words to raise the click-through rate of information associated with themselves), so the category click data of words is not very accurate, which seriously degrades the accuracy of categories predicted from that data. Furthermore, a word repeated several times in a title may cause an inaccurate category to be predicted.

An improved category prediction technique is therefore needed to overcome the above problems in the prior art and improve the accuracy of category prediction for data objects.

Summary of the invention

An object of the present application is to provide a technique for predicting the category of a data object that predicts the category more accurately, so as to determine the category associated with the data object.

Specifically, according to one aspect of the embodiments of the present application, a method for predicting the category of a data object is provided, comprising: extracting at least one object feature from a data object to be predicted; obtaining, according to the object features, a feature set from a feature tree built in advance from existing data objects in a database and their corresponding data object categories, the feature set comprising object feature pairs that are related to each other among the object features and single object features that are not related to any other object feature; obtaining, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution compiled in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree; and determining a predicted category set for the data object to be predicted according to the category probability distributions.

According to another aspect of the embodiments of the present application, an apparatus for predicting the category of a data object is provided, comprising: a feature extraction module for extracting at least one object feature from a data object to be predicted; a first acquisition module for obtaining, according to the object features, a feature set from a feature tree built in advance from existing data objects in a database and their corresponding data object categories, the feature set comprising object feature pairs that are related to each other among the object features and single object features that are not related to any other object feature; a second acquisition module for obtaining, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution compiled in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree; and a category determination module for determining a predicted category set for the data object to be predicted according to the category probability distributions.

Compared with the prior art, the scheme of the present application builds a tree-augmented naive Bayes network model (the feature tree) from the existing data objects in a database (for example, a site-wide database) and their corresponding categories, and predicts categories with that model. The relevant data of the whole site database is thereby covered, which improves the accuracy of category prediction. In addition, the scheme uses all distinct words obtained after word segmentation as features for building the model, which ensures that repeated words do not bias the category prediction of a data object and further improves accuracy. Finally, when building the tree-augmented naive Bayes network, the scheme relaxes the conditions for connecting nodes and allows each node to be connected with more other nodes, which greatly enriches the whole network and further improves the accuracy of category prediction.

Brief description of the drawings

The drawings described here are provided for further understanding of the present application and constitute a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of a method for predicting the category of a data object according to one embodiment of the present application;

Fig. 2 is a flowchart of a method for building a feature tree according to one embodiment of the present application;

Fig. 3 is a flowchart of a method for building a feature tree according to a more specific embodiment of the present application;

Fig. 4 is a flowchart of a method for compiling the feature-category probability distribution according to one embodiment of the present application; and

Fig. 5 is a structural block diagram of an apparatus for predicting the category of a data object according to one embodiment of the present application.

Detailed description

The main idea of the present application is to use the information on existing data objects in a database (for example, a site-wide database) and their corresponding categories as raw training data to build a tree-augmented naive Bayes network, and to use that network to predict the category of a data object to be predicted, so as to determine the category associated with it. Specifically, a feature tree is built from the information on the existing data objects in the database and their corresponding categories, and a feature-category probability distribution is compiled from that same information together with the feature tree. The feature tree and the feature-category probability distribution obtained in this way then serve as the basis for subsequent category prediction on data objects to be predicted.

A further idea of the present application is to optimize, while the tree-augmented naive Bayes network is being built, the possible connections between its nodes by raising the maximum number of connections allowed per node. This prevents the situation in which, during category prediction, the sparsity of the network leaves some features unable to connect with other features, so that the small number of feature combinations causes the prediction result to miss related categories or to become biased. Specifically, the present application departs from the conventional tree-augmented naive Bayes network, in which a node is usually allowed to connect with at most two other nodes, and lets a node connect with more other nodes, for example up to 100. The whole network thus becomes denser and covers features more comprehensively, which in turn improves the accuracy of category prediction for data objects.

To make the purpose, technical solutions, and advantages of the present application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.

The category prediction scheme of the present application can be applied to any scenario that requires category prediction, that is, it is suitable for category prediction on all kinds of data objects. For example, it can be used by various website servers to predict the category of their business objects or service objects. In a typical scenario, the scheme can be applied by an e-commerce website server to predict the categories of various commodities, so as to determine the category associated with each commodity. It should be noted that the present application is not limited to any particular application scenario and is applicable to any other suitable category prediction scenario, existing or developed in the future.

Referring to Fig. 1, Fig. 1 shows a flowchart of a method 100 for predicting the category of a data object according to one embodiment of the present application.

As shown in Fig. 1, at step S110, at least one object feature is extracted from the data object to be predicted.

Specifically, at least one object feature can be extracted from information such as the title, abstract, details, or attributes of the data object to be predicted. In a typical embodiment, at least one object feature is extracted from the title of the data object to be predicted. For ease of description, the embodiments below are all described using extraction from the title as the example. Those skilled in the art will understand, however, that in other embodiments of the present application object features can also be extracted from the abstract, details, attributes, or other information of the data object to be predicted.

In one embodiment of the present application, word segmentation can be applied to the title of the data object to be predicted using natural language processing techniques, so as to extract at least one object feature. In other embodiments, part-of-speech tagging can additionally be applied to the extracted object features. In a preferred embodiment, the product words identified by tagging are given a preset label, which helps improve the accuracy of the overall category prediction.

More specifically, a term weight (TermWeight) technique can be used to split the title of the data object to be predicted into several words, use those words as object features, and label the product words. For example, if the title of a commodity is "supply white chiffon dress", the extracted object features can be "supply", "white", "chiffon", and "dress CP", where CP is the preset label for product words in this example, that is, CP marks "dress" as a product.

According to a preferred embodiment of the present application, after several object features have been extracted at step S110, they can be filtered, for example by checking them against a filter vocabulary set in advance from experience or as needed. This removes object features that are essentially meaningless for prediction, such as words that appear in most titles. Unnecessary computation in category prediction is thereby reduced, which lowers computational complexity and improves prediction efficiency.

For instance, in the example above, the feature "supply", which appears in most titles, can be filtered out.
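The extraction, labeling, and filtering described for step S110 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the title has already been segmented into words, and the function name `extract_features`, the product-word list, and the filter vocabulary are all hypothetical.

```python
def extract_features(tokens, product_words, filter_words, tag="CP"):
    """Turn segmented title words into object features.

    tokens        -- words from an (assumed) upstream word-segmentation step
    product_words -- hypothetical set of words known to be product words
    filter_words  -- hypothetical filter vocabulary of uninformative words
    """
    features = []
    seen = set()
    for word in tokens:
        # Drop filtered words and repeated words so each feature appears once.
        if word in filter_words or word in seen:
            continue
        seen.add(word)
        # Product words carry the preset label, e.g. "dress CP".
        features.append(f"{word} {tag}" if word in product_words else word)
    return features


title_tokens = ["supply", "white", "chiffon", "dress"]
features = extract_features(
    title_tokens, product_words={"dress"}, filter_words={"supply"}
)
print(features)  # ['white', 'chiffon', 'dress CP']
```

Keeping each word only once reflects the statement elsewhere in the description that distinct words are used as features, so a repeated word cannot dominate the prediction.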

At step S120, according to the object features, a feature set is obtained from the feature tree built in advance from the titles of existing data objects in the database and their corresponding data object categories. The feature set includes object feature pairs that are related to each other among the object features and single object features that are not related to any other object feature.

Specifically, in the embodiments of the present application, the feature tree is a tree-augmented naive Bayes network built from information on the existing data objects in the database (such as titles and corresponding categories). It is a directed network structure whose nodes are the features extracted from the information of the existing data objects and which contains the topological relations between the nodes.

After the object features of the data object to be predicted have been extracted, the feature tree built in advance can be searched for the object feature pairs that are related to each other among these object features and for the single object features that are not related to any other object feature.

Here, a "related object feature pair" means that if node A points to node B in the feature tree, node A and node B are considered a related object feature pair. A "single object feature that is not related to any other object feature" means that if node C does not point to any other node in the feature tree, node C is considered a single object feature that is not related to any other object feature.

In the example above, suppose the feature tree built in advance shows that, among the features "supply", "white", "chiffon", and "dress CP", only the feature node "dress CP" points to "chiffon", and no other two features are related (there is no pointing relation between them). The feature set corresponding to the data object to be predicted can then include: white, chiffon, and chiffon & dress CP.
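The lookup in step S120 can be sketched as below. It is an illustrative reading of the definitions above, under the assumption that the feature tree is available as a set of directed `(parent, child)` edges and that "does not point to any other node" is evaluated among the features extracted from the title; the name `build_feature_set` is hypothetical.

```python
def build_feature_set(features, edges):
    """Split extracted features into single features and related pairs.

    features -- object features extracted from the title
    edges    -- set of (parent, child) tuples: parent points to child
                in the (assumed, pre-built) feature tree
    """
    present = set(features)
    # A related pair exists when one extracted feature points to another.
    pairs = [(a, b) for (a, b) in sorted(edges) if a in present and b in present]
    has_outgoing = {a for (a, _) in pairs}
    # A single feature is one that points to no other extracted feature.
    singles = [f for f in features if f not in has_outgoing]
    return singles, pairs


tree_edges = {("dress CP", "chiffon")}
singles, pairs = build_feature_set(["white", "chiffon", "dress CP"], tree_edges)
print(singles)  # ['white', 'chiffon']
print(pairs)    # [('dress CP', 'chiffon')]
```

With the single edge from "dress CP" to "chiffon", the result matches the feature set given in the example: white, chiffon, and the pair chiffon & dress CP.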

The construction of the feature tree is described in detail later with reference to Fig. 2 to Fig. 4.

Next, at step S130, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set is obtained from the feature-category probability distribution compiled in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree.

In the embodiments of the present application, the feature-category probability distribution is the correspondence between features and categories, compiled in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree; that is, it gives the probability that a given feature belongs to a given category. The process of compiling the feature-category probability distribution is described later with reference to Fig. 4.

For the data object to be predicted, once the corresponding feature set has been obtained at step S120, the category probability distribution corresponding to each object feature pair or object feature in the feature set can be looked up at step S130 in the feature-category probability distribution compiled in advance. In some cases, a feature corresponding to a particular object feature pair or object feature in the feature set cannot be found in the feature-category probability distribution. According to an embodiment of the present application, the category probability distribution of such an object feature pair or object feature can default to zero.

Next, at step S140, the predicted category set of the data object to be predicted is determined according to the category probability distributions.

Specifically, in one embodiment of the present application, the scores given by the category probability distributions of the features in the feature set are added up per category, the categories are sorted by score from high to low, and the sorted categories are output as the predicted category set.

According to an embodiment of the present application, the category probability distribution can be a Dirichlet distribution. The category probability distribution can of course also be expressed with any other probability distribution function known in the art or developed in the future, for example a beta distribution.

For example, for the title "supply white chiffon dress" above, the resulting category probability distribution (a Dirichlet distribution in this example), that is, the predicted category set, can be:

Category	Score
Dresses	0.9802
Children's skirts	0.0098
Garment processing	0.0054
Plus-size women's clothing	0.0027
Wedding dresses and formal gowns	0.0005
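Steps S130 and S140 can be sketched together as a lookup followed by a per-category sum and a sort. The probabilities below are made-up illustration values, not data from the application, and `predict_categories` is a hypothetical name.

```python
from collections import defaultdict


def predict_categories(feature_set, feature_category_probs):
    """Sum per-category scores over all features and pairs, then rank.

    feature_set            -- single features (str) and related pairs (tuple)
    feature_category_probs -- hypothetical pre-computed mapping:
                              feature or pair -> {category: probability}
    """
    scores = defaultdict(float)
    for key in feature_set:
        # A feature or pair missing from the table contributes nothing,
        # i.e. its category probability distribution defaults to zero.
        for category, p in feature_category_probs.get(key, {}).items():
            scores[category] += p
    # Highest score first; the sorted list is the predicted category set.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Illustrative values only.
probs = {
    "white": {"Dresses": 0.5, "Garment processing": 0.5},
    ("dress CP", "chiffon"): {"Dresses": 1.0},
}
ranking = predict_categories(
    ["white", "chiffon", ("dress CP", "chiffon")], probs
)
print(ranking)  # [('Dresses', 1.5), ('Garment processing', 0.5)]
```

Here "chiffon" has no entry in the table, so it is skipped, which corresponds to the default-to-zero handling described for step S130.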

The entire processing procedure of the method for predicting the category of a data object according to one embodiment of the present application has now been described with reference to Fig. 1.

According to the inventive concept of the present application, a tree-augmented naive Bayes network is built in advance, using the information on existing data objects in the database as raw training data, and is then used for category prediction on data objects. The method for building the feature tree in advance according to embodiments of the present application is described in more detail below with reference to Fig. 2 to Fig. 4.

Fig. 2 is a flowchart of a method 200 for building a feature tree according to one embodiment of the present application. As shown in Fig. 2, at step S210, at least one original feature is extracted from the existing data objects in the database.

Specifically, the website server can process all existing data objects in the database and extract at least one corresponding original feature from each data object. In a typical embodiment, at least one original feature is extracted from the title of each existing data object. For ease of description, the embodiments of building the feature tree are all described below using extraction from the titles of existing data objects as the example. Those skilled in the art will understand, however, that in other embodiments of the present application original features can also be extracted from the abstract, details, attributes, or other information of the existing data objects.

More specifically, word segmentation can be applied to the title of each data object, and each word obtained is used as an original feature corresponding to that data object. The present application places no restriction on the word segmentation, which can be done in any manner known in the art and is therefore not described further here.

In other embodiments of the present application, part-of-speech tagging can additionally be applied to the extracted original features. In a preferred embodiment, the product words identified by tagging are given a preset label. Features can then be distinguished more finely by attribute, which enriches the feature tree and helps improve the accuracy of the overall category prediction.

It should be noted that the feature extraction used when building the feature tree should be the same as the feature extraction used during category prediction. For example, if product words are labeled when the feature tree is built, the product words in the title of the object to be predicted must also be labeled during category prediction. If product words are not labeled when the feature tree is built, the product words in the title of the object to be predicted need not be labeled during category prediction either.

Next, at step S220, the relevance between every two features among the original features is determined.

According to one embodiment of the present application, the relevance between two features can be determined from the mutual information of the two original features under the different categories.

Specifically, after all original features corresponding to the existing data objects in the database have been obtained, it can be determined whether any two of these original features have occurred in the same title, that is, whether the two original features have co-occurred under the same category. The number of times the two original features co-occur under each category is then counted, which gives the category distribution of the feature pair. In addition, the number of times each original feature occurs under each category is counted, which gives the category distribution of each original feature.

The mutual information of the two features under each category can then be computed from the category distribution of the feature pair and the category distributions of the two individual features, as shown in formula (1). Mutual information is a useful information measure in information theory that describes the correlation between two sets of events. Here it is used to describe the relevance between two features. Formula (1) below shows how the mutual information is computed.
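Formula (1) itself is not reproduced in this text. A form consistent with the symbol definitions that follow, assuming the standard mutual-information term for the joint occurrence of two features under category i, would be:

```latex
I_i(x;y) = p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)} \tag{1}
```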

where I_i(x;y) denotes the mutual information of feature x and feature y under category i, p(x,y) denotes the probability that feature x and feature y occur together under category i, p(x) denotes the probability that feature x occurs under category i, and p(y) denotes the probability that feature y occurs under category i.

The mutual information of the two features under each category is then added up to obtain the total mutual information of the two features, as shown in formula (2).

I(x;y) = ∑_i I_i(x;y)    (2)

where I(x;y) is the total mutual information of feature x and feature y.

The relevance between the two features can then be determined from their total mutual information. Specifically, if the total mutual information of the two features is less than a predetermined threshold, the relevance between the two features is determined to be low. If the total mutual information of the two features is greater than or equal to the predetermined threshold, the relevance between the two features is determined to be high. The predetermined threshold can be set freely from experience, and the present application places no restriction on it.
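The counting and thresholding of step S220 can be sketched as follows. This is a sketch under stated assumptions rather than the method as claimed: it assumes the per-category term is p(x,y)·log(p(x,y)/(p(x)p(y))), estimates each probability as the fraction of titles in a category that contain the feature (or both features), and uses a hypothetical threshold and toy data.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def total_mutual_information(samples):
    """samples: iterable of (features, category). Returns {(x, y): I(x;y)}."""
    titles_per_cat = Counter()
    single = defaultdict(Counter)  # category -> feature -> number of titles
    joint = defaultdict(Counter)   # category -> (x, y) -> number of titles
    for features, category in samples:
        unique = sorted(set(features))  # each word counted once per title
        titles_per_cat[category] += 1
        single[category].update(unique)
        joint[category].update(combinations(unique, 2))

    total = defaultdict(float)
    for category, pair_counts in joint.items():
        n = titles_per_cat[category]
        for (x, y), c_xy in pair_counts.items():
            p_xy = c_xy / n
            p_x = single[category][x] / n
            p_y = single[category][y] / n
            # Per-category term, summed over categories as in formula (2).
            total[(x, y)] += p_xy * math.log(p_xy / (p_x * p_y))
    return dict(total)


# Toy training data (illustrative only).
samples = [
    (["chiffon", "dress"], "Dresses"),
    (["chiffon", "dress"], "Dresses"),
    (["white", "skirt"], "Dresses"),
    (["white", "skirt"], "Dresses"),
]
mi = total_mutual_information(samples)
threshold = 0.3  # hypothetical; the description leaves the value open
related = {pair for pair, value in mi.items() if value >= threshold}
```

Pairs that never co-occur in any title do not appear in the result at all, which is equivalent to treating their relevance as below any positive threshold.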

The above describes determining the relevance between two features from the mutual information of the two original features under the different categories. The present application is not limited to this, however, and the relevance between two features can be determined in any suitable manner known in the art or developed in the future. For example, in other embodiments of the present application, the relevance between two features can be determined by comparing the search results obtained when each of the two features is used as a search term. Specifically, when the similarity of the search results is high, the relevance between the two features can be determined to be high. When the similarity of the search results is low, the relevance between the two features can be determined to be low.

Referring again to Fig. 2, after the relevance between every two features among the original features has been determined, at step S230 a feature tree with the original features as nodes is built according to the relevance between the features.

Specifically, the original features can be used as the nodes of the feature tree, and the nodes corresponding to related original features are then connected according to the pairwise relevance obtained in step S220, thereby building the feature tree.

According to a preferred embodiment of the present application, a step of filtering the original features (not shown in the figure) can also be included after the pairwise relevance has been determined and before the feature tree is built, that is, after step S220 and before step S230. Specifically, feature pairs whose relevance is below the predetermined threshold can be filtered out, and only the features in pairs whose relevance is greater than or equal to the predetermined threshold are used as nodes to build the feature tree. The accuracy of category prediction can thereby be improved.

A method 300 for building a feature tree according to a more specific embodiment of the present application is described below with reference to Fig. 3. Method 300 describes the process of building the feature tree in detail.

As shown in Fig. 3, at step S310, with the original features as nodes, the nodes that are related are connected according to the pairwise relevance between features, and a maximum spanning tree is generated.

Then, at step S320, any node in the maximum spanning tree is chosen as the root node, and a topological tree is generated on the basis of the maximum spanning tree.

It should be noted that the topological tree is a directed tree structure. After any node in the maximum spanning tree has been chosen as the root node, the root node points to the other nodes that are related to it, which are called child nodes. Each of these child nodes in turn points to the other nodes that are related to it, and so on, which generates the topological tree. In effect, the topological tree is the maximum spanning tree with directions added between the nodes.

Next, at step S330, connections between nodes are added without changing the topological structure of the topological tree, so that each node can be connected with more than two and up to 100 other nodes, which yields the feature tree.

Specifically, not changing the topological structure of the topological tree means not changing the directions between the nodes of the topological tree. In typical applications of tree-augmented naive Bayes networks, when connections between nodes are added, a node is usually allowed to connect with at most two other nodes. In the embodiments of the present application, a node can be connected with more other nodes. In a preferred embodiment of the present application, when connections between nodes are added, a node can be connected with more than two and up to 100 other nodes. A better-optimized, richer feature tree (which can be understood as an extended tree-augmented naive Bayes network) is thereby obtained that covers feature combinations more comprehensively, which greatly improves the accuracy of category prediction.
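One possible reading of step S330 is sketched below. The description does not say how the direction of an added connection is chosen, so this sketch assumes that each extra edge runs from the node that appears earlier in the root-first order to the node that appears later, which leaves the existing directions untouched. The name `add_extra_links` and the example pairs are hypothetical.

```python
from collections import defaultdict


def add_extra_links(directed_tree, ranked_pairs, max_links=100):
    """Add connections between related nodes, up to max_links per node.

    directed_tree -- (parent, child) edges listed root-first
    ranked_pairs  -- related (a, b) pairs, most relevant first
    max_links     -- cap on connections per node (100 in the description)
    """
    # Root-first position of each node; used only to pick a direction.
    order = {}
    for parent, child in directed_tree:
        order.setdefault(parent, len(order))
        order.setdefault(child, len(order))

    edges = list(directed_tree)
    existing = {frozenset(e) for e in edges}
    degree = defaultdict(int)
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1

    for a, b in ranked_pairs:
        if a == b or a not in order or b not in order:
            continue
        if frozenset((a, b)) in existing:
            continue
        if degree[a] >= max_links or degree[b] >= max_links:
            continue
        # Assumed rule: earlier node points to later node.
        src, dst = (a, b) if order[a] < order[b] else (b, a)
        edges.append((src, dst))
        existing.add(frozenset((a, b)))
        degree[a] += 1
        degree[b] += 1
    return edges


tree = [("dress CP", "chiffon"), ("chiffon", "white")]
feature_tree = add_extra_links(tree, [("white", "dress CP")], max_links=100)
print(feature_tree[-1])  # ('dress CP', 'white')
```

Setting `max_links=2` would approximate the conventional limit mentioned above, while the larger cap lets one node take part in many more feature pairs.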

So far Fig. 2 and Fig. 3 is combined to describe the process of construction feature tree.After constructing characteristics tree, it is also necessary to count Feature-classification probability distribution, for being used in the prediction of follow-up classification.With reference to Fig. 4 descriptions according to one implementation of the application The method 400 for statistical nature-classification probability distribution of example.

As shown in Fig. 4, at step S410, an original feature set corresponding to each data object is obtained from the feature tree. The original feature set includes original feature pairs, i.e. pairs of original features that are associated with each other, and single original features, i.e. original features that have no association with any other original feature.

Specifically, for the title of each existing data object in the database, at least one original feature may be extracted from the title. The pre-built feature tree described above is then searched for the associated original feature pairs among those original features and for the single original features that have no association with the other original features. As described earlier, an associated original feature pair is two original features between which a pointing relationship exists in the feature tree, and a single original feature without association is an original feature that does not point to any other node in the feature tree. The original feature set corresponding to each data object can thereby be obtained.
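The lookup described above might be sketched as follows. The sketch assumes the feature tree is held as a list of directed (source, target) edges and treats a feature as single when it takes part in no pair among the features extracted from the same title; the function name `lookup_feature_set` and this reading of a single feature are assumptions.

```python
def lookup_feature_set(features, tree_edges):
    """Split extracted features into associated pairs and single features."""
    unique = list(dict.fromkeys(features))        # drop repeated words, keep order
    present = set(unique)
    # A pair exists when the feature tree has a pointing relationship between two features.
    pairs = [(src, dst) for src, dst in tree_edges
             if src in present and dst in present]
    paired = {f for pair in pairs for f in pair}
    singles = [f for f in unique if f not in paired]
    return pairs, singles

edges = [("dress", "lace"), ("dress", "summer"), ("summer", "beach")]
pairs, singles = lookup_feature_set(["summer", "dress", "cotton", "dress"], edges)
```

Here the title words "summer" and "dress" form one pair because the tree contains the edge from "dress" to "summer", while "cotton" has no partner and is returned as a single feature.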

Then, at step S420, according to the original feature set and the data object category corresponding to each data object, the total number of times each original feature pair or original feature in the original feature sets occurs across all categories and the number of times it occurs under each category are counted, so as to obtain the feature-category probability distribution.

Specifically, each original feature pair or single original feature in the original feature sets obtained in step S410 is taken as a piece of sample data. For each piece of sample data, the total number of occurrences across the data object categories corresponding to the data objects and the number of occurrences under each category are counted. The probability of each piece of sample data under each category can thus be obtained, which yields the category probability distribution of each piece of sample data, i.e. the correspondence between a feature and its category probability distribution (the feature-category probability distribution). A feature here may be an original feature pair or a single original feature.
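The counting in step S420 can be illustrated with a short sketch. It assumes each data object has already been reduced to a list of items (feature pairs as tuples, single features as strings) together with its category; the function name, the sample data and the category labels are hypothetical.

```python
from collections import Counter, defaultdict

def feature_category_distribution(samples):
    """samples: iterable of (items, category); returns item -> {category: probability}."""
    per_item = defaultdict(Counter)
    for items, category in samples:
        for item in set(items):               # count an item once per data object
            per_item[item][category] += 1

    distribution = {}
    for item, counts in per_item.items():
        total = sum(counts.values())          # total occurrences across all categories
        distribution[item] = {c: n / total for c, n in counts.items()}
    return distribution

samples = [
    ([("dress", "summer"), "cotton"], "Dresses"),
    ([("dress", "summer")], "Dresses"),
    ([("dress", "summer"), "cotton"], "Senior womenswear"),
    (["cotton"], "T-shirts"),
]
distribution = feature_category_distribution(samples)
```

Each probability is the count under one category divided by the total count of that item, so the values for any one item sum to 1.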

In a preferred embodiment, when counting the number of times each piece of sample data occurs under different categories, categories with a large number of occurrences may be down-weighted, so that popular categories do not introduce a bias merely because they have a large number of corresponding features. For example, suppose the feature "dress" is counted 100 times under the category "Dresses" but only 10 times under the category "Middle-aged and elderly women's clothing". Assuming that the average number of times each feature occurs under each category is 20, the count of the feature "dress" under the category "Dresses" may, for example, be reduced by a factor of 100/20 = 5, thereby down-weighting the "Dresses" category. This counteracts category bias, ensures that more relevant categories are covered, and improves the accuracy of category prediction.
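The arithmetic of this example can be written out as a small sketch. It follows one literal reading of the example, in which a count above the average is divided by the ratio of the count to the average; the text presents this only as an example, so the rule, the function name `down_weight` and the shortened category label are assumptions.

```python
def down_weight(counts, average):
    """Divide each above-average count by (count / average); leave the rest unchanged."""
    adjusted = {}
    for category, count in counts.items():
        factor = count / average              # 100 / 20 = 5 in the example
        adjusted[category] = count / factor if factor > 1 else count
    return adjusted

counts = {"Dresses": 100, "Senior womenswear": 10}
adjusted = down_weight(counts, average=20)
```

Under this reading the count for "Dresses" drops from 100 to 20, while the count of 10 is left as it is, so the popular category no longer dominates by a factor of ten.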

According to a preferred embodiment of the present application, when the category probability distribution is obtained, the Dirichlet distribution of a feature over the categories may be calculated. It should be understood that the present application is not limited to the Dirichlet distribution; any other way of expressing a probability distribution may also be used.
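The text does not say how the Dirichlet distribution is parameterized. One common way to use it, shown here purely as an assumption, is to place a symmetric Dirichlet prior with concentration `alpha` over the categories and take the posterior mean of the category probabilities given the observed counts, which amounts to additive smoothing.

```python
def dirichlet_posterior_mean(counts, categories, alpha=1.0):
    """Posterior mean of category probabilities under a symmetric Dirichlet(alpha) prior."""
    total = sum(counts.get(c, 0) for c in categories) + alpha * len(categories)
    return {c: (counts.get(c, 0) + alpha) / total for c in categories}

categories = ["Dresses", "Senior womenswear", "T-shirts"]
smoothed = dirichlet_posterior_mean({"Dresses": 20, "Senior womenswear": 10}, categories)
```

With this choice a category never seen with the feature still receives a small non-zero probability, here 1/33 for "T-shirts".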

The method for predicting the category of a data object according to the embodiments of the present application has been described above with reference to Figs. 1 to 4. According to the method of the present application, a tree-augmented naive Bayes network model can be built from the information on the existing data objects in the database and their corresponding categories, and category prediction can be performed on the basis of that model; because the model covers the relevant data of the whole site database, the accuracy of category prediction is improved. In addition, the method uses all non-repeated words obtained after word segmentation as features to build the tree-augmented naive Bayes network model, which ensures that the prediction of a data object's category is not skewed by certain repeated words and further improves accuracy. Furthermore, when applying the tree-augmented naive Bayes network, the method relaxes the condition on connections between nodes and allows each node to be connected to more other nodes, which greatly enriches the whole network and further improves the accuracy of category prediction.

Corresponding to the above method for predicting the category of a data object, the present application also provides an apparatus for predicting the category of a data object.

Referring to Fig. 5, Fig. 5 is a structural block diagram of an apparatus 500 for predicting the category of a data object according to one embodiment of the present application.

As shown in Fig. 5, the apparatus 500 may include a feature extraction module 510, a first acquisition module 520, a second acquisition module 530 and a category determination module 540.

Specifically, the feature extraction module 510 may be used to extract at least one object feature from a data object to be predicted. The first acquisition module 520 may be used to obtain, according to the object features, a feature set from a feature tree built in advance based on the existing data objects in a database and their corresponding data object categories; the feature set includes object feature pairs that are associated with each other among the object features and single object features that have no association with the other object features. The second acquisition module 530 may be used to obtain, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution computed in advance based on the existing data objects in the database, their corresponding data object categories and the feature tree. The category determination module 540 may be used to determine, according to the category probability distributions, a predicted category set of the data object to be predicted.
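The way the four modules hand data to one another can be sketched as a simple composition. The class name `CategoryPredictor`, its field names and the toy callables passed in below are hypothetical; only the order of the four stages comes from the description of apparatus 500.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CategoryPredictor:
    extract_features: Callable        # role of feature extraction module 510
    lookup_feature_set: Callable      # role of first acquisition module 520
    lookup_distributions: Callable    # role of second acquisition module 530
    determine_categories: Callable    # role of category determination module 540

    def predict(self, data_object):
        features = self.extract_features(data_object)
        feature_set = self.lookup_feature_set(features)
        distributions = self.lookup_distributions(feature_set)
        return self.determine_categories(distributions)

# Toy stand-ins, used only to show the data flow between the stages.
predictor = CategoryPredictor(
    extract_features=str.split,
    lookup_feature_set=lambda features: features,
    lookup_distributions=lambda fs: {f: {"Dresses": 1.0} for f in fs},
    determine_categories=lambda d: sorted({c for dist in d.values() for c in dist}),
)
result = predictor.predict("summer dress")
```

Keeping each stage as a separate callable mirrors the module split in Fig. 5 and lets any one stage be replaced without touching the others.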

According to an embodiment of the present application, the first acquisition module 520 may further include (not shown in the figures): an original feature extraction submodule, for extracting at least one original feature from the existing data objects in the database; an association determination submodule, for determining the association between each pair of features among the original features; and a feature tree construction submodule, for building, according to the association between each pair of features, a feature tree in which the original features are the nodes.

According to a more specific embodiment of the present application, the association determination submodule may determine the association between each pair of features according to the mutual information of the pair of original features under different categories.

According to a more specific embodiment of the present application, the association determination submodule may further include (not shown in the figures): a statistics submodule, for computing the mutual information of each pair of features under each category according to the category distribution of the pair of original features and the respective category distributions of the two features; a summation submodule, for adding up the mutual information of the pair of features under each category to obtain the total mutual information of the pair; and a determination submodule, for determining the association between the pair of features according to the total mutual information.
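A minimal sketch of the per-category mutual information and its summation follows. It assumes each feature is treated as a binary presence variable within a data object, computes the mutual information of the two presence variables inside each category, and adds the per-category values without weighting; the exact estimator and any weighting are not specified in this section, so these choices and the name `total_mutual_information` are assumptions.

```python
import math
from collections import defaultdict

def total_mutual_information(samples, a, b):
    """samples: iterable of (features, category). Sum over categories of I(a; b | category)."""
    by_category = defaultdict(list)
    for features, category in samples:
        by_category[category].append(set(features))

    total = 0.0
    for docs in by_category.values():
        n = len(docs)
        for x in (True, False):               # feature a present / absent
            for y in (True, False):           # feature b present / absent
                p_xy = sum((a in d) == x and (b in d) == y for d in docs) / n
                p_x = sum((a in d) == x for d in docs) / n
                p_y = sum((b in d) == y for d in docs) / n
                if p_xy > 0:                  # zero-probability cells contribute nothing
                    total += p_xy * math.log(p_xy / (p_x * p_y))
    return total

dependent = [(["dress", "lace"], "Dresses"), (["dress", "lace"], "Dresses"),
             (["cotton"], "Dresses"), (["cotton"], "Dresses")]
independent = [(["dress", "lace"], "Dresses"), (["dress"], "Dresses"),
               (["lace"], "Dresses"), ([], "Dresses")]
mi_dependent = total_mutual_information(dependent, "dress", "lace")
mi_independent = total_mutual_information(independent, "dress", "lace")
```

Two words that always appear together within a category receive a high value, while two words whose appearances are unrelated receive a value of zero, which is the ordering the tree construction relies on.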

According to a more specific embodiment of the present application, the feature tree construction submodule may further include (not shown in the figures): a first generation submodule, for taking the original features as nodes and connecting the associated nodes according to the association between each pair of features, so as to generate a maximum spanning tree; a second generation submodule, for choosing any node in the maximum spanning tree as the root node and generating a topological tree on the basis of the maximum spanning tree; and an adding submodule, for adding connections between nodes without changing the topological structure of the topological tree, so that each node can be connected to more than two and no more than 100 other nodes, thereby constructing the feature tree.
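The task of the first generation submodule can be illustrated with Kruskal's algorithm run on edges sorted by descending association score. The choice of Kruskal's algorithm with a union-find structure is an assumption of this sketch (the text names only the result, a maximum spanning tree), as are the function name and the sample scores.

```python
def maximum_spanning_tree(nodes, scored_pairs):
    """Kruskal's algorithm on (a, b, score) triples, taking the highest scores first."""
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    tree = []
    for a, b, _score in sorted(scored_pairs, key=lambda t: -t[2]):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:                  # skip edges that would close a cycle
            parent[root_a] = root_b
            tree.append((a, b))
    return tree

nodes = ["dress", "lace", "summer", "beach"]
scored_pairs = [("dress", "lace", 0.9), ("dress", "summer", 0.8),
                ("lace", "summer", 0.7), ("summer", "beach", 0.6),
                ("lace", "beach", 0.1)]
mst = maximum_spanning_tree(nodes, scored_pairs)
```

The pair ("lace", "summer") is skipped even though its score is fairly high, because both words are already connected through "dress"; this is the kind of association that the later step of adding connections can restore.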

According to an embodiment of the present application, the first acquisition module 520 may also include (not shown in the figures): a filtering submodule, for filtering the original features according to the association between each pair of features.

According to an embodiment of the present application, the second acquisition module 530 may further include (not shown in the figures): a set acquisition submodule, for obtaining from the feature tree the original feature set corresponding to each data object, the original feature set including original feature pairs that are associated with each other among the original features and single original features that have no association with the other original features; and a category distribution statistics submodule, for counting, according to the original feature set and the data object category corresponding to each data object, the total number of times each original feature pair or original feature in the original feature sets occurs across all categories and the number of times it occurs under each category, so as to obtain the feature-category probability distribution.

According to an embodiment of the present application, the apparatus 500 may also include a filtering module (not shown in the figures), for filtering the object features.

According to an embodiment of the present application, the category determination module 540 may further include (not shown in the figures): a score summation submodule, for adding up, category by category, the scores corresponding to the category probability distributions of the features in the feature set; a sorting submodule, for sorting the categories by score; and an output submodule, for outputting the sorted categories as the predicted category set.
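The three submodules above can be sketched in a few lines. The sketch assumes the score of a feature for a category is simply its probability under that category, and it adds an optional `top_k` cut-off and an alphabetical tie-break; those details, the function name and the sample distribution are hypothetical.

```python
from collections import defaultdict

def predict_categories(feature_set, distribution, top_k=None):
    """Add scores per category, sort by descending score, return the ranked categories."""
    scores = defaultdict(float)
    for item in feature_set:
        for category, probability in distribution.get(item, {}).items():
            scores[category] += probability   # score summation, category by category
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [category for category, _ in ranked][:top_k]

distribution = {
    ("dress", "summer"): {"Dresses": 0.7, "Senior womenswear": 0.3},
    "cotton": {"Dresses": 0.2, "T-shirts": 0.8},
}
predicted = predict_categories([("dress", "summer"), "cotton"], distribution)
```

Features missing from the distribution contribute nothing, so an unseen word in a title leaves the ranking produced by the known features unchanged.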

The apparatus for predicting the category of a data object according to one embodiment of the present application has now been described. The apparatus described above corresponds to the processing of the method for predicting the category of a data object described earlier; for its specific details, reference may therefore be made to that method, and they are not repeated here.

In a typical configuration, a computing device includes one or more processors (CPUs), an input/output interface, a network interface and memory.

The memory may include, among computer-readable media, volatile memory, random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape and magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise" and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, data object or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, data object or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, data object or device that includes the element.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a device or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM and optical memory) that contain computer-usable program code.

The above are merely embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of the present application shall fall within the scope of its claims.

Claims

1. A method for predicting the category of a data object, characterized by comprising:

extracting at least one object feature from a data object to be predicted, and performing part-of-speech tagging on the at least one object feature;

obtaining, according to the object feature and the result of the part-of-speech tagging, a feature set from a feature tree built in advance based on existing data objects in a database and their corresponding data object categories, the feature set comprising object feature pairs that are associated with each other among the object features and single object features that have no association with the other object features;

obtaining, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution computed in advance based on the existing data objects in the database, their corresponding data object categories and the feature tree; and

determining, according to the category probability distributions, a predicted category set of the data object to be predicted.

2. The method according to claim 1, characterized in that the step of building the feature tree in advance based on the existing data objects in the database and their corresponding data object categories further comprises:

extracting at least one original feature from the existing data objects in the database;

determining the association between each pair of features among the original features; and

building, according to the association between each pair of features, a feature tree in which the original features are the nodes.

3. The method according to claim 2, characterized in that the step of determining the association between each pair of features among the original features further comprises:

determining the association between each pair of features according to the mutual information of the pair of original features under different categories.

4. The method according to claim 3, characterized in that the step of determining the association between each pair of features according to the mutual information of the pair of original features under different categories further comprises:

computing the mutual information of each pair of features under each category according to the category distribution of the pair of original features and the respective category distributions of the two features;

adding up the mutual information of the pair of features under each category to obtain the total mutual information of the pair; and

determining the association between the pair of features according to the total mutual information.

5. The method according to claim 2, characterized in that the step of building, according to the association between each pair of features, a feature tree in which the original features are the nodes further comprises:

taking the original features as nodes and connecting the associated nodes according to the association between each pair of features, so as to generate a maximum spanning tree;

choosing any node in the maximum spanning tree as the root node and generating a topological tree on the basis of the maximum spanning tree; and

adding connections between nodes without changing the topological structure of the topological tree, so that each node can be connected to more than two and no more than 100 other nodes, thereby constructing the feature tree.

6. The method according to claim 2, characterized in that, after the step of determining the association between each pair of features and before the step of building a feature tree in which the original features are the nodes, the method further comprises:

filtering the original features according to the association between each pair of features.

7. The method according to any one of claims 1-6, characterized in that the step of computing the feature-category probability distribution in advance based on the existing data objects in the database, their corresponding data object categories and the feature tree further comprises:

obtaining from the feature tree the original feature set corresponding to each data object, the original feature set comprising original feature pairs that are associated with each other among the original features and single original features that have no association with the other original features; and

counting, according to the original feature set and the data object category corresponding to each data object, the total number of times each original feature pair or original feature in the original feature sets occurs across all categories and the number of times it occurs under each category, so as to obtain the feature-category probability distribution.

8. The method according to any one of claims 1-6, characterized in that, after the step of extracting at least one object feature from the data object to be predicted and before the step of obtaining, according to the object feature, the feature set from the feature tree built in advance based on the existing data objects in the database and their corresponding data object categories, the method further comprises:

filtering the object features.

9. The method according to any one of claims 1-6, characterized in that the step of determining, according to the category probability distributions, the predicted category set of the data object to be predicted further comprises:

adding up, category by category, the scores corresponding to the category probability distributions of the features in the feature set;

sorting the categories by score; and

outputting the sorted categories as the predicted category set.

10. An apparatus for predicting the category of a data object, characterized by comprising:

a feature extraction module, for extracting at least one object feature from a data object to be predicted and performing part-of-speech tagging on the at least one object feature;

a first acquisition module, for obtaining, according to the object feature and the result of the part-of-speech tagging, a feature set from a feature tree built in advance based on existing data objects in a database and their corresponding data object categories, the feature set comprising object feature pairs that are associated with each other among the object features and single object features that have no association with the other object features;

a second acquisition module, for obtaining, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution computed in advance based on the existing data objects in the database, their corresponding data object categories and the feature tree; and

a category determination module, for determining, according to the category probability distributions, a predicted category set of the data object to be predicted.

11. The apparatus according to claim 10, characterized in that the first acquisition module further comprises:

an original feature extraction submodule, for extracting at least one original feature from the existing data objects in the database;

an association determination submodule, for determining the association between each pair of features among the original features; and

a feature tree construction submodule, for building, according to the association between each pair of features, a feature tree in which the original features are the nodes.

12. The apparatus according to claim 11, characterized in that the association determination submodule determines the association between each pair of features according to the mutual information of the pair of original features under different categories.

13. The apparatus according to claim 12, characterized in that the association determination submodule further comprises:

a statistics submodule, for computing the mutual information of each pair of features under each category according to the category distribution of the pair of original features and the respective category distributions of the two features;

a summation submodule, for adding up the mutual information of the pair of features under each category to obtain the total mutual information of the pair; and

a determination submodule, for determining the association between the pair of features according to the total mutual information.

14. The apparatus according to claim 11, characterized in that the feature tree construction submodule further comprises:

a first generation submodule, for taking the original features as nodes and connecting the associated nodes according to the association between each pair of features, so as to generate a maximum spanning tree;

a second generation submodule, for choosing any node in the maximum spanning tree as the root node and generating a topological tree on the basis of the maximum spanning tree; and

an adding submodule, for adding connections between nodes without changing the topological structure of the topological tree, so that each node can be connected to more than two and no more than 100 other nodes, thereby constructing the feature tree.

15. The apparatus according to claim 11, characterized in that the first acquisition module further comprises:

a filtering submodule, for filtering the original features according to the association between each pair of features.

16. The apparatus according to any one of claims 10-15, characterized in that the second acquisition module further comprises:

a set acquisition submodule, for obtaining from the feature tree the original feature set corresponding to each data object, the original feature set comprising original feature pairs that are associated with each other among the original features and single original features that have no association with the other original features; and

a category distribution statistics submodule, for counting, according to the original feature set and the data object category corresponding to each data object, the total number of times each original feature pair or original feature in the original feature sets occurs across all categories and the number of times it occurs under each category, so as to obtain the feature-category probability distribution.

17. The apparatus according to any one of claims 10-15, characterized by further comprising:

a filtering module, for filtering the object features.

18. The apparatus according to any one of claims 10-15, characterized in that the category determination module further comprises:

a score summation submodule, for adding up, category by category, the scores corresponding to the category probability distributions of the features in the feature set;

a sorting submodule, for sorting the categories by score; and

an output submodule, for outputting the sorted categories as the predicted category set.