CN104615605B - Method and apparatus for predicting the category of a data object - Google Patents
- Publication number: CN104615605B
- Application number: CN201310542419.5A
- Authority
- CN
- China
- Prior art keywords
- classification
- feature
- primitive character
- objects
- tree
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application relates to a method and apparatus for predicting the category of a data object. The method includes: extracting at least one object feature from a data object to be predicted; obtaining, according to the object features, a feature set from a feature tree built in advance from the existing data objects in a database and their corresponding data object categories, the feature set including object feature pairs that are linked to each other and single object features that have no link to any other object feature; obtaining, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution computed in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree; and determining, according to these category probability distributions, the predicted category set of the data object to be predicted. The scheme of this application can improve the accuracy of category prediction for data objects.
Description
Technical field
This application relates to the field of data processing, and more specifically to a method and apparatus for predicting the category of a data object.
Background art
With the continuing development of online data interaction, many website servers, after obtaining the basic information of a data object (such as its title and attribute descriptions), need to attach the data object to a back-end category. The category then serves as the basis for category navigation in search, statistics along various dimensions, product library construction, and so on. The category of the data object therefore needs to be predicted in order to determine the category associated with that data object.
One prior-art category prediction scheme is based on a category click dictionary, which records a category click distribution for each word, computed from users' historical query words and the category clicks that followed those queries. More specifically, when the category of a data object needs to be predicted, the title of the data object is first segmented into at least one word. The category click distribution of each word is then looked up in the category click dictionary, and the category that appears most often across all words is chosen as the predicted category of the data object.
However, users' category clicks are relatively sparse and cannot cover massive data. Some query words are also inflated by malicious users (certain users repeatedly issue particular queries to raise the click-through rate of information associated with themselves), so the category click data for words is not very accurate, which seriously degrades the accuracy of the categories predicted from it. In addition, the repetition of certain words in a title may cause an inaccurate category to be predicted.
Therefore, an improved category prediction technique is needed to overcome the above problems in the prior art and to improve the accuracy of category prediction for data objects.
Summary of the invention
The purpose of this application is to provide a technique for predicting the category of a data object that can predict the category more accurately and thereby determine the category associated with the data object.
Specifically, according to one aspect of the embodiments of this application, a method for predicting the category of a data object is provided, characterized by including: extracting at least one object feature from a data object to be predicted; obtaining, according to the object features, a feature set from a feature tree built in advance from the existing data objects in a database and their corresponding data object categories, the feature set including object feature pairs that are linked to each other and single object features that have no link to any other object feature; obtaining, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution computed in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree; and determining, according to these category probability distributions, the predicted category set of the data object to be predicted.
According to another aspect of the embodiments of this application, an apparatus for predicting the category of a data object is provided, characterized by including: a feature extraction module for extracting at least one object feature from a data object to be predicted; a first acquisition module for obtaining, according to the object features, a feature set from a feature tree built in advance from the existing data objects in a database and their corresponding data object categories, the feature set including object feature pairs that are linked to each other and single object features that have no link to any other object feature; a second acquisition module for obtaining, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution computed in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree; and a category determination module for determining, according to these category probability distributions, the predicted category set of the data object to be predicted.
Compared with the prior art, the scheme of this application builds a tree-augmented naive Bayes network model (the feature tree) from the existing data objects in a database (such as a website database) and their corresponding categories, and performs category prediction based on that model. The relevant data of the whole site database is therefore covered, which improves the accuracy of category prediction. In addition, the scheme uses all non-repeated words obtained after word segmentation as features to build the tree-augmented naive Bayes network model, which ensures that category prediction for a data object is not biased by repeated words and again improves accuracy. Furthermore, when applying the tree-augmented naive Bayes network, the scheme relaxes the conditions for connecting nodes and allows each node to connect to more other nodes, which greatly enriches the whole network and further improves the accuracy of category prediction.
Brief description of the drawings
The drawings described here provide a further understanding of this application and form part of it. The illustrative embodiments and their descriptions explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for predicting the category of a data object according to one embodiment of this application;
Fig. 2 is a flowchart of a method for building a feature tree according to one embodiment of this application;
Fig. 3 is a flowchart of a method for building a feature tree according to a more specific embodiment of this application;
Fig. 4 is a flowchart of a method for computing the feature-category probability distribution according to one embodiment of this application; and
Fig. 5 is a structural block diagram of an apparatus for predicting the category of a data object according to one embodiment of this application.
Detailed description
The main idea of this application is to use the information on existing data objects in a database (such as a website database) and their corresponding categories as original training data to build a tree-augmented naive Bayes network, and to use that network to predict the category of a data object to be predicted and thereby determine its associated category. Specifically, a feature tree is built from the information on the existing data objects and their corresponding categories, and a feature-category probability distribution is computed from that same information together with the feature tree. The feature tree and the feature-category probability distribution obtained in this way then serve as the basis for category prediction on data objects to be predicted.
A further idea of this application is to optimize, while building the tree-augmented naive Bayes network, the possible connections between its nodes by raising the maximum number of connections each node may have. This prevents the situation in which, because the network is sparse, some features cannot be connected to other features during category prediction, so that the prediction result covers too few related categories or becomes biased because there are too few feature combinations. Specifically, this application departs from the convention that a node in a traditional tree-augmented naive Bayes network may be connected to at most two other nodes, and lets a node connect to more other nodes, for example up to 100. The whole network thus becomes denser and covers the features more fully, which improves the accuracy of category prediction for data objects.
To make the purpose, technical solutions, and advantages of this application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some of the embodiments of this application, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments in this application without creative effort fall within the scope of protection of this application.
The category prediction scheme of this application can be applied to any scenario that requires category prediction, that is, to category prediction for all kinds of data objects. For example, it is suitable for website servers that predict the category (classification) of their business objects or service objects. In a typical application scenario, the scheme can be applied on an e-commerce website server to predict the categories of various products and determine the category associated with each product. It should be noted that this application places no restriction on the application scenario, and the scheme is suitable for any other appropriate category prediction scenario, existing or developed in the future.
Referring to Fig. 1, a flowchart is shown of a method 100 for predicting the category of a data object according to one embodiment of this application.
As shown in Fig. 1, at step S110, at least one object feature is extracted from the data object to be predicted.
Specifically, at least one object feature can be extracted from information such as the title, abstract, details, and attributes of the data object to be predicted. In a typical embodiment, at least one object feature is extracted from the title of the data object to be predicted. For ease of description, the embodiments below are all described using extraction of object features from the title as the example. Those skilled in the art will understand, however, that in other embodiments of this application the object features may also be extracted from the abstract, details, attributes, and other information of the data object to be predicted.
In one embodiment of this application, natural language processing can be used to segment the title of the data object to be predicted into words and extract at least one object feature from it. In other embodiments, the extracted object features may additionally be part-of-speech tagged. In a preferred embodiment, a preset label is attached to each tagged product word, which helps improve the accuracy of the overall category prediction.
More specifically, a term weight (TermWeight) technique can be used to split the title of the data object to be predicted into several words, use those words as object features, and label the product word. For example, if the title of a product is "supply white chiffon dress", the extracted object features may be "supply", "white", "chiffon", and "dress CP", where CP is the preset label for product words in this example; that is, CP marks "dress" as a product word in this embodiment.
According to a preferred embodiment of this application, after several object features are extracted at step S110, they may be filtered, for example against a filter vocabulary set in advance from experience or as needed, to remove object features that are essentially meaningless for prediction, such as words that appear in most titles. This reduces unnecessary computation during category prediction, lowers its computational complexity, and improves prediction efficiency.
For example, in the above example, the feature "supply", which appears in most titles, can be filtered out.
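Step S110 and the optional filtering can be sketched as follows. This is a minimal illustration, not the implementation of this application: the title is assumed to be already segmented into words, and the product-word lexicon `PRODUCT_WORDS`, the filter vocabulary `FILTER_WORDS`, and the function name `extract_features` are hypothetical stand-ins for the segmentation, tagging, and filtering components described above.

```python
# Hypothetical lexicons; a real system would rely on a word segmenter,
# a part-of-speech tagger, and a curated filter vocabulary.
PRODUCT_WORDS = {"dress"}   # words to be labelled as product words
FILTER_WORDS = {"supply"}   # words that occur in most titles


def extract_features(title_tokens):
    """Return filtered, non-repeated features; product words get a 'CP' label."""
    features = []
    seen = set()
    for token in title_tokens:
        if token in FILTER_WORDS or token in seen:
            continue  # drop uninformative words and repeated words
        seen.add(token)
        features.append(f"{token} CP" if token in PRODUCT_WORDS else token)
    return features


features = extract_features(["supply", "white", "chiffon", "dress"])
print(features)  # ['white', 'chiffon', 'dress CP']
```

Dropping repeated words here reflects the statement that only non-repeated words are used as features, so a word that occurs twice in a title does not count twice.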
At step S120, according to the object features, a feature set is obtained from the feature tree built in advance from the titles of the existing data objects in the database and their corresponding data object categories. The feature set includes object feature pairs that are linked to each other and single object features that have no link to any other object feature.
Specifically, in embodiments of this application, the feature tree is a tree-augmented naive Bayes network built from information on the existing data objects in the database (such as their titles and corresponding categories). Its nodes are the features extracted from that information, and it is a directed network structure that captures the topological relationships among the nodes.
After the object features of the data object to be predicted are extracted, the pre-built feature tree can be searched to find which of the object features form linked object feature pairs and which are single object features with no link to other object features.
Here, a "linked object feature pair" means the following: if node A points to node B in the feature tree, nodes A and B are regarded as a linked object feature pair. A "single object feature with no link to other object features" means the following: if node C does not point to any other node in the feature tree, node C is regarded as a single object feature with no link to other object features.
In the above example, suppose the pre-built feature tree shows that, among the features "supply", "white", "chiffon", and "dress CP", only the feature node "dress CP" points to "chiffon", and no other two features are linked (there is no pointing relationship between them). The feature set corresponding to the data object to be predicted may then include: white, chiffon, chiffon & dress CP.
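The lookup at step S120 can be sketched as below. The sketch assumes that the feature tree is available as a set of directed `(parent, child)` edges and interprets "does not point to any other node" as "does not point to any other extracted feature"; both the representation and the function name `build_feature_set` are assumptions made for illustration.

```python
def build_feature_set(features, edges):
    """Split extracted features into single features and linked pairs.

    features: list of extracted object features.
    edges: set of (source, target) directed edges of the feature tree.
    """
    present = set(features)
    # A pair is linked when one extracted feature points to another.
    pairs = [(a, b) for (a, b) in sorted(edges) if a in present and b in present]
    pointing = {a for a, _ in pairs}
    # A feature that points to no other extracted feature stays single.
    singles = [f for f in features if f not in pointing]
    return singles, pairs


singles, pairs = build_feature_set(
    ["white", "chiffon", "dress CP"],
    {("dress CP", "chiffon")},
)
print(singles)  # ['white', 'chiffon']
print(pairs)    # [('dress CP', 'chiffon')]
```

With the single edge from "dress CP" to "chiffon", the result matches the example feature set: white, chiffon, and the pair chiffon & dress CP.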
The construction of the feature tree is described in detail later with reference to Figs. 2 to 4.
Next, at step S130, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set is obtained from the feature-category probability distribution computed in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree.
In embodiments of this application, the feature-category probability distribution is the correspondence between features and categories, computed in advance from the existing data objects in the database, their corresponding data object categories, and the feature tree; that is, it gives the probability that a given feature belongs to a given category. The statistical process for the feature-category probability distribution is described later with reference to Fig. 4.
For a data object to be predicted, after the corresponding feature set is obtained at step S120, the category probability distribution corresponding to each object feature pair or object feature in the feature set can be looked up at step S130 in the feature-category probability distribution computed in advance. In some cases, the feature-category probability distribution may contain no entry for certain object feature pairs or object features in the feature set. According to an embodiment of this application, the category probability distribution for such pairs or features may default to zero.
Next, at step S140, the predicted category set of the data object to be predicted is determined according to the category probability distributions.
Specifically, in one embodiment of this application, the scores given by the category probability distributions of the features in the feature set are summed per category, the categories are sorted by score from high to low, and the sorted categories are output as the predicted category set.
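The summation and ranking just described can be sketched as follows. The dictionary layout of `distributions` and the function name `predict_categories` are assumptions for illustration, and the numbers in the usage example are invented, not taken from this application.

```python
from collections import defaultdict


def predict_categories(feature_set, distributions):
    """Sum per-category scores over the feature set and rank the categories.

    feature_set: list of keys (single features or feature pairs).
    distributions: dict mapping a key to {category: probability}.
    """
    scores = defaultdict(float)
    for key in feature_set:
        # A key with no recorded distribution contributes zero.
        for category, prob in distributions.get(key, {}).items():
            scores[category] += prob
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented example values, for illustration only.
distributions = {
    "white": {"Dresses": 0.2, "Shirts": 0.3},
    ("dress CP", "chiffon"): {"Dresses": 0.9},
}
ranked = predict_categories(
    ["white", "chiffon", ("dress CP", "chiffon")], distributions
)
print([category for category, _ in ranked])  # ['Dresses', 'Shirts']
```

The feature "chiffon" has no entry in this toy table, so it adds nothing to any score, which mirrors the default-to-zero handling described for step S130.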
According to an embodiment of this application, the category probability distribution may be a Dirichlet distribution. It may of course also be expressed using any other probability distribution function known in the art or developed in the future, for example a beta distribution.
For example, for the above title "supply white chiffon dress", the resulting category probability distribution (a Dirichlet distribution in this example), that is, the predicted category set, may be:
Category | Probability |
Dresses | 0.9802 |
Children's skirts | 0.0098 |
Clothing processing | 0.0054 |
Plus-size women's clothing | 0.0027 |
Wedding dresses, formal dresses | 0.0005 |
This completes the description, with reference to Fig. 1, of the entire processing flow of the method for predicting the category of a data object according to one embodiment of this application.
According to the concept of this application, a tree-augmented naive Bayes network is built in advance using the information on the existing data objects in the database as original training data, and is then used for category prediction of data objects. The method for building the feature tree in advance according to embodiments of this application is described in more detail below with reference to Figs. 2 to 4.
Fig. 2 is a flowchart of a method 200 for building a feature tree according to one embodiment of this application. As shown in Fig. 2, at step S210, at least one original feature is extracted from the existing data objects in the database.
Specifically, the website server can process all existing data objects in the database and extract at least one corresponding original feature from each of them. In a typical embodiment, at least one original feature is extracted from the title of each existing data object. For ease of description, the embodiments of building the feature tree below are all described using extraction of original features from the titles of existing data objects as the example. Those skilled in the art will understand, however, that in other embodiments of this application the original features may also be extracted from the abstract, details, attributes, and other information of the existing data objects.
More specifically, the title of each data object can be segmented into words, and each word obtained is used as an original feature corresponding to that data object. This application places no restriction on the word segmentation, which can be done in any manner known in the art and is therefore not described further here.
In other embodiments of this application, the extracted original features may additionally be part-of-speech tagged. In a preferred embodiment, a preset label is attached to each tagged product word. Features can then be distinguished more finely according to their attributes, which enriches the feature tree and helps improve the accuracy of the overall category prediction.
It should be noted that the feature extraction used when building the feature tree should be the same as the feature extraction used at category prediction time. For example, if product words are labelled when the feature tree is built, the product words in the title of the object to be predicted must also be labelled at prediction time. If product words are not labelled when the feature tree is built, the product words in the title of the object to be predicted need not be labelled at prediction time either.
Next, at step S220, the relevance between every two of the original features is determined.
According to one embodiment of this application, the relevance between two features can be determined from the mutual information of the two features under the different categories.
Specifically, after all original features corresponding to the existing data objects in the database are obtained, it can be determined for any two of them whether they have appeared in the same title, that is, whether they have co-occurred under the same category. The number of times the two original features co-occur under each category is then counted, which gives the category distribution of the feature pair. In addition, the number of times each original feature appears under each category is counted, which gives the category distribution of each original feature.
Then, from the category distribution of the feature pair and the category distributions of the two individual features, the mutual information of the pair under each category is computed, as shown in formula (1). Mutual information is a useful measure in information theory that expresses the correlation between two sets of events. Here it is used to describe the relevance between two features. Formula (1) below shows how the mutual information is computed.
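Formula (1) itself is not reproduced in this text. As an assumption only, a standard per-category mutual information term that is consistent with the symbol definitions given for formula (1) and with the summation in formula (2) would read as follows; the exact form used in the original publication may differ.

```latex
I_i(x; y) = p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \tag{1}
```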
where I_i(x; y) denotes the mutual information of feature x and feature y under category i, p(x, y) denotes the probability that feature x and feature y occur together under category i, p(x) denotes the probability that feature x occurs under category i, and p(y) denotes the probability that feature y occurs under category i.
Then, the mutual information of the two features under each category is summed to obtain the total mutual information of the two features, as shown in formula (2).
I(x; y) = Σ_i I_i(x; y)    (2)
where I(x; y) is the total mutual information of feature x and feature y.
The relevance between the two features can then be determined from their total mutual information. Specifically, if the total mutual information of the two features is less than a predetermined threshold, the relevance between them is determined to be low. If the total mutual information is greater than or equal to the predetermined threshold, the relevance between them is determined to be high. The predetermined threshold can be set freely based on experience, and this application places no restriction on it.
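The counting and summation of step S220 can be sketched as below. Two points are assumptions, not statements of this application: the per-category term is taken to be the standard form p(x, y) · log(p(x, y) / (p(x) p(y))), and each probability is estimated as the fraction of titles in the category that contain the feature or the pair. The function name `total_mutual_information` is hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def total_mutual_information(samples):
    """Return {(x, y): I(x; y)} with x < y, summed over categories.

    samples: iterable of (features_of_one_title, category).
    """
    titles_per_category = Counter()
    single = defaultdict(Counter)  # category -> feature -> title count
    joint = defaultdict(Counter)   # category -> (x, y) -> title count
    for features, category in samples:
        unique = sorted(set(features))
        titles_per_category[category] += 1
        single[category].update(unique)
        joint[category].update(combinations(unique, 2))

    total = defaultdict(float)
    for category, pair_counts in joint.items():
        n = titles_per_category[category]
        for (x, y), count in pair_counts.items():
            p_xy = count / n
            p_x = single[category][x] / n
            p_y = single[category][y] / n
            # Assumed per-category term, summed over categories as in formula (2).
            total[(x, y)] += p_xy * math.log(p_xy / (p_x * p_y))
    return dict(total)


samples = [
    (["a", "b"], "c1"),
    (["a", "b"], "c1"),
    (["c"], "c1"),
    (["a"], "c1"),
]
mi = total_mutual_information(samples)
threshold = 0.1  # hypothetical value; the text leaves the threshold open
related = {pair for pair, value in mi.items() if value >= threshold}
print(related)  # {('a', 'b')}
```

Pairs that never co-occur in any title simply have no entry, which is equivalent to treating their relevance as below any positive threshold.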
The above describes determining the relevance between two original features from their mutual information under the different categories, but this application is not limited to this; any appropriate method known in the art or developed in the future may be used to determine the relevance between two features. For example, in other embodiments of this application, the relevance between two features can be determined by comparing the search results obtained when each of the two features is used as a search term. Specifically, when the similarity of the search results is high, the relevance between the two features is determined to be high. When the similarity of the search results is low, the relevance between the two features is determined to be low.
Referring again to Fig. 2, after the relevance between every two original features is determined, a feature tree whose nodes are the original features is built at step S230 according to that relevance.
Specifically, the original features are used as the nodes of the feature tree, and the nodes corresponding to related original features are then connected according to the pairwise relevance obtained at step S220, thereby building the feature tree.
According to a preferred embodiment of this application, after the pairwise relevance is determined and before the feature tree is built, that is, after step S220 and before step S230, a step of filtering the original features may also be included (not shown in the figure). Specifically, feature pairs whose relevance is below the predetermined threshold are filtered out, and only the features in pairs whose relevance is greater than or equal to the predetermined threshold are used as nodes to build the feature tree. This can improve the accuracy of category prediction.
A method 300 for building a feature tree according to a more specific embodiment of this application is described below with reference to Fig. 3. Method 300 describes the process of building the feature tree in detail.
As shown in Fig. 3, at step S310, with the original features as nodes, the nodes that are related to each other are connected according to the pairwise relevance, and a maximum spanning tree is generated.
Then, at step S320, any node in the maximum spanning tree is chosen as the root node, and a topological tree is generated on the basis of the maximum spanning tree.
It should be noted that the topological tree is a directed tree structure. Once any node in the maximum spanning tree has been chosen as the root node, the root node points to the other nodes related to it, which are called child nodes. Each child node in turn points to the other nodes related to it, and so on, which yields the topological tree. In other words, the topological tree is the maximum spanning tree with directions added between its nodes.
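Steps S310 and S320 can be sketched as below. The text does not name a spanning-tree algorithm, so Kruskal's algorithm with heaviest edges first and a breadth-first orientation from the root are assumptions made for illustration, as are the function names `maximum_spanning_tree` and `orient`.

```python
from collections import defaultdict, deque


def maximum_spanning_tree(weights):
    """Kruskal's algorithm, taking the most relevant edges first.

    weights: {(x, y): relevance} for the related feature pairs.
    """
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    tree = []
    for (x, y), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:  # the edge joins two components, so keep it
            parent[root_x] = root_y
            tree.append((x, y))
    return tree


def orient(tree_edges, root):
    """Direct every tree edge away from the chosen root (breadth-first)."""
    neighbours = defaultdict(list)
    for x, y in tree_edges:
        neighbours[x].append(y)
        neighbours[y].append(x)
    directed, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        for nxt in neighbours[node]:
            if nxt not in seen:
                seen.add(nxt)
                directed.append((node, nxt))
                queue.append(nxt)
    return directed


weights = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.1}
tree = maximum_spanning_tree(weights)
print(orient(tree, "a"))  # [('a', 'b'), ('b', 'c')]
```

If the related features do not form a single connected graph, the result is a forest, and `orient` only reaches the component that contains the chosen root; a fuller implementation would pick one root per component.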
Next, at step S330, connections between nodes are added without changing the topology of the topological tree, so that each node can be connected to more than two and at most 100 other nodes, thereby building the feature tree.
Specifically, not changing the topology of the topological tree means not changing the directions between its nodes. In usual applications of the tree-augmented naive Bayes network, when connections between nodes are added, a node is typically allowed to connect to at most two other nodes. In embodiments of this application, a node can be connected to more other nodes. In a preferred embodiment, when connections between nodes are added, a node can be connected to more than two and at most 100 other nodes. This yields a more optimized, richer feature tree (which can be understood as an extended tree-augmented naive Bayes network) that covers feature combinations more fully and thereby greatly improves the accuracy of category prediction.
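The text does not spell out how the extra connections of step S330 are chosen. One possible reading, sketched below purely as an assumption, is to add the remaining related pairs in descending order of relevance, to direct each new edge from the node that comes earlier in a traversal order of the topological tree to the later one (so existing directions are untouched and no cycle can appear), and to skip an edge when either endpoint already has the maximum number of connections. The names `add_extra_edges`, `order`, and `max_degree` are hypothetical.

```python
def add_extra_edges(directed_tree, weights, order, max_degree=100):
    """Add non-tree edges without changing existing edge directions.

    directed_tree: list of (parent, child) edges of the topological tree.
    weights: {(x, y): relevance} for the related feature pairs.
    order: nodes in root-first traversal order of the topological tree.
    max_degree: assumed cap on the number of connections per node.
    """
    rank = {node: position for position, node in enumerate(order)}
    edges = set(directed_tree)
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1

    for (x, y), _ in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        if x not in rank or y not in rank:
            continue  # node is not part of the topological tree
        a, b = (x, y) if rank[x] < rank[y] else (y, x)
        if (a, b) in edges:
            continue  # already connected by a tree edge
        if degree.get(a, 0) >= max_degree or degree.get(b, 0) >= max_degree:
            continue  # one endpoint is already saturated
        edges.add((a, b))
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return edges


tree = [("a", "b"), ("b", "c")]
weights = {("a", "b"): 0.9, ("b", "c"): 0.5, ("a", "c"): 0.1}
print(sorted(add_extra_edges(tree, weights, ["a", "b", "c"])))
# [('a', 'b'), ('a', 'c'), ('b', 'c')]
```

Because every edge points from an earlier node to a later node in `order`, the enriched structure remains a directed acyclic graph that contains the original topological tree unchanged.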
The process of building the feature tree has now been described with reference to Fig. 2 and Fig. 3. After the feature tree is built, a feature-category probability distribution also needs to be computed for use in subsequent category prediction. A method 400 for computing the feature-category probability distribution according to one embodiment of this application is described below with reference to Fig. 4.
As shown in Fig. 4, at step S410, an original feature set corresponding to each data object is obtained from the feature tree. The original feature set includes original feature pairs, in which two original features are connected, and single original features that have no connection with the other original features.
Specifically, at least one original feature may be extracted from the title of each existing data object in the database. Then, the feature tree built in advance as described above is searched for the original feature pairs in which the original features are connected and for the single original features that have no connection with the other original features. As noted earlier, a connected original feature pair here means two original features between which a pointing relationship exists in the feature tree, and an unconnected single original feature means a single original feature that does not point to any other node in the feature tree. The original feature set corresponding to each data object can thus be obtained.
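The lookup in step S410 might be sketched in Python as follows. The sketch assumes the feature tree is stored as a set of directed edges and treats a feature as "single" when no tree edge links it to another feature extracted from the same title; that reading, and the name lookup_feature_set, are assumptions.

```python
def lookup_feature_set(features, tree_edges):
    """Split one object's features into connected pairs and single features.

    features: features extracted from one data object's title.
    tree_edges: set of directed (source, target) edges of the feature tree.
    """
    present = set(features)
    # A pair is kept when the tree holds an edge between two extracted features.
    pairs = sorted((s, t) for s, t in tree_edges if s in present and t in present)
    linked = {f for pair in pairs for f in pair}
    singles = sorted(present - linked)
    return pairs, singles

tree_edges = {("dress", "summer"), ("dress", "lace"), ("lace", "white")}
pairs, singles = lookup_feature_set(["summer", "dress", "cotton"], tree_edges)
# pairs holds ("dress", "summer"); "cotton" is left as a single feature.
```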
Then, at step S420, based on the original feature set and the data object category corresponding to each data object, the total number of times each original feature pair or original feature in the original feature sets appears across all categories and the number of times it appears under each category are counted, so as to obtain the feature-category probability distribution.
Specifically, each original feature pair or single original feature in the original feature sets obtained in step S410 is treated as a sample datum. For each sample datum, the total number of times it appears under the data object categories corresponding to the data objects and the number of times it appears under each category are counted. This gives the probability of each sample datum under each category and hence the category probability distribution of each sample datum, that is, the correspondence between features and category probability distributions (the feature-category probability distribution). A feature here may be an original feature pair or a single original feature.
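A minimal Python sketch of the counting in step S420 is given below. It assumes each data object contributes its feature pairs and single features once, together with its category; the function name category_distributions and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def category_distributions(samples):
    """Estimate a category probability distribution for every feature key.

    samples: iterable of (feature_keys, category), one entry per data object,
    where a feature key is either a feature pair (tuple) or a single feature.
    """
    per_category = defaultdict(Counter)  # key -> category -> occurrence count
    for keys, category in samples:
        for key in keys:
            per_category[key][category] += 1

    distributions = {}
    for key, counts in per_category.items():
        total = sum(counts.values())  # total occurrences across all categories
        distributions[key] = {c: n / total for c, n in counts.items()}
    return distributions

samples = [
    ([("dress", "summer"), "cotton"], "dresses"),
    ([("dress", "summer")], "dresses"),
    ([("dress", "summer"), "cotton"], "womens clothing"),
    (["cotton"], "t-shirts"),
]
dist = category_distributions(samples)
```

Each probability is simply the count under one category divided by the total count, which is the plain relative-frequency reading of the paragraph above.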
In a preferred embodiment, when the number of times each sample datum appears under different categories is counted, categories with more occurrences may be down-weighted to prevent popular categories from causing bias merely because they have a large number of corresponding features. For example, suppose the feature "dress" is counted 100 times under the category "dress" but only 10 times under the category "middle-aged and elderly women's clothing". Assuming each feature appears 20 times per category on average, the number of times the feature "dress" appears under the category "dress" can be reduced, for example, by a factor of 100/20 = 5, which down-weights the "dress" category. This counteracts category bias, ensures that more relevant categories are covered, and improves the accuracy of category prediction.
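Read literally, the example divides a count of 100 by the factor 100/20 = 5, leaving 20. The Python sketch below follows that literal reading; leaving counts at or below the average unchanged is an assumption, since the text only discusses the popular category, and the name down_weight is hypothetical.

```python
def down_weight(counts, average):
    """Scale down category counts that exceed the average occurrence count.

    counts: dict category -> number of times a feature appears under it.
    average: average number of occurrences of a feature per category.
    """
    adjusted = {}
    for category, n in counts.items():
        # Assumption: only counts above the average are reduced.
        factor = n / average if n > average else 1.0
        adjusted[category] = n / factor
    return adjusted

adjusted = down_weight(
    {"dress": 100, "middle-aged and elderly women's clothing": 10},
    average=20,
)
```

Under this reading the popular category is capped at the average, so other schemes (for example a logarithmic damping) would be equally consistent with the phrase "for example" in the text.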
According to a preferred embodiment of this application, when the category probability distribution is obtained, the Dirichlet distribution of the features over the categories may be calculated. It should be understood that this application is not limited to the Dirichlet distribution; any other way of expressing a probability distribution may also be used.
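The text does not say how the Dirichlet distribution is used. One common choice, shown here only as an assumption, is to place a symmetric Dirichlet prior over the categories and take the posterior mean, which amounts to additive smoothing of the counts. The name dirichlet_posterior_mean and the parameter alpha are hypothetical.

```python
def dirichlet_posterior_mean(counts, categories, alpha=1.0):
    """Posterior mean of category probabilities under a symmetric Dirichlet prior.

    counts: dict category -> observed count for one feature.
    categories: list of all categories, including those never observed.
    alpha: concentration parameter of the prior (assumed value).
    """
    total = sum(counts.get(c, 0) for c in categories)
    denom = total + alpha * len(categories)
    return {c: (counts.get(c, 0) + alpha) / denom for c in categories}

probs = dirichlet_posterior_mean(
    {"dresses": 20, "tops": 10},
    ["dresses", "tops", "shoes"],
    alpha=1.0,
)
```

With this choice a category never seen with a feature still receives a small non-zero probability instead of zero.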
The method for predicting the category of a data object according to the embodiments of this application has been described above with reference to Figs. 1 to 4. According to the method of this application, a tree-augmented naive Bayesian network model can be built from the information on existing data objects in the database and their corresponding categories, and category prediction can be performed on the basis of that model, so that the relevant data of the entire site database is covered and the accuracy of category prediction is improved. In addition, the method uses all the distinct words obtained after word segmentation as features to build the tree-augmented naive Bayesian network model, which ensures that repeated words do not skew the prediction of a data object's category and thus improves prediction accuracy. Furthermore, when using the tree-augmented naive Bayesian network, the method relaxes the conditions on connections between nodes and allows each node to be connected to more other nodes, which greatly enriches the whole tree-augmented naive Bayesian network and further improves the accuracy of category prediction.
Corresponding to the above method for predicting the category of a data object, this application also provides an apparatus for predicting the category of a data object.
Fig. 5 is a structural block diagram of an apparatus 500 for predicting the category of a data object according to one embodiment of this application.
As shown in Fig. 5, the apparatus 500 may include a feature extraction module 510, a first acquisition module 520, a second acquisition module 530, and a category determination module 540.
Specifically, the feature extraction module 510 may be configured to extract at least one object feature from a data object to be predicted. The first acquisition module 520 may be configured to obtain, according to the object features, a feature set from a feature tree built in advance from the existing data objects in the database and the corresponding data object categories; the feature set includes object feature pairs in which object features are connected and single object features that have no connection with the other object features. The second acquisition module 530 may be configured to obtain, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution computed in advance from the existing data objects in the database, the corresponding data object categories, and the feature tree. The category determination module 540 may be configured to determine, according to the category probability distributions, the predicted category set of the data object to be predicted.
According to an embodiment of this application, the first acquisition module 520 may further include (not shown in the figure): an original feature extraction submodule, configured to extract at least one original feature from the existing data objects in the database; a relevance determination submodule, configured to determine the relevance between each pair of features among the original features; and a feature tree building submodule, configured to build, according to the relevance between each pair of features, a feature tree whose nodes are the original features.
According to a more specific embodiment of this application, the relevance determination submodule may determine the relevance between each pair of features according to the mutual information of each pair of original features under different categories.
According to a more specific embodiment of this application, the relevance determination submodule may further include (not shown in the figure): a statistics submodule, configured to compute the mutual information of each pair of features under each category according to the category distribution of the pair of original features and the individual category distributions of the two features; a summation submodule, configured to add up the mutual information of the pair of features under each category to obtain the total mutual information of the pair; and a determination submodule, configured to determine the relevance between the pair of features according to the total mutual information.
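The per-category mutual information and its sum could be computed along the following lines. This section does not give the exact formula, so the Python sketch below uses only the co-occurrence term of conditional mutual information, P(a, b, c) * log(P(a, b | c) / (P(a | c) * P(b | c))), as an assumed stand-in; the name total_mutual_information and the sample data are hypothetical.

```python
import math
from collections import Counter

def total_mutual_information(objects, a, b):
    """Sum, over categories, a mutual-information term for features a and b.

    objects: list of (set_of_features, category), one entry per data object.
    """
    n = len(objects)
    cat_count = Counter()
    a_count, b_count, ab_count = Counter(), Counter(), Counter()
    for features, category in objects:
        cat_count[category] += 1
        if a in features:
            a_count[category] += 1
        if b in features:
            b_count[category] += 1
        if a in features and b in features:
            ab_count[category] += 1

    total = 0.0
    for category, n_c in cat_count.items():
        if ab_count[category] == 0:
            continue  # no co-occurrence under this category, term is zero
        p_abc = ab_count[category] / n
        p_ab_given_c = ab_count[category] / n_c
        p_a_given_c = a_count[category] / n_c
        p_b_given_c = b_count[category] / n_c
        total += p_abc * math.log(p_ab_given_c / (p_a_given_c * p_b_given_c))
    return total

objects = [
    ({"dress", "summer"}, "dresses"),
    ({"dress", "summer"}, "dresses"),
    ({"lace"}, "dresses"),
    ({"dress"}, "tops"),
    ({"summer"}, "tops"),
]
score = total_mutual_information(objects, "dress", "summer")
```

Pairs with a higher total would then be treated as more strongly related when the tree is built.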
According to a more specific embodiment of this application, the feature tree building submodule may further include (not shown in the figure): a first generation submodule, configured to take the original features as nodes and, according to the relevance between each pair of features, connect the related nodes to generate a maximum spanning tree; a second generation submodule, configured to choose any node in the maximum spanning tree as the root node and generate a topological tree on the basis of the maximum spanning tree; and an adding submodule, configured to add connections between nodes without changing the topological structure of the topological tree, so that each node can be connected to more than two and no more than 100 other nodes, thereby building the feature tree.
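A maximum spanning tree can be generated with Kruskal's algorithm run on edges sorted by descending weight. The text does not name an algorithm, so the choice of Kruskal's algorithm in the Python sketch below is an assumption, as are the name maximum_spanning_tree and the sample relevance scores.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm on descending weights.

    nodes: iterable of node names (the original features).
    weighted_edges: iterable of (weight, u, v) with weight = relevance score.
    Returns the list of (u, v) edges kept in the maximum spanning tree.
    """
    parent = {node: node for node in nodes}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(weighted_edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # keep the edge only if it joins two components
            parent[root_u] = root_v
            tree.append((u, v))
    return tree

nodes = ["dress", "summer", "lace", "white"]
weighted_edges = [
    (0.9, "dress", "lace"),
    (0.8, "dress", "summer"),
    (0.7, "lace", "white"),
    (0.2, "summer", "lace"),
    (0.1, "summer", "white"),
]
mst = maximum_spanning_tree(nodes, weighted_edges)
```

The resulting undirected edges are what the second generation submodule would then orient from a chosen root.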
According to an embodiment of this application, the first acquisition module 520 may also include (not shown in the figure): a filtering submodule, configured to filter the original features according to the relevance between each pair of features.
According to an embodiment of this application, the second acquisition module 530 may further include (not shown in the figure): a set acquisition submodule, configured to obtain from the feature tree the original feature set corresponding to each data object, the original feature set including original feature pairs in which original features are connected and single original features that have no connection with the other original features; and a category distribution statistics submodule, configured to count, according to the original feature set and the data object category corresponding to each data object, the total number of times each original feature pair or original feature in the original feature sets appears across all categories and the number of times it appears under each category, so as to obtain the feature-category probability distribution.
According to an embodiment of this application, the apparatus 500 may also include a filtering module (not shown in the figure), configured to filter the object features.
According to an embodiment of this application, the category determination module 540 may further include (not shown in the figure): a score summation submodule, configured to add up, category by category, the scores corresponding to the category probability distributions of the features in the feature set; a sorting submodule, configured to sort the categories by score; and an output submodule, configured to output the sorted categories as the predicted category set.
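The three submodules can be pictured with a short Python sketch. It assumes the "score" of a category for a feature is simply that feature's probability for the category; the text does not define the score, so this choice and the name rank_categories are hypothetical.

```python
from collections import defaultdict

def rank_categories(distributions):
    """Add up scores per category and sort categories by total score.

    distributions: list of dicts (category -> score), one per feature or
    feature pair found for the data object to be predicted.
    """
    scores = defaultdict(float)
    for dist in distributions:
        for category, p in dist.items():
            scores[category] += p  # summation by category
    # Highest total score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranked = rank_categories([
    {"dresses": 0.7, "tops": 0.3},
    {"dresses": 0.5, "skirts": 0.5},
])
predicted_categories = [category for category, _ in ranked]
```

Truncating the sorted list to the top few entries would give a shorter predicted category set if only the best candidates are wanted.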
This concludes the description of the apparatus for predicting the category of a data object according to one embodiment of this application. The processing of the apparatus described above corresponds to that of the method for predicting the category of a data object described earlier. For specific details, refer to that method; they are not repeated here.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include computer-readable media in the form of volatile memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape and disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media, such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, data object, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, method, data object, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, data object, or device that includes the element.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a device, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The above is merely an embodiment of this application and is not intended to limit it. For those skilled in the art, this application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within the scope of the claims of this application.
Claims (18)
1. A method for predicting the category of a data object, characterized by comprising:
extracting at least one object feature from a data object to be predicted, and performing part-of-speech tagging on the at least one object feature;
obtaining, according to the object feature and the part-of-speech tagging result, a feature set from a feature tree built in advance from existing data objects in a database and the corresponding data object categories, the feature set including object feature pairs in which object features are connected and single object features that have no connection with the other object features;
obtaining, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution computed in advance from the existing data objects in the database, the corresponding data object categories, and the feature tree; and
determining, according to the category probability distributions, a predicted category set of the data object to be predicted.
2. The method according to claim 1, characterized in that the step of building the feature tree in advance from the existing data objects in the database and the corresponding data object categories further comprises:
extracting at least one original feature from the existing data objects in the database;
determining the relevance between each pair of features among the original features; and
building, according to the relevance between each pair of features, the feature tree whose nodes are the original features.
3. The method according to claim 2, characterized in that the step of determining the relevance between each pair of features among the original features further comprises:
determining the relevance between each pair of features according to the mutual information of each pair of features among the original features under different categories.
4. The method according to claim 3, characterized in that the step of determining the relevance between each pair of features according to the mutual information of each pair of features among the original features under different categories further comprises:
computing the mutual information of each pair of features under each category according to the category distribution of the pair of features among the original features and the individual category distributions of the two features;
adding up the mutual information of the pair of features under each category to obtain the total mutual information of the pair of features; and
determining the relevance between the pair of features according to the total mutual information.
5. The method according to claim 2, characterized in that the step of building, according to the relevance between each pair of features, the feature tree whose nodes are the original features further comprises:
taking the original features as nodes and, according to the relevance between each pair of features, connecting the related nodes to generate a maximum spanning tree;
choosing any node in the maximum spanning tree as the root node and generating a topological tree on the basis of the maximum spanning tree; and
adding connections between nodes without changing the topological structure of the topological tree, so that each node can be connected to more than two and no more than 100 other nodes, thereby building the feature tree.
6. The method according to claim 2, characterized in that, after the step of determining the relevance between each pair of features and before the step of building the feature tree whose nodes are the original features, the method further comprises:
filtering the original features according to the relevance between each pair of features.
7. The method according to any one of claims 1-6, characterized in that the step of computing the feature-category probability distribution in advance from the existing data objects in the database, the corresponding data object categories, and the feature tree further comprises:
obtaining from the feature tree the original feature set corresponding to each data object, the original feature set including original feature pairs in which original features are connected and single original features that have no connection with the other original features; and
counting, according to the original feature set and the data object category corresponding to each data object, the total number of times each original feature pair or original feature in the original feature sets appears across all categories and the number of times it appears under each category, so as to obtain the feature-category probability distribution.
8. The method according to any one of claims 1-6, characterized in that, after the step of extracting at least one object feature from the data object to be predicted and before the step of obtaining, according to the object feature, the feature set from the feature tree built in advance from the existing data objects in the database and the corresponding data object categories, the method further comprises:
filtering the object features.
9. The method according to any one of claims 1-6, characterized in that the step of determining, according to the category probability distributions, the predicted category set of the data object to be predicted further comprises:
adding up, category by category, the scores corresponding to the category probability distributions of the features in the feature set;
sorting the categories by score; and
outputting the sorted categories as the predicted category set.
10. An apparatus for predicting the category of a data object, characterized by comprising:
a feature extraction module, configured to extract at least one object feature from a data object to be predicted and to perform part-of-speech tagging on the at least one object feature;
a first acquisition module, configured to obtain, according to the object feature and the part-of-speech tagging result, a feature set from a feature tree built in advance from existing data objects in a database and the corresponding data object categories, the feature set including object feature pairs in which object features are connected and single object features that have no connection with the other object features;
a second acquisition module, configured to obtain, according to the feature set, the category probability distribution corresponding to each object feature pair or object feature in the feature set from a feature-category probability distribution computed in advance from the existing data objects in the database, the corresponding data object categories, and the feature tree; and
a category determination module, configured to determine, according to the category probability distributions, a predicted category set of the data object to be predicted.
11. The apparatus according to claim 10, characterized in that the first acquisition module further comprises:
an original feature extraction submodule, configured to extract at least one original feature from the existing data objects in the database;
a relevance determination submodule, configured to determine the relevance between each pair of features among the original features; and
a feature tree building submodule, configured to build, according to the relevance between each pair of features, the feature tree whose nodes are the original features.
12. The apparatus according to claim 11, characterized in that the relevance determination submodule determines the relevance between each pair of features according to the mutual information of each pair of features among the original features under different categories.
13. The apparatus according to claim 12, characterized in that the relevance determination submodule further comprises:
a statistics submodule, configured to compute the mutual information of each pair of features under each category according to the category distribution of the pair of features among the original features and the individual category distributions of the two features;
a summation submodule, configured to add up the mutual information of the pair of features under each category to obtain the total mutual information of the pair of features; and
a determination submodule, configured to determine the relevance between the pair of features according to the total mutual information.
14. The apparatus according to claim 11, characterized in that the feature tree building submodule further comprises:
a first generation submodule, configured to take the original features as nodes and, according to the relevance between each pair of features, connect the related nodes to generate a maximum spanning tree;
a second generation submodule, configured to choose any node in the maximum spanning tree as the root node and generate a topological tree on the basis of the maximum spanning tree; and
an adding submodule, configured to add connections between nodes without changing the topological structure of the topological tree, so that each node can be connected to more than two and no more than 100 other nodes, thereby building the feature tree.
15. The apparatus according to claim 11, characterized in that the first acquisition module further comprises:
a filtering submodule, configured to filter the original features according to the relevance between each pair of features.
16. The apparatus according to any one of claims 10-15, characterized in that the second acquisition module further comprises:
a set acquisition submodule, configured to obtain from the feature tree the original feature set corresponding to each data object, the original feature set including original feature pairs in which original features are connected and single original features that have no connection with the other original features; and
a category distribution statistics submodule, configured to count, according to the original feature set and the data object category corresponding to each data object, the total number of times each original feature pair or original feature in the original feature sets appears across all categories and the number of times it appears under each category, so as to obtain the feature-category probability distribution.
17. The apparatus according to any one of claims 10-15, characterized by further comprising:
a filtering module, configured to filter the object features.
18. The apparatus according to any one of claims 10-15, characterized in that the category determination module further comprises:
a score summation submodule, configured to add up, category by category, the scores corresponding to the category probability distributions of the features in the feature set;
a sorting submodule, configured to sort the categories by score; and
an output submodule, configured to output the sorted categories as the predicted category set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310542419.5A CN104615605B (en) | 2013-11-05 | 2013-11-05 | Method and apparatus for predicting the category of a data object |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104615605A (en) | 2015-05-13 |
CN104615605B (en) | 2018-07-24 |
Family
ID=53150055
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310542419.5A Active CN104615605B (en) | 2013-11-05 | 2013-11-05 | The method and apparatus of classification for prediction data object |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104615605B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543188A (en) * | 2018-11-23 | 2019-03-29 | 珠海格力电器股份有限公司 | A kind of method of mapping, device, server and readable storage medium storing program for executing |
CN110008240A (en) * | 2019-04-15 | 2019-07-12 | 重庆天蓬网络有限公司 | A kind of method and system for extracting unique object in set |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103245861A (en) * | 2013-05-03 | 2013-08-14 | 云南电力试验研究院(集团)有限公司电力研究院 | Transformer fault diagnosis method based on Bayesian network |
Non-Patent Citations (3)
Title |
---|
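Comparing Bayesian Network Classifiers; Jie Cheng; Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence; 1999-12-31; full text * |
Research and Implementation of Bayesian Network Structure Learning and Classifiers Based on Dependency Analysis; Guan Jinghua (关菁华); China Master's Theses Full-text Database, Information Science and Technology; 2005-06-15; p. 13, pp. 28-33 * |
Extended Tree-Augmented Naive Bayes Classifier; Li Xusheng (李旭升) et al.; Pattern Recognition and Artificial Intelligence; 2006-08-30; full text * |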
Also Published As
Publication number | Publication date |
---|---|
CN104615605A (en) | 2015-05-13 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||