CN109492093A - Text classification method and electronic device based on a Gaussian mixture model and the EM algorithm - Google Patents

Text classification method and electronic device based on a Gaussian mixture model and the EM algorithm

Info

Publication number
CN109492093A
CN109492093A
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811159037.3A
Other languages
Chinese (zh)
Inventor
金戈
徐亮
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811159037.3A priority Critical patent/CN109492093A/en
Publication of CN109492093A publication Critical patent/CN109492093A/en
Withdrawn legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of big data analysis and discloses a text classification method based on a Gaussian mixture model and the EM (expectation-maximization) algorithm, applied to an electronic device and comprising the following steps: step S1, preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples; step S2, constructing a Gaussian mixture model based on the EM algorithm; step S3, initializing the parameters of the Gaussian mixture model according to the labeled training samples; step S4, training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model; step S5, classifying the text to be classified with the text classification model. The invention performs semi-supervised learning on the training samples, reduces the dependence on the amount of labeled training data, uses the unlabeled training samples to improve the precision of the text classification model, and thereby improves the accuracy of text classification. The invention also discloses an electronic device.

Description

Text classification method and electronic device based on a Gaussian mixture model and the EM algorithm
Technical field
The present invention relates to the technical field of big data analysis, and more particularly to a text classification method and an electronic device based on a Gaussian mixture model and the EM algorithm.
Background technique
Text classification is mainly used for information retrieval, machine translation, automatic summarization, information filtering and the like. With the development of information technology, data grows explosively and is characterized by high dimensionality and large volume. A text classification model needs a large number of labeled samples for training, but the information that labeled samples can provide may be subjective and limited, while unlabeled samples may contain rich information about the distribution of texts. At present, supervised learning models are used to classify text, but the precision of a supervised learning model depends on the amount of labeled data. The existing naive Bayes algorithm is simple and efficient, has lower time complexity and higher efficiency than other classification algorithms, and is widely used in classification tasks; however, when the naive Bayes algorithm processes large-scale text classification data, its accuracy likewise depends on the labeled training data, so the precision of the trained model depends on the labeled training samples, the precision of the trained model is low, and the classification effect suffers.
Summary of the invention
The present invention provides a text classification method and an electronic device based on a Gaussian mixture model and the EM algorithm, so as to solve the problem that the precision of the trained model depends heavily on labeled training samples and the classification effect suffers, thereby improving the accuracy of text classification by improving the precision of the model.
To achieve the above objects, one aspect of the present invention provides a text classification method based on a Gaussian mixture model and the EM algorithm, applied to an electronic device and comprising the following steps:
Step S1, preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
Step S2, constructing a Gaussian mixture model based on the EM algorithm;
Step S3, initializing the parameters of the Gaussian mixture model according to the labeled training samples;
Step S4, training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model;
Step S5, classifying the text to be classified with the text classification model.
Preferably, the step S4 includes:
S41, substituting the initialized Gaussian mixture model parameters into the EM iterative equations;
S42, obtaining, through the E step of the EM iterative equations, the predicted values of the unlabeled training samples and the corresponding predicted labels, introducing the predicted labels into the training set, and updating the training set;
S43, using the updated training set, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations, thereby completing one iteration;
S44, judging whether the training of the Gaussian mixture model meets a termination condition; if so, outputting the text classification model; if not, returning to step S42 and continuing to train the parameters of the Gaussian mixture model, wherein the termination condition includes a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations exceeds a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two consecutive iterations is less than a set target value.
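As a rough illustration only, the semi-supervised EM loop of steps S41-S44 can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: diagonal covariances are assumed for numerical simplicity, and all names (`gaussian_pdf`, `e_step`, `m_step`, `train`) are hypothetical.

```python
import numpy as np

def gaussian_pdf(X, mu, var):
    # Diagonal-covariance Gaussian density N(x | mu, diag(var)), evaluated row-wise.
    return np.prod(np.exp(-0.5 * (X - mu) ** 2 / var) / np.sqrt(2 * np.pi * var), axis=-1)

def e_step(X, pi, mu, var):
    # gamma[j, i] = pi_i N(x_j | mu_i, var_i) / sum_k pi_k N(x_j | mu_k, var_k)
    dens = np.stack([pi[i] * gaussian_pdf(X, mu[i], var[i]) for i in range(len(pi))], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(X, gamma):
    nk = gamma.sum(axis=0)                        # effective sample count per class
    pi = nk / len(X)                              # mixing coefficients
    mu = (gamma.T @ X) / nk[:, None]              # class means
    var = np.stack([(gamma[:, i, None] * (X - mu[i]) ** 2).sum(axis=0) / nk[i]
                    for i in range(len(nk))]) + 1e-6   # class variances (floored)
    return pi, mu, var

def train(X_lab, y_lab, X_unlab, max_iter=50, tol=1e-6):
    X = np.vstack([X_lab, X_unlab])
    m = int(y_lab.max()) + 1
    onehot = np.eye(m)[y_lab]
    pi, mu, var = m_step(X_lab, onehot)           # S3: initialize from labeled samples
    prev = None
    for _ in range(max_iter):                     # S41: iterate the EM equations
        gamma = e_step(X, pi, mu, var)            # S42: E step predicts labels
        gamma[:len(X_lab)] = onehot               # labeled samples keep their labels
        pi, mu, var = m_step(X, gamma)            # S43: M step updates the parameters
        if prev is not None and np.abs(gamma - prev).max() < tol:
            break                                 # S44: second termination condition
        prev = gamma
    return pi, mu, var
```

With a few labeled points per class and many unlabeled points, the loop recovers both cluster means, which is the semi-supervised effect the method relies on.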
Preferably, in the step S2, the Gaussian mixture model based on the EM algorithm is given by:
p(x) = Σ_{i=1}^{m} π_i · N(x | μ_i, Σ_i)
wherein i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, π denotes the mixing coefficients of the Gaussian mixture model, μ denotes the mean vector of the feature vector x, Σ denotes a covariance matrix, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, N(x | μ_i, Σ_i) denotes the probability that training sample x belongs to the i-th class of text given μ_i and Σ_i, and p denotes the probability density of training sample x.
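A mixture density of this form can be evaluated directly from the symbol definitions above. The sketch below is illustrative only (function names are hypothetical, not from the patent):

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    # Density N(x | mu, cov) of a multivariate normal, written out explicitly.
    d = len(mu)
    diff = np.asarray(x) - np.asarray(mu)
    quad = diff @ np.linalg.solve(cov, diff)
    return float(np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov)))

def gmm_density(x, pi, mu, cov):
    # p(x) = sum_i pi_i * N(x | mu_i, Sigma_i)
    return sum(p * mvn_pdf(x, m, c) for p, m, c in zip(pi, mu, cov))
```

For a single standard-normal component in two dimensions, the density at the mean is 1/(2π), which gives a quick sanity check.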
Preferably, in step S3, the parameters of the Gaussian mixture model include μ_i, Σ_i and π_i, and the initialization solves the parameters of the Gaussian mixture model as follows:
γ_ij = 1 if the j-th labeled training sample belongs to class i, and γ_ij = 0 otherwise, for j = 1, …, l
μ_i = Σ_{j=1}^{l} γ_ij · x_j / Σ_{j=1}^{l} γ_ij
Σ_i = Σ_{j=1}^{l} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{l} γ_ij
π_i = Σ_{j=1}^{l} γ_ij / l
wherein j denotes the index of a training sample, l denotes the number of labeled training samples, n denotes the total number of training samples, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, x_j denotes the feature vector of the j-th training sample, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
Preferably, the E step of the EM iterative equations computes the predicted values of the unlabeled training samples as follows:
γ_ij = π_i · N(x_j | μ_i, Σ_i) / Σ_{k=1}^{m} π_k · N(x_j | μ_k, Σ_k)
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, x_j denotes the feature vector of the j-th training sample, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, N(x_j | μ_i, Σ_i) denotes the probability that the j-th training sample belongs to the i-th class of text given μ_i and Σ_i, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
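The E step above is the standard responsibility computation; a vectorized sketch (full covariances, hypothetical function name):

```python
import numpy as np

def e_step(X, pi, mu, cov):
    # gamma_ij = pi_i N(x_j|mu_i,Sigma_i) / sum_k pi_k N(x_j|mu_k,Sigma_k)
    n, d = X.shape
    m = len(pi)
    dens = np.empty((n, m))
    for i in range(m):
        diff = X - mu[i]
        quad = np.einsum('jd,de,je->j', diff, np.linalg.inv(cov[i]), diff)
        dens[:, i] = pi[i] * np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov[i]))
    return dens / dens.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to 1, i.e. it is a probability distribution over the m text categories for that sample.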
Preferably, the M step of the EM iterative equations updates the parameters of the Gaussian mixture model as follows:
μ_i = Σ_{j=1}^{n} γ_ij · x_j / Σ_{j=1}^{n} γ_ij
Σ_i = Σ_{j=1}^{n} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{n} γ_ij
π_i = Σ_{j=1}^{n} γ_ij / n
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the total number of training samples, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, x_j denotes the feature vector of the j-th training sample, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
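The M-step updates are weighted maximum-likelihood estimates; a direct sketch (hypothetical function name):

```python
import numpy as np

def m_step(X, gamma):
    # mu_i    = sum_j gamma_ij x_j / sum_j gamma_ij
    # Sigma_i = sum_j gamma_ij (x_j - mu_i)(x_j - mu_i)^T / sum_j gamma_ij
    # pi_i    = sum_j gamma_ij / n
    n, m = gamma.shape
    nk = gamma.sum(axis=0)
    mu = (gamma.T @ X) / nk[:, None]
    cov = np.stack([
        (gamma[:, i, None, None] *
         np.einsum('jd,je->jde', X - mu[i], X - mu[i])).sum(axis=0) / nk[i]
        for i in range(m)])
    pi = nk / n
    return pi, mu, cov
```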
Preferably, classifying the text to be classified with the text classification model includes:
preprocessing the data text to be classified and converting the data text into word vectors according to a word-vector library;
obtaining the feature vector corresponding to the data text from its word vectors, as the input of the text classification model;
inputting the feature vector corresponding to the data text into the text classification model;
outputting, by the text classification model, the probability that the data text to be classified belongs to each text category, wherein the text category corresponding to the largest probability value is the text category to which the data text to be classified belongs.
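The final classification step above can be sketched as a per-class posterior followed by an arg-max (hypothetical function name, not the patent's implementation):

```python
import numpy as np

def classify(x, pi, mu, cov):
    # Per-class probabilities for one feature vector, then the arg-max class.
    d = len(x)
    dens = np.array([
        pi[i] * np.exp(-0.5 * (x - mu[i]) @ np.linalg.solve(cov[i], x - mu[i]))
        / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov[i]))
        for i in range(len(pi))])
    probs = dens / dens.sum()
    return probs, int(np.argmax(probs))
```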
To achieve the above objects, another aspect of the present invention provides an electronic device, comprising: a processor; and a memory containing a text classification program, wherein the processor executes the text classification program to realize the following steps:
preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
constructing a Gaussian mixture model based on the EM algorithm;
initializing the parameters of the Gaussian mixture model according to the labeled training samples;
training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model;
classifying the text to be classified with the text classification model.
Preferably, the processor trains the parameters of the Gaussian mixture model with the EM algorithm by:
substituting the initialized Gaussian mixture model parameters into the EM iterative equations;
obtaining, through the E step of the EM iterative equations, the predicted values of the unlabeled training samples and the corresponding predicted labels, introducing the predicted labels into the training set, and updating the training set;
using the updated training set, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations, thereby completing one iteration;
judging whether the training of the Gaussian mixture model meets a termination condition; if so, outputting the text classification model; if not, continuing to train the parameters of the Gaussian mixture model, wherein the termination condition includes a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations exceeds a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two consecutive iterations is less than a set target value.
Preferably, the E step of the EM iterative equations computes the predicted values of the unlabeled training samples as follows:
γ_ij = π_i · N(x_j | μ_i, Σ_i) / Σ_{k=1}^{m} π_k · N(x_j | μ_k, Σ_k)
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, x_j denotes the feature vector of the j-th training sample, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, N(x_j | μ_i, Σ_i) denotes the probability that the j-th training sample belongs to the i-th class of text given μ_i and Σ_i, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category;
and the M step of the EM iterative equations updates the parameters of the Gaussian mixture model as follows:
μ_i = Σ_{j=1}^{n} γ_ij · x_j / Σ_{j=1}^{n} γ_ij
Σ_i = Σ_{j=1}^{n} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{n} γ_ij
π_i = Σ_{j=1}^{n} γ_ij / n
wherein n denotes the total number of training samples.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The text classification method and electronic device based on a Gaussian mixture model and the EM algorithm of the present invention establish a Gaussian mixture model based on the EM algorithm and use the model together with the EM algorithm to predict labels for the unlabeled data set, thereby realizing semi-supervised learning on the training samples, reducing the dependence of the trained model on the labeled training data set, making full use of the unlabeled data to further improve the precision of the model, and effectively classifying text with improved accuracy.
Description of the drawings
Fig. 1 is a schematic flowchart of the text classification method of the present invention;
Fig. 2 is a schematic block diagram of the text classification program of the present invention.
The realization of the objects, the functional characteristics and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
Embodiments of the present invention are described below with reference to the accompanying drawings. Those skilled in the art will recognize that the described embodiments can be modified in various different ways, or combinations thereof, without departing from the spirit and scope of the present invention. Therefore, the drawings and the description are illustrative in nature and are intended only to explain the present invention, not to limit the scope of protection of the claims. In addition, in the present specification, the drawings are not drawn to scale, and identical reference numerals denote identical parts.
Fig. 1 is a schematic flowchart of the text classification method of the present invention. As shown in Fig. 1, the text classification method based on a Gaussian mixture model and the EM algorithm of the present invention is applied to an electronic device and comprises the following steps:
Step S1, preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples, wherein a labeled training sample is a training sample that has a class label corresponding to it, an unlabeled training sample is a training sample that has no corresponding class label, and "label" is used hereinafter as short for the class label to which a training sample belongs;
Step S2, constructing a Gaussian mixture model based on the EM algorithm;
Step S3, initializing the parameters of the Gaussian mixture model according to the labeled training samples;
Step S4, training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model;
Step S5, classifying the text to be classified with the text classification model.
By establishing a Gaussian mixture model based on the EM algorithm, the present invention uses the Gaussian mixture model and the EM algorithm to predict labels for the unlabeled data set and obtain a text classification model, realizing semi-supervised learning on the training samples, reducing the dependence on the amount of labeled training data, using the unlabeled training samples to improve the precision of the text classification model, and improving the accuracy of text classification.
Preferably, the step S4 includes:
S41, substituting the initialized Gaussian mixture model parameters into the EM iterative equations;
S42, obtaining, through the E step of the EM iterative equations, the predicted values of the unlabeled training samples and the corresponding predicted labels, introducing the predicted labels into the training set, and updating the training set;
S43, using the updated training set, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations, thereby completing one iteration;
S44, judging whether the training of the Gaussian mixture model meets a termination condition; if so, outputting the text classification model; if not, returning to the step S42 and continuing to train the parameters of the Gaussian mixture model, wherein the termination condition includes a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations exceeds a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two consecutive iterations is less than a set target value.
The labels of the unlabeled training samples are updated through the E step, and the parameters of the Gaussian mixture model are updated through the M step, so that the unlabeled training samples are used for model learning and the precision of the model is improved.
In one embodiment of the present invention, the step S1 includes: constructing a word-vector library; performing word segmentation, word-frequency statistics and deduplication on the data set texts, and converting the data set texts into word vectors according to the word-vector library; selecting training samples, including labeled training samples and unlabeled training samples, from the existing data set texts, and obtaining the feature vector of each training sample from its corresponding word vectors; and constructing the training set from the feature vectors of the training samples and the corresponding labels. For example, n data texts are selected from the existing data set texts as training samples, and their feature vectors and labels are obtained to construct the training set, wherein the n training samples include l labeled training samples and u unlabeled training samples, so the constructed training set is D = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l), x_{l+1}, x_{l+2}, …, x_{l+u}}, where x denotes the feature vector of a training sample and y denotes the label of a training sample.
It should be noted that various word-vector models, for example the Word2Vec model, the CBOW model and the like, can be used in the present invention to convert the data set texts into word vectors.
In the processing of the data set texts, deduplication is performed in order to delete duplicate keywords, so as to prevent texts belonging to different categories from containing identical keywords, which would affect the classification result.
In one embodiment of the present invention, the mean of the word vectors corresponding to a training sample is taken per vector dimension to obtain the feature vector of the training sample.
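The per-dimension averaging of word vectors can be sketched as follows. The word-vector table and its entries here are hypothetical stand-ins for a Word2Vec/CBOW vocabulary:

```python
import numpy as np

# Toy word-vector table standing in for a Word2Vec/CBOW vocabulary (made-up values).
WORD_VECTORS = {
    "finance": np.array([1.0, 0.0]),
    "stock":   np.array([3.0, 2.0]),
}

def doc_feature(tokens, table):
    # Mean of the word vectors, taken per vector dimension, over tokens found in the table.
    vecs = [table[t] for t in tokens if t in table]
    return np.mean(vecs, axis=0)
```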
Preferably, in the step S2, the Gaussian mixture model based on the EM algorithm is given by:
p(x) = Σ_{i=1}^{m} π_i · N(x | μ_i, Σ_i)
wherein i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, π denotes the mixing coefficients of the Gaussian mixture model, μ denotes the mean vector of the feature vector x, Σ denotes a covariance matrix, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, N(x | μ_i, Σ_i) denotes the probability that training sample x belongs to the i-th class of text given μ_i and Σ_i, and p denotes the probability density of training sample x.
Preferably, in step S3, the parameters of the Gaussian mixture model, including μ_i, Σ_i and π_i, are initialized according to the labeled training samples, and the initial parameters of the Gaussian mixture model are solved as follows:
γ_ij = 1 if the j-th labeled training sample belongs to class i, and γ_ij = 0 otherwise, for j = 1, …, l
μ_i = Σ_{j=1}^{l} γ_ij · x_j / Σ_{j=1}^{l} γ_ij
Σ_i = Σ_{j=1}^{l} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{l} γ_ij
π_i = Σ_{j=1}^{l} γ_ij / l
wherein j denotes the index of a training sample, l denotes the number of labeled training samples, n denotes the total number of training samples, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, x_j denotes the feature vector of the j-th training sample, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
Preferably, in the step S42, the E step of the EM iterative equations computes the predicted values of the unlabeled training samples as follows:
γ_ij = π_i · N(x_j | μ_i, Σ_i) / Σ_{k=1}^{m} π_k · N(x_j | μ_k, Σ_k)
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, x_j denotes the feature vector of the j-th training sample, N(x_j | μ_i, Σ_i) denotes the probability that the j-th training sample belongs to the i-th class of text given μ_i and Σ_i, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
The γ_ij obtained by the E step are the predicted values of each unlabeled training sample. According to the sizes of the γ_ij, the category with the largest γ_ij determines the text category to which the j-th unlabeled training sample belongs; this text category is taken as the predicted label of the unlabeled training sample, and the predicted labels are introduced into the training set so that every training sample in the training set carries a label, and the training of the Gaussian mixture model parameters then proceeds on the basis of the updated training set. For example, given the initial parameters μ_i, Σ_i, π_i of the Gaussian mixture model obtained in step S3 and the unlabeled training samples {x_{l+1}, x_{l+2}, …, x_{l+u}} in the training set, the prediction result of the E step yields the predicted labels {y_{l+1}, y_{l+2}, …, y_{l+u}} corresponding to the unlabeled training samples {x_{l+1}, x_{l+2}, …, x_{l+u}}; introducing the predicted labels into the training set gives the updated training set D′ = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l), (x_{l+1}, y_{l+1}), (x_{l+2}, y_{l+2}), …, (x_{l+u}, y_{l+u})}.
Preferably, in the step S43, the M step of the EM iterative equations updates the parameters of the Gaussian mixture model as follows:
μ_i = Σ_{j=1}^{n} γ_ij · x_j / Σ_{j=1}^{n} γ_ij
Σ_i = Σ_{j=1}^{n} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{n} γ_ij
π_i = Σ_{j=1}^{n} γ_ij / n
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the total number of training samples, x_j denotes the feature vector of the j-th training sample, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category. For the labeled training samples, i.e. j ∈ {1, 2, …, l}, γ_ij takes the value 1 when the category of the training sample corresponds to label i and the value 0 otherwise; for example, for j ∈ {1, 2, …, l}, if only the 1st and 2nd training samples belong to the i-th class, then γ_i1 = 1, γ_i2 = 1, and the remaining γ_ij = 0 for j ∈ {3, 4, …, l}. For the unlabeled training samples, i.e. j ∈ {l+1, l+2, …, n}, the value of γ_ij is computed by the E-step formula.
The parameters of the Gaussian mixture model are then updated through the M step using the training set D′ = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l), (x_{l+1}, y_{l+1}), (x_{l+2}, y_{l+2}), …, (x_{l+u}, y_{l+u})} updated in step S42.
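The split of γ_ij between labeled samples (hard, one-hot values) and unlabeled samples (soft, E-step values) can be sketched as follows; the function name is hypothetical and the layout assumes the labeled samples come first:

```python
import numpy as np

def mixed_responsibilities(gamma_estep, y_lab, m):
    # Labeled samples (j <= l): gamma_ij is 1 for the true label, 0 elsewhere.
    # Unlabeled samples (j > l): gamma_ij keeps the soft value from the E step.
    gamma = np.array(gamma_estep, dtype=float)
    gamma[:len(y_lab)] = np.eye(m)[y_lab]
    return gamma
```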
The unlabeled training samples are then predicted again through the E step using the updated Gaussian mixture model, the predicted labels of the unlabeled training samples are obtained, the training set is updated again, and the parameters of the Gaussian mixture model are updated again through the M step using the updated training set. The E step and M step of the EM iterative equations are cycled in turn until the training termination condition is met and the parameters of the Gaussian mixture model become stable, at which point the text classification model is output.
Preferably, in step S5, classifying the text to be classified with the text classification model includes:
preprocessing the data text to be classified and converting the data text into word vectors according to the word-vector library;
obtaining the feature vector corresponding to the data text from its word vectors, as the input of the text classification model;
inputting the feature vector corresponding to the data text into the text classification model;
outputting, by the text classification model, the probability that the data text to be classified belongs to each text category, and determining the text category to which the data text belongs according to the sizes of the probability values, wherein the text category corresponding to the largest probability value is the text category to which the data text to be classified belongs.
The text classification method based on a Gaussian mixture model and the EM algorithm of the present invention is applied to an electronic device, which may be a terminal device with computing capability such as a smartphone, a tablet computer or a computer.
The electronic device includes: a processor; and a memory containing a text classification program, wherein the processor executes the text classification program to realize the following steps:
preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
constructing a Gaussian mixture model based on the EM algorithm;
initializing the parameters of the Gaussian mixture model according to the labeled training samples;
training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model;
classifying the text to be classified with the text classification model.
In the present invention, the processor is used to run the program stored in the memory so as to realize text classification; for example, the processor may be a central processing unit, a microprocessor or another data processing chip.
In the present invention, the memory is used to store the program that the processor needs to execute and includes at least one type of readable storage medium, for example a non-volatile storage medium such as a flash memory or a hard disk. The memory may be an internal storage unit of the electronic device or an external memory, such as a plug-in hard disk, a flash card or another type of memory card. The present invention is not limited thereto; the memory may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides the instructions or software program to the processor so that the processor can execute them.
The electronic device of the present invention uses the Gaussian mixture model and the EM algorithm to predict labels for the unlabeled data set and obtain a text classification model, realizing semi-supervised learning on the training samples, reducing the dependence on the amount of labeled training data, using the unlabeled training samples to improve the precision of the text classification model, and improving the accuracy of text classification.
In the present invention, the processor executes the text classification program and trains the parameters of the Gaussian mixture model with the EM algorithm by:
substituting the initialized Gaussian mixture model parameters into the EM iterative equations;
obtaining, through the E step of the EM iterative equations, the predicted values of the unlabeled training samples and the corresponding predicted labels, introducing the predicted labels into the training set, and updating the training set;
using the updated training set, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations, thereby completing one iteration;
judging whether the training of the Gaussian mixture model meets a termination condition; if so, outputting the text classification model; if not, continuing to train the parameters of the Gaussian mixture model, wherein the termination condition includes a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations exceeds a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two consecutive iterations is less than a set target value.
In one embodiment of the present of invention, existing data set text is pre-processed, building training set includes: building Term vector library;Data set text is segmented, word frequency statistics and duplicate removal turn data set text according to the term vector library Turn to term vector;Training sample is selected from data with existing collection text, includes category training sample and without category training sample, The feature vector of training sample is obtained according to the corresponding term vector of training sample;According to the feature vector and correspondence of training sample Category construct training set.For example, selecting n data text therein as training sample from data with existing collection text, obtain The corresponding feature vector of training text and category is taken to construct to form training set, wherein in n training sample, including l have Category training sample and a training set without category training sample, then constructed of u are D={ (x1,y1),(x2,y2),…,(xl,yl), xl+1,xl+2,…,xl+u, x indicates the feature vector of training sample, and y indicates the category of training sample.
It should be noted that various term vector models can be used by converting term vector for data set text in the present invention, For example, Word2Vec model, CBOW model etc..
In the processing to data set text, duplicate removal is in order to delete duplicate keyword, to avoid belonging to a different category Text in include identical keyword, influence classification results.
In one embodiment of the present invention, the feature vector of a training sample is obtained by averaging the word vectors corresponding to the training sample along each vector dimension.
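A minimal sketch of this averaging step is given below. The word vector library here is a hypothetical toy dictionary with made-up words and values; in practice it would come from a Word2Vec or CBOW model as noted above.

```python
# Hypothetical 3-dimensional word vector library (illustrative values only).
word_vectors = {
    "loan":   [0.9, 0.1, 0.0],
    "credit": [0.8, 0.2, 0.1],
    "tennis": [0.0, 0.9, 0.8],
}

def feature_vector(tokens, library):
    """Average the word vectors of the known tokens along each dimension,
    yielding the sample's feature vector as described in the text."""
    vecs = [library[t] for t in tokens if t in library]
    if not vecs:
        raise ValueError("no known tokens in sample")
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# A two-word sample maps to the element-wise mean of its word vectors.
x = feature_vector(["loan", "credit"], word_vectors)
```

The names `word_vectors` and `feature_vector` are assumptions for illustration, not identifiers from the patent.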
Preferably, the Gaussian mixture model based on the EM algorithm is given by the following formula:

p(x) = Σi πi·N(x|μi, ∑i), with the sum running over the text categories i = 1, …, m.
Wherein, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, π denotes the mixing coefficients of the Gaussian mixture model, μ denotes the mean vectors of the feature vectors, ∑ denotes the covariance matrices, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, N(x|μi, ∑i) denotes the probability that training sample x belongs to the i-th text category under the conditions μi and ∑i, and p denotes the conditional probability of training sample x.
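The mixture density p(x) = Σi πi·N(x|μi, ∑i) can be sketched as follows. For brevity the covariance matrices are assumed diagonal, which is a simplification of my own; the patent places no such restriction.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density N(x | mu, diag(var)) of a Gaussian with diagonal covariance."""
    p = 1.0
    for xd, md, vd in zip(x, mu, var):
        p *= math.exp(-(xd - md) ** 2 / (2 * vd)) / math.sqrt(2 * math.pi * vd)
    return p

def mixture_density(x, pis, mus, vars_):
    """p(x) = sum_i pi_i * N(x | mu_i, Sigma_i) over the m components."""
    return sum(pi * gaussian_pdf(x, mu, var)
               for pi, mu, var in zip(pis, mus, vars_))

# Two equally weighted 1-D components centred at 0 and 4:
p = mixture_density([0.0], pis=[0.5, 0.5], mus=[[0.0], [4.0]], vars_=[[1.0], [1.0]])
```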
Preferably, the parameters of the Gaussian mixture model are initialized according to the labeled training samples, the parameters comprising μi, ∑i and πi; the initial parameters of the Gaussian mixture model are solved according to the following formulas:

γij = 1 if the label of the j-th labeled training sample is the i-th category, and γij = 0 otherwise;
μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / l;

with the sums running over the labeled training samples j = 1, …, l.
Wherein, j denotes the index of a training sample, l denotes the number of labeled training samples, n denotes the number of training samples, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
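The initialization from the labeled samples can be sketched as below: γij is the 0/1 label indicator, so μi, ∑i and πi reduce to the per-class mean, per-class variance and class fraction over the l labeled samples. Diagonal covariances are assumed for brevity (my simplification, not the patent's).

```python
def init_params(X_labeled, y_labeled, m):
    """Per-class mean, (diagonal) variance and mixing coefficient
    computed from the l labeled training samples."""
    l = len(X_labeled)
    dim = len(X_labeled[0])
    pis, mus, vars_ = [], [], []
    for i in range(m):
        # gamma_ij = 1 exactly when sample j carries label i
        members = [x for x, y in zip(X_labeled, y_labeled) if y == i]
        mu = [sum(x[d] for x in members) / len(members) for d in range(dim)]
        # floor the variance to keep the Gaussian density well defined
        var = [max(sum((x[d] - mu[d]) ** 2 for x in members) / len(members), 1e-6)
               for d in range(dim)]
        pis.append(len(members) / l)
        mus.append(mu)
        vars_.append(var)
    return pis, mus, vars_
```

`init_params` is a hypothetical helper name chosen for the sketch.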
Preferably, the E step of the EM iterative equations calculates the predicted values of the unlabeled training samples according to the following formula:

γij = πi·N(xj|μi, ∑i) / Σk πk·N(xj|μk, ∑k), with the sum running over the text categories k = 1, …, m.
Wherein, i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, xj denotes the feature vector of the j-th training sample, N(xj|μi, ∑i) denotes the probability that the j-th training sample belongs to the i-th text category under the conditions μi and ∑i, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
The γij obtained by the E step serves as the predicted value of each unlabeled training sample. According to the magnitude of γij, the text category for which γij is largest is determined as the text category to which the j-th unlabeled training sample belongs, this text category is taken as the predicted label of the unlabeled training sample, and the predicted labels are introduced into the training set, so that every training sample in the training set carries a label and the training of the Gaussian mixture model parameters can be carried out on the basis of the updated training set. For example, according to the initial parameters μi, ∑i, πi of the Gaussian mixture model and the unlabeled training samples {xl+1,xl+2,…,xl+u} in the training set, the prediction results of the E step yield the predicted labels {yl+1,yl+2,…,yl+u} corresponding to the unlabeled training samples {xl+1,xl+2,…,xl+u}; introducing the predicted labels into the training set gives the updated training set D′={(x1,y1),(x2,y2),…,(xl,yl),(xl+1,yl+1),(xl+2,yl+2),…,(xl+u,yl+u)}.
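The E step and the argmax label assignment described above can be sketched as follows, again assuming diagonal covariances for brevity:

```python
import math

def gaussian_pdf(x, mu, var):
    """Density N(x | mu, diag(var)) with diagonal covariance."""
    p = 1.0
    for xd, md, vd in zip(x, mu, var):
        p *= math.exp(-(xd - md) ** 2 / (2 * vd)) / math.sqrt(2 * math.pi * vd)
    return p

def e_step(x, pis, mus, vars_):
    """Responsibilities gamma_i for one sample:
    gamma_ij = pi_i N(x_j|mu_i,S_i) / sum_k pi_k N(x_j|mu_k,S_k)."""
    numer = [pi * gaussian_pdf(x, mu, var)
             for pi, mu, var in zip(pis, mus, vars_)]
    total = sum(numer)
    return [v / total for v in numer]

def predict_label(x, pis, mus, vars_):
    """Predicted label of an unlabeled sample: the category of largest gamma."""
    gamma = e_step(x, pis, mus, vars_)
    return max(range(len(gamma)), key=gamma.__getitem__)
```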
Preferably, the M step of the EM iterative equations updates the parameters of the Gaussian mixture model according to the following formulas:

μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / n;

with the sums running over all training samples j = 1, …, n.
Wherein, i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the number of training samples, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category. For the labeled training samples, i.e. j ∈ {1,2,…,l}, γij takes the value 1 when the category of the training sample corresponds to its label and 0 when it does not; for example, for j ∈ {1,2,…,l}, if only the 1st and 2nd training samples belong to the i-th category, then γi1 = 1, γi2 = 1 and the remaining γij = 0 for j ∈ {3,4,…,l}. For the unlabeled training samples, i.e. j ∈ {l+1, l+2, …, n}, the value of γij is calculated by the E-step formula.
The M step then updates the parameters of the Gaussian mixture model according to the updated training set D′={(x1,y1),(x2,y2),…,(xl,yl),(xl+1,yl+1),(xl+2,yl+2),…,(xl+u,yl+u)}.
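The M-step update can be sketched as below, taking the responsibility matrix gamma[i][j] over all n samples (0/1 entries for the labeled samples, E-step values for the unlabeled ones). Diagonal covariances are again my simplification.

```python
def m_step(X, gamma):
    """Re-estimate mu_i, (diagonal) Sigma_i and pi_i from responsibilities:
    mu_i    = sum_j gamma_ij x_j / sum_j gamma_ij
    Sigma_i = sum_j gamma_ij (x_j - mu_i)^2 / sum_j gamma_ij   (diagonal)
    pi_i    = sum_j gamma_ij / n."""
    m, n, dim = len(gamma), len(X), len(X[0])
    pis, mus, vars_ = [], [], []
    for i in range(m):
        w = sum(gamma[i])  # effective number of samples in class i
        mu = [sum(gamma[i][j] * X[j][d] for j in range(n)) / w
              for d in range(dim)]
        var = [max(sum(gamma[i][j] * (X[j][d] - mu[d]) ** 2
                       for j in range(n)) / w, 1e-6)
               for d in range(dim)]
        pis.append(w / n)
        mus.append(mu)
        vars_.append(var)
    return pis, mus, vars_
```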
The unlabeled training samples are then predicted again by the E step using the updated Gaussian mixture model, the predicted labels of the unlabeled training samples are obtained, the training set is updated again, and the parameters of the Gaussian mixture model are updated again by the M step using the updated training set; the E step and M step of the EM iterative equations are cycled in this way until the training termination condition is satisfied and the parameters of the Gaussian mixture model tend to be stable, at which point the text classification model is output.
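The complete training procedure described above (initialization from the labeled samples, alternating E and M steps, and the two termination conditions) can be sketched end to end as follows. One-dimensional features, diagonal covariances and all function names are illustrative assumptions, not the patent's implementation.

```python
import math

def npdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def train(X, y, l, m, max_iter=50, tol=1e-6):
    """Semi-supervised GMM/EM: X holds n 1-D samples, the first l are labeled
    with y[0..l-1]; m is the number of text categories."""
    n = len(X)
    # Initialization from the l labeled samples.
    pis, mus, vars_ = [], [], []
    for i in range(m):
        cls = [X[j] for j in range(l) if y[j] == i]
        mu = sum(cls) / len(cls)
        var = max(sum((x - mu) ** 2 for x in cls) / len(cls), 1e-6)
        pis.append(len(cls) / l); mus.append(mu); vars_.append(var)
    prev = None
    for _ in range(max_iter):                  # first termination condition
        # E step: responsibilities (fixed 0/1 for the labeled samples).
        gamma = [[0.0] * n for _ in range(m)]
        for j in range(n):
            if j < l:
                gamma[y[j]][j] = 1.0
            else:
                num = [pis[i] * npdf(X[j], mus[i], vars_[i]) for i in range(m)]
                s = sum(num)
                for i in range(m):
                    gamma[i][j] = num[i] / s
        # Second termination condition: change in E-step predictions.
        if prev is not None and max(
                (abs(gamma[i][j] - prev[i][j])
                 for i in range(m) for j in range(l, n)), default=0.0) < tol:
            break
        prev = [row[:] for row in gamma]
        # M step: re-estimate pi_i, mu_i, Sigma_i over all n samples.
        for i in range(m):
            w = sum(gamma[i])
            mus[i] = sum(gamma[i][j] * X[j] for j in range(n)) / w
            vars_[i] = max(sum(gamma[i][j] * (X[j] - mus[i]) ** 2
                               for j in range(n)) / w, 1e-6)
            pis[i] = w / n
    return pis, mus, vars_
```

With two well separated classes the unlabeled points are absorbed into the nearer component within a couple of iterations, which is the stabilization behaviour the text describes.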
Preferably, classifying the text to be classified using the text classification model comprises:
preprocessing the data text to be classified and converting the data text into word vectors according to the word vector library;
obtaining the feature vector corresponding to the data text from the word vectors, as the input of the text classification model;
inputting the feature vector corresponding to the data text into the text classification model; and
outputting, by the text classification model, the probability values that the data text to be classified belongs to each text category, and determining the text category to which the data text to be classified belongs according to the magnitude of the probability values, wherein the text category corresponding to the largest probability value is the text category to which the data text to be classified belongs.
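The classification stage above can be sketched as follows: the feature vector of the new text is scored against each component, the scores are normalized into per-category probability values, and the category with the largest value is returned. One-dimensional features and the name `classify` are assumptions of the sketch.

```python
import math

def npdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(x, pis, mus, vars_):
    """Return (predicted category, per-category probability values)."""
    scores = [pi * npdf(x, mu, var) for pi, mu, var in zip(pis, mus, vars_)]
    total = sum(scores)
    probs = [s / total for s in scores]           # normalized gamma values
    return probs.index(max(probs)), probs

# A sample near the first component is assigned to category 0.
label, probs = classify(1.2, pis=[0.5, 0.5], mus=[0.0, 10.0], vars_=[1.0, 1.0])
```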
In one embodiment of the present invention, the text classification program may be divided into one or more modules, the one or more modules being stored in the memory and executed by the processor to realize the text classification. A module in the present invention refers to a series of computer program instruction segments capable of completing a specific function. Fig. 2 is a module diagram of the text classification program of the present invention; as shown in Fig. 2, the program comprises a training set acquisition module 1, a model construction module 2, an initialization module 3, a model training module 4 and a classification module 5. The functions or operation steps realized by each module are similar to those described above and will not be described in detail here. Illustratively:
the training set acquisition module 1 preprocesses the existing data set texts and constructs the training set, the training set including labeled training samples and unlabeled training samples;
the model construction module 2 constructs the Gaussian mixture model based on the EM algorithm;
the initialization module 3 initializes the parameters of the Gaussian mixture model according to the labeled training samples;
the model training module 4 trains the parameters of the Gaussian mixture model with the EM algorithm to obtain the text classification model; and
the classification module 5 classifies the text to be classified using the text classification model.
Further, the model training module 4 comprises:
a parameter input unit 41, which substitutes the initialized Gaussian mixture model parameters into the EM iterative equations;
a label prediction unit 42, which obtains the predicted values of the unlabeled training samples through the E step of the EM iterative equations, predicts their labels accordingly, introduces the predicted labels into the training set, and updates the training set;
a parameter updating unit 43, which updates the parameters of the Gaussian mixture model through the M step of the EM iterative equations using the updated training set, completing one iteration; and
a judging unit 44, which judges whether the training of the Gaussian mixture model satisfies a termination condition; if the termination condition is satisfied, the text classification model is output, and if not, the E step and M step of the EM iterative equations are cycled to continue training the parameters of the Gaussian mixture model, wherein the termination condition comprises a first termination condition and/or a second termination condition: the first termination condition is that the number of iterations is greater than a set maximum number of iterations, and the second termination condition is that the difference between the predicted values obtained by the E step in two adjacent iterations is less than a set target value.
It should be noted that, in this document, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article or method that includes that element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be realized by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disc) as described above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention and does not limit the scope of the present invention; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of the present invention.

Claims (10)

1. A text classification method based on a Gaussian mixture model and the EM algorithm, applied to an electronic device, characterized in that it comprises the following steps:
step S1, preprocessing existing data set texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
step S2, constructing a Gaussian mixture model based on the EM algorithm;
step S3, initializing the parameters of the Gaussian mixture model according to the labeled training samples;
step S4, training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model; and
step S5, classifying a text to be classified using the text classification model.
2. The text classification method according to claim 1, characterized in that the step S4 comprises:
S41, substituting the initialized Gaussian mixture model parameters into EM iterative equations;
S42, obtaining the predicted values of the unlabeled training samples through the E step of the EM iterative equations, predicting their labels accordingly, introducing the predicted labels into the training set, and updating the training set;
S43, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations using the updated training set, completing one iteration; and
S44, judging whether the training of the Gaussian mixture model satisfies a termination condition; if the termination condition is satisfied, outputting the text classification model, and if not, returning to step S42 to continue training the parameters of the Gaussian mixture model, wherein the termination condition comprises a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations is greater than a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two adjacent iterations is less than a set target value.
3. The text classification method according to claim 1, characterized in that in the step S2, the Gaussian mixture model based on the EM algorithm is given by the following formula:

p(x) = Σi πi·N(x|μi, ∑i), with the sum running over the text categories i = 1, …, m,

wherein i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, π denotes the mixing coefficients of the Gaussian mixture model, μ denotes the mean vectors of the feature vectors, ∑ denotes the covariance matrices, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, N(x|μi, ∑i) denotes the probability that training sample x belongs to the i-th text category under the conditions μi and ∑i, and p denotes the conditional probability of training sample x.
4. The text classification method according to claim 3, characterized in that in step S3, the parameters of the Gaussian mixture model comprise μi, ∑i and πi, and the initialization solves the parameters of the Gaussian mixture model according to the following formulas:

γij = 1 if the label of the j-th labeled training sample is the i-th category, and γij = 0 otherwise;
μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / l;

with the sums running over the labeled training samples j = 1, …, l, wherein j denotes the index of a training sample, l denotes the number of labeled training samples, n denotes the number of training samples, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
5. The text classification method according to claim 2, characterized in that the E step of the EM iterative equations calculates the predicted values of the unlabeled training samples according to the following formula:

γij = πi·N(xj|μi, ∑i) / Σk πk·N(xj|μk, ∑k), with the sum running over the text categories k = 1, …, m,

wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, xj denotes the feature vector of the j-th training sample, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, N(xj|μi, ∑i) denotes the probability that the j-th training sample belongs to the i-th text category under the conditions μi and ∑i, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
6. The text classification method according to claim 5, characterized in that the M step of the EM iterative equations updates the parameters of the Gaussian mixture model according to the following formulas:

μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / n;

with the sums running over all training samples j = 1, …, n, wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the number of training samples, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
7. The text classification method according to claim 1, characterized in that classifying the text to be classified using the text classification model comprises:
preprocessing the data text to be classified and converting the data text into word vectors according to the word vector library;
obtaining the feature vector corresponding to the data text from the word vectors, as the input of the text classification model;
inputting the feature vector corresponding to the data text into the text classification model; and
outputting, by the text classification model, the probability values that the data text to be classified belongs to each text category, the text category corresponding to the largest probability value being the text category to which the data text to be classified belongs.
8. An electronic device, characterized in that the electronic device comprises: a processor; and a memory, the memory including a text classification program, the processor executing the text classification program to realize the following steps:
preprocessing existing data set texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
constructing a Gaussian mixture model based on the EM algorithm;
initializing the parameters of the Gaussian mixture model according to the labeled training samples;
training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model; and
classifying a text to be classified using the text classification model.
9. The electronic device according to claim 8, characterized in that training the parameters of the Gaussian mixture model with the EM algorithm comprises:
substituting the initialized Gaussian mixture model parameters into EM iterative equations;
obtaining the predicted values of the unlabeled training samples through the E step of the EM iterative equations, predicting their labels accordingly, introducing the predicted labels into the training set, and updating the training set;
updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations using the updated training set, completing one iteration; and
judging whether the training of the Gaussian mixture model satisfies a termination condition; if the termination condition is satisfied, outputting the text classification model, and if not, continuing to train the parameters of the Gaussian mixture model, wherein the termination condition comprises a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations is greater than a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two adjacent iterations is less than a set target value.
10. The electronic device according to claim 9, characterized in that the E step of the EM iterative equations calculates the predicted values of the unlabeled training samples according to the following formula:

γij = πi·N(xj|μi, ∑i) / Σk πk·N(xj|μk, ∑k), with the sum running over the text categories k = 1, …, m,

wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, xj denotes the feature vector of the j-th training sample, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, N(xj|μi, ∑i) denotes the probability that the j-th training sample belongs to the i-th text category under the conditions μi and ∑i, and γij denotes the probability value that the j-th training sample belongs to the i-th text category;

and that the M step of the EM iterative equations updates the parameters of the Gaussian mixture model according to the following formulas:

μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / n;

with the sums running over all training samples j = 1, …, n, wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the number of training samples, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
CN201811159037.3A 2018-09-30 2018-09-30 File classification method and electronic device based on gauss hybrid models and EM algorithm Withdrawn CN109492093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811159037.3A CN109492093A (en) 2018-09-30 2018-09-30 File classification method and electronic device based on gauss hybrid models and EM algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811159037.3A CN109492093A (en) 2018-09-30 2018-09-30 File classification method and electronic device based on gauss hybrid models and EM algorithm

Publications (1)

Publication Number Publication Date
CN109492093A true CN109492093A (en) 2019-03-19

Family

ID=65690068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811159037.3A Withdrawn CN109492093A (en) 2018-09-30 2018-09-30 File classification method and electronic device based on gauss hybrid models and EM algorithm

Country Status (1)

Country Link
CN (1) CN109492093A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110400610A (en) * 2019-06-19 2019-11-01 西安电子科技大学 Small sample clinical data classification method and system based on multichannel random forest
CN110400610B (en) * 2019-06-19 2022-04-15 西安电子科技大学 Small sample clinical data classification method and system based on multichannel random forest
CN110457467A (en) * 2019-07-02 2019-11-15 厦门美域中央信息科技有限公司 A kind of information technology file classification method based on gauss hybrid models
CN110363359A (en) * 2019-07-23 2019-10-22 中国联合网络通信集团有限公司 A kind of occupation prediction technique and system
WO2021042556A1 (en) * 2019-09-03 2021-03-11 平安科技(深圳)有限公司 Classification model training method, apparatus and device, and computer-readable storage medium
CN110705592A (en) * 2019-09-03 2020-01-17 平安科技(深圳)有限公司 Classification model training method, device, equipment and computer readable storage medium
CN110705592B (en) * 2019-09-03 2024-05-14 平安科技(深圳)有限公司 Classification model training method, device, equipment and computer readable storage medium
CN111475648A (en) * 2020-03-30 2020-07-31 东软集团股份有限公司 Text classification model generation method, text classification method, device and equipment
CN111475648B (en) * 2020-03-30 2023-11-14 东软集团股份有限公司 Text classification model generation method, text classification device and equipment
CN112100377A (en) * 2020-09-14 2020-12-18 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN112100377B (en) * 2020-09-14 2024-03-29 腾讯科技(深圳)有限公司 Text classification method, apparatus, computer device and storage medium
CN112115268A (en) * 2020-09-28 2020-12-22 支付宝(杭州)信息技术有限公司 Training method and device and classification method and device based on feature encoder
CN112115268B (en) * 2020-09-28 2024-04-09 支付宝(杭州)信息技术有限公司 Training method and device based on feature encoder, and classifying method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190319