CN109492093A - Text classification method and electronic device based on a Gaussian mixture model and the EM algorithm - Google Patents

Text classification method and electronic device based on a Gaussian mixture model and the EM algorithm

Info

Publication number
CN109492093A
CN109492093A
Authority
CN
China
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201811159037.3A
Other languages
Chinese (zh)
Inventor
金戈
徐亮
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN201811159037.3A priority Critical patent/CN109492093A/en
Publication of CN109492093A publication Critical patent/CN109492093A/en
Withdrawn legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of big data analysis and discloses a text classification method based on a Gaussian mixture model and the EM (expectation-maximization) algorithm, applied to an electronic device and comprising the following steps: step S1, preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples; step S2, constructing a Gaussian mixture model based on the EM algorithm; step S3, initializing the parameters of the Gaussian mixture model according to the labeled training samples; step S4, training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model; step S5, classifying the text to be classified with the text classification model. The invention performs semi-supervised learning on the training samples, reduces the dependence on the amount of labeled training data, uses the unlabeled training samples to improve the precision of the text classification model, and thereby improves the accuracy of text classification. The invention also discloses an electronic device.

Description

Text classification method and electronic device based on a Gaussian mixture model and the EM algorithm
Technical field
The present invention relates to the technical field of big data analysis, and more particularly to a text classification method and an electronic device based on a Gaussian mixture model and the EM algorithm.
Background technique
Text classification is mainly used for information retrieval, machine translation, automatic summarization, information filtering and the like. With the development of information technology, data grows explosively and is characterized by high dimensionality and large volume. A text classification model needs a large number of labeled samples for training, but the information that labeled samples can provide may be subjective and limited, while unlabeled samples may contain rich information about the distribution of texts. At present, supervised learning models are used to classify text, but the precision of a supervised learning model depends on the amount of labeled data. The existing naive Bayes algorithm is simple and efficient, has lower time complexity and higher efficiency than other classification algorithms, and is widely used in classification tasks; however, when the naive Bayes algorithm processes large-scale text classification data, its accuracy likewise depends on the labeled training data, so the precision of the trained model depends on the labeled training samples, the precision of the trained model is low, and the classification effect suffers.
Summary of the invention
The present invention provides a text classification method and an electronic device based on a Gaussian mixture model and the EM algorithm, so as to solve the problem that the precision of the trained model depends heavily on labeled training samples and the classification effect suffers, thereby improving the accuracy of text classification by improving the precision of the model.
To achieve the above objects, one aspect of the present invention provides a text classification method based on a Gaussian mixture model and the EM algorithm, applied to an electronic device and comprising the following steps:
Step S1, preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
Step S2, constructing a Gaussian mixture model based on the EM algorithm;
Step S3, initializing the parameters of the Gaussian mixture model according to the labeled training samples;
Step S4, training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model;
Step S5, classifying the text to be classified with the text classification model.
Preferably, the step S4 includes:
S41, substituting the initialized Gaussian mixture model parameters into the EM iterative equations;
S42, obtaining, through the E step of the EM iterative equations, the predicted values of the unlabeled training samples and the corresponding predicted labels, introducing the predicted labels into the training set, and updating the training set;
S43, using the updated training set, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations, thereby completing one iteration;
S44, judging whether the training of the Gaussian mixture model meets a termination condition; if so, outputting the text classification model; if not, returning to step S42 and continuing to train the parameters of the Gaussian mixture model, wherein the termination condition includes a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations exceeds a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two consecutive iterations is less than a set target value.
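As a rough illustration only, the semi-supervised EM loop of steps S41-S44 can be sketched in Python as follows. This is a minimal sketch, not the patented implementation: diagonal covariances are assumed for numerical simplicity, and all names (`gaussian_pdf`, `e_step`, `m_step`, `train`) are hypothetical.

```python
import numpy as np

def gaussian_pdf(X, mu, var):
    # Diagonal-covariance Gaussian density N(x | mu, diag(var)), evaluated row-wise.
    return np.prod(np.exp(-0.5 * (X - mu) ** 2 / var) / np.sqrt(2 * np.pi * var), axis=-1)

def e_step(X, pi, mu, var):
    # gamma[j, i] = pi_i N(x_j | mu_i, var_i) / sum_k pi_k N(x_j | mu_k, var_k)
    dens = np.stack([pi[i] * gaussian_pdf(X, mu[i], var[i]) for i in range(len(pi))], axis=1)
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(X, gamma):
    nk = gamma.sum(axis=0)                        # effective sample count per class
    pi = nk / len(X)                              # mixing coefficients
    mu = (gamma.T @ X) / nk[:, None]              # class means
    var = np.stack([(gamma[:, i, None] * (X - mu[i]) ** 2).sum(axis=0) / nk[i]
                    for i in range(len(nk))]) + 1e-6   # class variances (floored)
    return pi, mu, var

def train(X_lab, y_lab, X_unlab, max_iter=50, tol=1e-6):
    X = np.vstack([X_lab, X_unlab])
    m = int(y_lab.max()) + 1
    onehot = np.eye(m)[y_lab]
    pi, mu, var = m_step(X_lab, onehot)           # S3: initialize from labeled samples
    prev = None
    for _ in range(max_iter):                     # S41: iterate the EM equations
        gamma = e_step(X, pi, mu, var)            # S42: E step predicts labels
        gamma[:len(X_lab)] = onehot               # labeled samples keep their labels
        pi, mu, var = m_step(X, gamma)            # S43: M step updates the parameters
        if prev is not None and np.abs(gamma - prev).max() < tol:
            break                                 # S44: second termination condition
        prev = gamma
    return pi, mu, var
```

With a few labeled points per class and many unlabeled points, the loop recovers both cluster means, which is the semi-supervised effect the method relies on.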
Preferably, in the step S2, the Gaussian mixture model based on the EM algorithm is given by:
p(x) = Σ_{i=1}^{m} π_i · N(x | μ_i, Σ_i)
wherein i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, π denotes the mixing coefficients of the Gaussian mixture model, μ denotes the mean vector of the feature vector x, Σ denotes a covariance matrix, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, N(x | μ_i, Σ_i) denotes the probability that training sample x belongs to the i-th class of text given μ_i and Σ_i, and p denotes the probability density of training sample x.
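A mixture density of this form can be evaluated directly from the symbol definitions above. The sketch below is illustrative only (function names are hypothetical, not from the patent):

```python
import numpy as np

def mvn_pdf(x, mu, cov):
    # Density N(x | mu, cov) of a multivariate normal, written out explicitly.
    d = len(mu)
    diff = np.asarray(x) - np.asarray(mu)
    quad = diff @ np.linalg.solve(cov, diff)
    return float(np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov)))

def gmm_density(x, pi, mu, cov):
    # p(x) = sum_i pi_i * N(x | mu_i, Sigma_i)
    return sum(p * mvn_pdf(x, m, c) for p, m, c in zip(pi, mu, cov))
```

For a single standard-normal component in two dimensions, the density at the mean is 1/(2π), which gives a quick sanity check.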
Preferably, in step S3, the parameters of the Gaussian mixture model include μ_i, Σ_i and π_i, and the initialization solves the parameters of the Gaussian mixture model as follows:
γ_ij = 1 if the j-th labeled training sample belongs to class i, and γ_ij = 0 otherwise, for j = 1, …, l
μ_i = Σ_{j=1}^{l} γ_ij · x_j / Σ_{j=1}^{l} γ_ij
Σ_i = Σ_{j=1}^{l} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{l} γ_ij
π_i = Σ_{j=1}^{l} γ_ij / l
wherein j denotes the index of a training sample, l denotes the number of labeled training samples, n denotes the total number of training samples, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, x_j denotes the feature vector of the j-th training sample, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
Preferably, the E step of the EM iterative equations computes the predicted values of the unlabeled training samples as follows:
γ_ij = π_i · N(x_j | μ_i, Σ_i) / Σ_{k=1}^{m} π_k · N(x_j | μ_k, Σ_k)
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, x_j denotes the feature vector of the j-th training sample, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, N(x_j | μ_i, Σ_i) denotes the probability that the j-th training sample belongs to the i-th class of text given μ_i and Σ_i, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
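The E step above is the standard responsibility computation; a vectorized sketch (full covariances, hypothetical function name):

```python
import numpy as np

def e_step(X, pi, mu, cov):
    # gamma_ij = pi_i N(x_j|mu_i,Sigma_i) / sum_k pi_k N(x_j|mu_k,Sigma_k)
    n, d = X.shape
    m = len(pi)
    dens = np.empty((n, m))
    for i in range(m):
        diff = X - mu[i]
        quad = np.einsum('jd,de,je->j', diff, np.linalg.inv(cov[i]), diff)
        dens[:, i] = pi[i] * np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov[i]))
    return dens / dens.sum(axis=1, keepdims=True)
```

Each row of the returned matrix sums to 1, i.e. it is a probability distribution over the m text categories for that sample.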
Preferably, the M step of the EM iterative equations updates the parameters of the Gaussian mixture model as follows:
μ_i = Σ_{j=1}^{n} γ_ij · x_j / Σ_{j=1}^{n} γ_ij
Σ_i = Σ_{j=1}^{n} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{n} γ_ij
π_i = Σ_{j=1}^{n} γ_ij / n
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the total number of training samples, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, x_j denotes the feature vector of the j-th training sample, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
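The M-step updates are weighted maximum-likelihood estimates; a direct sketch (hypothetical function name):

```python
import numpy as np

def m_step(X, gamma):
    # mu_i    = sum_j gamma_ij x_j / sum_j gamma_ij
    # Sigma_i = sum_j gamma_ij (x_j - mu_i)(x_j - mu_i)^T / sum_j gamma_ij
    # pi_i    = sum_j gamma_ij / n
    n, m = gamma.shape
    nk = gamma.sum(axis=0)
    mu = (gamma.T @ X) / nk[:, None]
    cov = np.stack([
        (gamma[:, i, None, None] *
         np.einsum('jd,je->jde', X - mu[i], X - mu[i])).sum(axis=0) / nk[i]
        for i in range(m)])
    pi = nk / n
    return pi, mu, cov
```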
Preferably, classifying the text to be classified with the text classification model includes:
preprocessing the data text to be classified and converting the data text into word vectors according to a word-vector library;
obtaining the feature vector corresponding to the data text from its word vectors, as the input of the text classification model;
inputting the feature vector corresponding to the data text into the text classification model;
outputting, by the text classification model, the probability that the data text to be classified belongs to each text category, wherein the text category corresponding to the largest probability value is the text category to which the data text to be classified belongs.
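The final classification step above can be sketched as a per-class posterior followed by an arg-max (hypothetical function name, not the patent's implementation):

```python
import numpy as np

def classify(x, pi, mu, cov):
    # Per-class probabilities for one feature vector, then the arg-max class.
    d = len(x)
    dens = np.array([
        pi[i] * np.exp(-0.5 * (x - mu[i]) @ np.linalg.solve(cov[i], x - mu[i]))
        / np.sqrt((2 * np.pi) ** d * np.linalg.det(cov[i]))
        for i in range(len(pi))])
    probs = dens / dens.sum()
    return probs, int(np.argmax(probs))
```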
To achieve the above objects, another aspect of the present invention provides an electronic device, comprising: a processor; and a memory containing a text classification program, wherein the processor executes the text classification program to realize the following steps:
preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
constructing a Gaussian mixture model based on the EM algorithm;
initializing the parameters of the Gaussian mixture model according to the labeled training samples;
training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model;
classifying the text to be classified with the text classification model.
Preferably, the processor trains the parameters of the Gaussian mixture model with the EM algorithm by:
substituting the initialized Gaussian mixture model parameters into the EM iterative equations;
obtaining, through the E step of the EM iterative equations, the predicted values of the unlabeled training samples and the corresponding predicted labels, introducing the predicted labels into the training set, and updating the training set;
using the updated training set, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations, thereby completing one iteration;
judging whether the training of the Gaussian mixture model meets a termination condition; if so, outputting the text classification model; if not, continuing to train the parameters of the Gaussian mixture model, wherein the termination condition includes a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations exceeds a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two consecutive iterations is less than a set target value.
Preferably, the E step of the EM iterative equations computes the predicted values of the unlabeled training samples as follows:
γ_ij = π_i · N(x_j | μ_i, Σ_i) / Σ_{k=1}^{m} π_k · N(x_j | μ_k, Σ_k)
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, x_j denotes the feature vector of the j-th training sample, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, π_i denotes the mixing coefficient of the i-th class of training samples in the Gaussian mixture model, N(x_j | μ_i, Σ_i) denotes the probability that the j-th training sample belongs to the i-th class of text given μ_i and Σ_i, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category;
and the M step of the EM iterative equations updates the parameters of the Gaussian mixture model as follows:
μ_i = Σ_{j=1}^{n} γ_ij · x_j / Σ_{j=1}^{n} γ_ij
Σ_i = Σ_{j=1}^{n} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{n} γ_ij
π_i = Σ_{j=1}^{n} γ_ij / n
wherein n denotes the total number of training samples.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
The text classification method and electronic device based on a Gaussian mixture model and the EM algorithm of the present invention establish a Gaussian mixture model based on the EM algorithm and use the model together with the EM algorithm to predict labels for the unlabeled data set, thereby realizing semi-supervised learning on the training samples, reducing the dependence of the trained model on the labeled training data set, making full use of the unlabeled data to further improve the precision of the model, and effectively classifying text with improved accuracy.
Description of the drawings
Fig. 1 is a schematic flowchart of the text classification method of the present invention;
Fig. 2 is a schematic block diagram of the text classification program of the present invention.
The realization of the objects, the functional characteristics and the advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Specific embodiment
Embodiments of the present invention are described below with reference to the accompanying drawings. Those skilled in the art will recognize that the described embodiments can be modified in various different ways, or combinations thereof, without departing from the spirit and scope of the present invention. Therefore, the drawings and the description are illustrative in nature and are intended only to explain the present invention, not to limit the scope of protection of the claims. In addition, in the present specification, the drawings are not drawn to scale, and identical reference numerals denote identical parts.
Fig. 1 is a schematic flowchart of the text classification method of the present invention. As shown in Fig. 1, the text classification method based on a Gaussian mixture model and the EM algorithm of the present invention is applied to an electronic device and comprises the following steps:
Step S1, preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples, wherein a labeled training sample is a training sample that has a class label corresponding to it, an unlabeled training sample is a training sample that has no corresponding class label, and "label" is used hereinafter as short for the class label to which a training sample belongs;
Step S2, constructing a Gaussian mixture model based on the EM algorithm;
Step S3, initializing the parameters of the Gaussian mixture model according to the labeled training samples;
Step S4, training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model;
Step S5, classifying the text to be classified with the text classification model.
By establishing a Gaussian mixture model based on the EM algorithm, the present invention uses the Gaussian mixture model and the EM algorithm to predict labels for the unlabeled data set and obtain a text classification model, realizing semi-supervised learning on the training samples, reducing the dependence on the amount of labeled training data, using the unlabeled training samples to improve the precision of the text classification model, and improving the accuracy of text classification.
Preferably, the step S4 includes:
S41, substituting the initialized Gaussian mixture model parameters into the EM iterative equations;
S42, obtaining, through the E step of the EM iterative equations, the predicted values of the unlabeled training samples and the corresponding predicted labels, introducing the predicted labels into the training set, and updating the training set;
S43, using the updated training set, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations, thereby completing one iteration;
S44, judging whether the training of the Gaussian mixture model meets a termination condition; if so, outputting the text classification model; if not, returning to the step S42 and continuing to train the parameters of the Gaussian mixture model, wherein the termination condition includes a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations exceeds a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two consecutive iterations is less than a set target value.
The labels of the unlabeled training samples are updated through the E step, and the parameters of the Gaussian mixture model are updated through the M step, so that the unlabeled training samples are used for model learning and the precision of the model is improved.
In one embodiment of the present invention, the step S1 includes: constructing a word-vector library; performing word segmentation, word-frequency statistics and deduplication on the data set texts, and converting the data set texts into word vectors according to the word-vector library; selecting training samples, including labeled training samples and unlabeled training samples, from the existing data set texts, and obtaining the feature vector of each training sample from its corresponding word vectors; and constructing the training set from the feature vectors of the training samples and the corresponding labels. For example, n data texts are selected from the existing data set texts as training samples, and their feature vectors and labels are obtained to construct the training set, wherein the n training samples include l labeled training samples and u unlabeled training samples, so the constructed training set is D = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l), x_{l+1}, x_{l+2}, …, x_{l+u}}, where x denotes the feature vector of a training sample and y denotes the label of a training sample.
It should be noted that various word-vector models, for example the Word2Vec model, the CBOW model and the like, can be used in the present invention to convert the data set texts into word vectors.
In the processing of the data set texts, deduplication is performed in order to delete duplicate keywords, so as to prevent texts belonging to different categories from containing identical keywords, which would affect the classification result.
In one embodiment of the present invention, the mean of the word vectors corresponding to a training sample is taken per vector dimension to obtain the feature vector of the training sample.
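The per-dimension averaging of word vectors can be sketched as follows. The word-vector table and its entries here are hypothetical stand-ins for a Word2Vec/CBOW vocabulary:

```python
import numpy as np

# Toy word-vector table standing in for a Word2Vec/CBOW vocabulary (made-up values).
WORD_VECTORS = {
    "finance": np.array([1.0, 0.0]),
    "stock":   np.array([3.0, 2.0]),
}

def doc_feature(tokens, table):
    # Mean of the word vectors, taken per vector dimension, over tokens found in the table.
    vecs = [table[t] for t in tokens if t in table]
    return np.mean(vecs, axis=0)
```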
Preferably, in the step S2, the Gaussian mixture model based on the EM algorithm is given by:
p(x) = Σ_{i=1}^{m} π_i · N(x | μ_i, Σ_i)
wherein i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, π denotes the mixing coefficients of the Gaussian mixture model, μ denotes the mean vector of the feature vector x, Σ denotes a covariance matrix, μ_i denotes the mean vector of the feature vectors of the i-th class of training samples, Σ_i denotes the covariance matrix of the feature vectors of the i-th class of training samples, N(x | μ_i, Σ_i) denotes the probability that training sample x belongs to the i-th class of text given μ_i and Σ_i, and p denotes the probability density of training sample x.
Preferably, in step S3, the parameters of the Gaussian mixture model, including μ_i, Σ_i and π_i, are initialized according to the labeled training samples, and the initial parameters of the Gaussian mixture model are solved as follows:
γ_ij = 1 if the j-th labeled training sample belongs to class i, and γ_ij = 0 otherwise, for j = 1, …, l
μ_i = Σ_{j=1}^{l} γ_ij · x_j / Σ_{j=1}^{l} γ_ij
Σ_i = Σ_{j=1}^{l} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{l} γ_ij
π_i = Σ_{j=1}^{l} γ_ij / l
wherein j denotes the index of a training sample, l denotes the number of labeled training samples, n denotes the total number of training samples, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, x_j denotes the feature vector of the j-th training sample, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
Preferably, in the step S42, the E step of the EM iterative equations computes the predicted values of the unlabeled training samples as follows:
γ_ij = π_i · N(x_j | μ_i, Σ_i) / Σ_{k=1}^{m} π_k · N(x_j | μ_k, Σ_k)
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, x_j denotes the feature vector of the j-th training sample, N(x_j | μ_i, Σ_i) denotes the probability that the j-th training sample belongs to the i-th class of text given μ_i and Σ_i, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category.
The γ_ij obtained by the E step are the predicted values of each unlabeled training sample. According to the sizes of the γ_ij, the category with the largest γ_ij determines the text category to which the j-th unlabeled training sample belongs; this text category is taken as the predicted label of the unlabeled training sample, and the predicted labels are introduced into the training set so that every training sample in the training set carries a label, and the training of the Gaussian mixture model parameters then proceeds on the basis of the updated training set. For example, given the initial parameters μ_i, Σ_i, π_i of the Gaussian mixture model obtained in step S3 and the unlabeled training samples {x_{l+1}, x_{l+2}, …, x_{l+u}} in the training set, the prediction result of the E step yields the predicted labels {y_{l+1}, y_{l+2}, …, y_{l+u}} corresponding to the unlabeled training samples {x_{l+1}, x_{l+2}, …, x_{l+u}}; introducing the predicted labels into the training set gives the updated training set D′ = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l), (x_{l+1}, y_{l+1}), (x_{l+2}, y_{l+2}), …, (x_{l+u}, y_{l+u})}.
Preferably, in the step S43, the M step of the EM iterative equations updates the parameters of the Gaussian mixture model as follows:
μ_i = Σ_{j=1}^{n} γ_ij · x_j / Σ_{j=1}^{n} γ_ij
Σ_i = Σ_{j=1}^{n} γ_ij · (x_j − μ_i)(x_j − μ_i)^T / Σ_{j=1}^{n} γ_ij
π_i = Σ_{j=1}^{n} γ_ij / n
wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the total number of training samples, x_j denotes the feature vector of the j-th training sample, and γ_ij denotes the probability that the j-th training sample belongs to the i-th text category. For the labeled training samples, i.e. j ∈ {1, 2, …, l}, γ_ij takes the value 1 when the category of the training sample corresponds to label i and the value 0 otherwise; for example, for j ∈ {1, 2, …, l}, if only the 1st and 2nd training samples belong to the i-th class, then γ_i1 = 1, γ_i2 = 1, and the remaining γ_ij = 0 for j ∈ {3, 4, …, l}. For the unlabeled training samples, i.e. j ∈ {l+1, l+2, …, n}, the value of γ_ij is computed by the E-step formula.
The parameters of the Gaussian mixture model are then updated through the M step using the training set D′ = {(x_1, y_1), (x_2, y_2), …, (x_l, y_l), (x_{l+1}, y_{l+1}), (x_{l+2}, y_{l+2}), …, (x_{l+u}, y_{l+u})} updated in step S42.
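The split of γ_ij between labeled samples (hard, one-hot values) and unlabeled samples (soft, E-step values) can be sketched as follows; the function name is hypothetical and the layout assumes the labeled samples come first:

```python
import numpy as np

def mixed_responsibilities(gamma_estep, y_lab, m):
    # Labeled samples (j <= l): gamma_ij is 1 for the true label, 0 elsewhere.
    # Unlabeled samples (j > l): gamma_ij keeps the soft value from the E step.
    gamma = np.array(gamma_estep, dtype=float)
    gamma[:len(y_lab)] = np.eye(m)[y_lab]
    return gamma
```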
The unlabeled training samples are then predicted again through the E step using the updated Gaussian mixture model, the predicted labels of the unlabeled training samples are obtained, the training set is updated again, and the parameters of the Gaussian mixture model are updated again through the M step using the updated training set. The E step and M step of the EM iterative equations are cycled in turn until the training termination condition is met and the parameters of the Gaussian mixture model become stable, at which point the text classification model is output.
Preferably, in step S5, classifying the text to be classified with the text classification model includes:
preprocessing the data text to be classified and converting the data text into word vectors according to the word-vector library;
obtaining the feature vector corresponding to the data text from its word vectors, as the input of the text classification model;
inputting the feature vector corresponding to the data text into the text classification model;
outputting, by the text classification model, the probability that the data text to be classified belongs to each text category, and determining the text category to which the data text belongs according to the sizes of the probability values, wherein the text category corresponding to the largest probability value is the text category to which the data text to be classified belongs.
The text classification method based on a Gaussian mixture model and the EM algorithm of the present invention is applied to an electronic device, which may be a terminal device with computing capability such as a smartphone, a tablet computer or a computer.
The electronic device includes: a processor; and a memory containing a text classification program, wherein the processor executes the text classification program to realize the following steps:
preprocessing an existing data set of texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
constructing a Gaussian mixture model based on the EM algorithm;
initializing the parameters of the Gaussian mixture model according to the labeled training samples;
training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model;
classifying the text to be classified with the text classification model.
In the present invention, the processor is used to run the program stored in the memory so as to realize text classification; for example, the processor may be a central processing unit, a microprocessor or another data processing chip.
In the present invention, the memory is used to store the program that the processor needs to execute and includes at least one type of readable storage medium, for example a non-volatile storage medium such as a flash memory or a hard disk. The memory may be an internal storage unit of the electronic device or an external memory, such as a plug-in hard disk, a flash card or another type of memory card. The present invention is not limited thereto; the memory may be any device that stores instructions or software and any associated data files in a non-transitory manner and provides the instructions or software program to the processor so that the processor can execute them.
The electronic device of the present invention uses the Gaussian mixture model and the EM algorithm to predict labels for the unlabeled data set and obtain a text classification model, realizing semi-supervised learning on the training samples, reducing the dependence on the amount of labeled training data, using the unlabeled training samples to improve the precision of the text classification model, and improving the accuracy of text classification.
In the present invention, the processor executes the text classification program and trains the parameters of the Gaussian mixture model with the EM algorithm by:
substituting the initialized Gaussian mixture model parameters into the EM iterative equations;
obtaining, through the E step of the EM iterative equations, the predicted values of the unlabeled training samples and the corresponding predicted labels, introducing the predicted labels into the training set, and updating the training set;
using the updated training set, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations, thereby completing one iteration;
judging whether the training of the Gaussian mixture model meets a termination condition; if so, outputting the text classification model; if not, continuing to train the parameters of the Gaussian mixture model, wherein the termination condition includes a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations exceeds a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two consecutive iterations is less than a set target value.
In one embodiment of the present of invention, existing data set text is pre-processed, building training set includes: building Term vector library;Data set text is segmented, word frequency statistics and duplicate removal turn data set text according to the term vector library Turn to term vector;Training sample is selected from data with existing collection text, includes category training sample and without category training sample, The feature vector of training sample is obtained according to the corresponding term vector of training sample;According to the feature vector and correspondence of training sample Category construct training set.For example, selecting n data text therein as training sample from data with existing collection text, obtain The corresponding feature vector of training text and category is taken to construct to form training set, wherein in n training sample, including l have Category training sample and a training set without category training sample, then constructed of u are D={ (x1,y1),(x2,y2),…,(xl,yl), xl+1,xl+2,…,xl+u, x indicates the feature vector of training sample, and y indicates the category of training sample.
It should be noted that various term vector models can be used by converting term vector for data set text in the present invention, For example, Word2Vec model, CBOW model etc..
In the processing to data set text, duplicate removal is in order to delete duplicate keyword, to avoid belonging to a different category Text in include identical keyword, influence classification results.
In one embodiment of the present invention, the feature vector of a training sample is obtained by averaging the word vectors corresponding to the training sample along each vector dimension.
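A minimal sketch of this averaging step is given below. The word vector library here is a hypothetical toy dictionary with made-up words and values; in practice it would come from a Word2Vec or CBOW model as noted above.

```python
# Hypothetical 3-dimensional word vector library (illustrative values only).
word_vectors = {
    "loan":   [0.9, 0.1, 0.0],
    "credit": [0.8, 0.2, 0.1],
    "tennis": [0.0, 0.9, 0.8],
}

def feature_vector(tokens, library):
    """Average the word vectors of the known tokens along each dimension,
    yielding the sample's feature vector as described in the text."""
    vecs = [library[t] for t in tokens if t in library]
    if not vecs:
        raise ValueError("no known tokens in sample")
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# A two-word sample maps to the element-wise mean of its word vectors.
x = feature_vector(["loan", "credit"], word_vectors)
```

The names `word_vectors` and `feature_vector` are assumptions for illustration, not identifiers from the patent.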
Preferably, the Gaussian mixture model based on the EM algorithm is given by the following formula:

p(x) = Σi πi·N(x|μi, ∑i), with the sum running over the text categories i = 1, …, m.
Wherein, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, π denotes the mixing coefficients of the Gaussian mixture model, μ denotes the mean vectors of the feature vectors, ∑ denotes the covariance matrices, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, N(x|μi, ∑i) denotes the probability that training sample x belongs to the i-th text category under the conditions μi and ∑i, and p denotes the conditional probability of training sample x.
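The mixture density p(x) = Σi πi·N(x|μi, ∑i) can be sketched as follows. For brevity the covariance matrices are assumed diagonal, which is a simplification of my own; the patent places no such restriction.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density N(x | mu, diag(var)) of a Gaussian with diagonal covariance."""
    p = 1.0
    for xd, md, vd in zip(x, mu, var):
        p *= math.exp(-(xd - md) ** 2 / (2 * vd)) / math.sqrt(2 * math.pi * vd)
    return p

def mixture_density(x, pis, mus, vars_):
    """p(x) = sum_i pi_i * N(x | mu_i, Sigma_i) over the m components."""
    return sum(pi * gaussian_pdf(x, mu, var)
               for pi, mu, var in zip(pis, mus, vars_))

# Two equally weighted 1-D components centred at 0 and 4:
p = mixture_density([0.0], pis=[0.5, 0.5], mus=[[0.0], [4.0]], vars_=[[1.0], [1.0]])
```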
Preferably, the parameters of the Gaussian mixture model are initialized according to the labeled training samples, the parameters comprising μi, ∑i and πi; the initial parameters of the Gaussian mixture model are solved according to the following formulas:

γij = 1 if the label of the j-th labeled training sample is the i-th category, and γij = 0 otherwise;
μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / l;

with the sums running over the labeled training samples j = 1, …, l.
Wherein, j denotes the index of a training sample, l denotes the number of labeled training samples, n denotes the number of training samples, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
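The initialization from the labeled samples can be sketched as below: γij is the 0/1 label indicator, so μi, ∑i and πi reduce to the per-class mean, per-class variance and class fraction over the l labeled samples. Diagonal covariances are assumed for brevity (my simplification, not the patent's).

```python
def init_params(X_labeled, y_labeled, m):
    """Per-class mean, (diagonal) variance and mixing coefficient
    computed from the l labeled training samples."""
    l = len(X_labeled)
    dim = len(X_labeled[0])
    pis, mus, vars_ = [], [], []
    for i in range(m):
        # gamma_ij = 1 exactly when sample j carries label i
        members = [x for x, y in zip(X_labeled, y_labeled) if y == i]
        mu = [sum(x[d] for x in members) / len(members) for d in range(dim)]
        # floor the variance to keep the Gaussian density well defined
        var = [max(sum((x[d] - mu[d]) ** 2 for x in members) / len(members), 1e-6)
               for d in range(dim)]
        pis.append(len(members) / l)
        mus.append(mu)
        vars_.append(var)
    return pis, mus, vars_
```

`init_params` is a hypothetical helper name chosen for the sketch.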
Preferably, the E step of the EM iterative equations calculates the predicted values of the unlabeled training samples according to the following formula:

γij = πi·N(xj|μi, ∑i) / Σk πk·N(xj|μk, ∑k), with the sum running over the text categories k = 1, …, m.
Wherein, i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, xj denotes the feature vector of the j-th training sample, N(xj|μi, ∑i) denotes the probability that the j-th training sample belongs to the i-th text category under the conditions μi and ∑i, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
The γij obtained by the E step serves as the predicted value of each unlabeled training sample. According to the magnitude of γij, the text category for which γij is largest is determined as the text category to which the j-th unlabeled training sample belongs, this text category is taken as the predicted label of the unlabeled training sample, and the predicted labels are introduced into the training set, so that every training sample in the training set carries a label and the training of the Gaussian mixture model parameters can be carried out on the basis of the updated training set. For example, according to the initial parameters μi, ∑i, πi of the Gaussian mixture model and the unlabeled training samples {xl+1,xl+2,…,xl+u} in the training set, the prediction results of the E step yield the predicted labels {yl+1,yl+2,…,yl+u} corresponding to the unlabeled training samples {xl+1,xl+2,…,xl+u}; introducing the predicted labels into the training set gives the updated training set D′={(x1,y1),(x2,y2),…,(xl,yl),(xl+1,yl+1),(xl+2,yl+2),…,(xl+u,yl+u)}.
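The E step and the argmax label assignment described above can be sketched as follows, again assuming diagonal covariances for brevity:

```python
import math

def gaussian_pdf(x, mu, var):
    """Density N(x | mu, diag(var)) with diagonal covariance."""
    p = 1.0
    for xd, md, vd in zip(x, mu, var):
        p *= math.exp(-(xd - md) ** 2 / (2 * vd)) / math.sqrt(2 * math.pi * vd)
    return p

def e_step(x, pis, mus, vars_):
    """Responsibilities gamma_i for one sample:
    gamma_ij = pi_i N(x_j|mu_i,S_i) / sum_k pi_k N(x_j|mu_k,S_k)."""
    numer = [pi * gaussian_pdf(x, mu, var)
             for pi, mu, var in zip(pis, mus, vars_)]
    total = sum(numer)
    return [v / total for v in numer]

def predict_label(x, pis, mus, vars_):
    """Predicted label of an unlabeled sample: the category of largest gamma."""
    gamma = e_step(x, pis, mus, vars_)
    return max(range(len(gamma)), key=gamma.__getitem__)
```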
Preferably, the M step of the EM iterative equations updates the parameters of the Gaussian mixture model according to the following formulas:

μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / n;

with the sums running over all training samples j = 1, …, n.
Wherein, i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the number of training samples, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category. For the labeled training samples, i.e. j ∈ {1,2,…,l}, γij takes the value 1 when the category of the training sample corresponds to its label and 0 when it does not; for example, for j ∈ {1,2,…,l}, if only the 1st and 2nd training samples belong to the i-th category, then γi1 = 1, γi2 = 1 and the remaining γij = 0 for j ∈ {3,4,…,l}. For the unlabeled training samples, i.e. j ∈ {l+1, l+2, …, n}, the value of γij is calculated by the E-step formula.
The M step then updates the parameters of the Gaussian mixture model according to the updated training set D′={(x1,y1),(x2,y2),…,(xl,yl),(xl+1,yl+1),(xl+2,yl+2),…,(xl+u,yl+u)}.
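The M-step update can be sketched as below, taking the responsibility matrix gamma[i][j] over all n samples (0/1 entries for the labeled samples, E-step values for the unlabeled ones). Diagonal covariances are again my simplification.

```python
def m_step(X, gamma):
    """Re-estimate mu_i, (diagonal) Sigma_i and pi_i from responsibilities:
    mu_i    = sum_j gamma_ij x_j / sum_j gamma_ij
    Sigma_i = sum_j gamma_ij (x_j - mu_i)^2 / sum_j gamma_ij   (diagonal)
    pi_i    = sum_j gamma_ij / n."""
    m, n, dim = len(gamma), len(X), len(X[0])
    pis, mus, vars_ = [], [], []
    for i in range(m):
        w = sum(gamma[i])  # effective number of samples in class i
        mu = [sum(gamma[i][j] * X[j][d] for j in range(n)) / w
              for d in range(dim)]
        var = [max(sum(gamma[i][j] * (X[j][d] - mu[d]) ** 2
                       for j in range(n)) / w, 1e-6)
               for d in range(dim)]
        pis.append(w / n)
        mus.append(mu)
        vars_.append(var)
    return pis, mus, vars_
```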
The unlabeled training samples are then predicted again by the E step using the updated Gaussian mixture model, the predicted labels of the unlabeled training samples are obtained, the training set is updated again, and the parameters of the Gaussian mixture model are updated again by the M step using the updated training set; the E step and M step of the EM iterative equations are cycled in this way until the training termination condition is satisfied and the parameters of the Gaussian mixture model tend to be stable, at which point the text classification model is output.
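The complete training procedure described above (initialization from the labeled samples, alternating E and M steps, and the two termination conditions) can be sketched end to end as follows. One-dimensional features, diagonal covariances and all function names are illustrative assumptions, not the patent's implementation.

```python
import math

def npdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def train(X, y, l, m, max_iter=50, tol=1e-6):
    """Semi-supervised GMM/EM: X holds n 1-D samples, the first l are labeled
    with y[0..l-1]; m is the number of text categories."""
    n = len(X)
    # Initialization from the l labeled samples.
    pis, mus, vars_ = [], [], []
    for i in range(m):
        cls = [X[j] for j in range(l) if y[j] == i]
        mu = sum(cls) / len(cls)
        var = max(sum((x - mu) ** 2 for x in cls) / len(cls), 1e-6)
        pis.append(len(cls) / l); mus.append(mu); vars_.append(var)
    prev = None
    for _ in range(max_iter):                  # first termination condition
        # E step: responsibilities (fixed 0/1 for the labeled samples).
        gamma = [[0.0] * n for _ in range(m)]
        for j in range(n):
            if j < l:
                gamma[y[j]][j] = 1.0
            else:
                num = [pis[i] * npdf(X[j], mus[i], vars_[i]) for i in range(m)]
                s = sum(num)
                for i in range(m):
                    gamma[i][j] = num[i] / s
        # Second termination condition: change in E-step predictions.
        if prev is not None and max(
                (abs(gamma[i][j] - prev[i][j])
                 for i in range(m) for j in range(l, n)), default=0.0) < tol:
            break
        prev = [row[:] for row in gamma]
        # M step: re-estimate pi_i, mu_i, Sigma_i over all n samples.
        for i in range(m):
            w = sum(gamma[i])
            mus[i] = sum(gamma[i][j] * X[j] for j in range(n)) / w
            vars_[i] = max(sum(gamma[i][j] * (X[j] - mus[i]) ** 2
                               for j in range(n)) / w, 1e-6)
            pis[i] = w / n
    return pis, mus, vars_
```

With two well separated classes the unlabeled points are absorbed into the nearer component within a couple of iterations, which is the stabilization behaviour the text describes.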
Preferably, classifying the text to be classified using the text classification model comprises:
preprocessing the data text to be classified and converting the data text into word vectors according to the word vector library;
obtaining the feature vector corresponding to the data text from the word vectors, as the input of the text classification model;
inputting the feature vector corresponding to the data text into the text classification model; and
outputting, by the text classification model, the probability values that the data text to be classified belongs to each text category, and determining the text category to which the data text to be classified belongs according to the magnitude of the probability values, wherein the text category corresponding to the largest probability value is the text category to which the data text to be classified belongs.
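The classification stage above can be sketched as follows: the feature vector of the new text is scored against each component, the scores are normalized into per-category probability values, and the category with the largest value is returned. One-dimensional features and the name `classify` are assumptions of the sketch.

```python
import math

def npdf(x, mu, var):
    """1-D Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def classify(x, pis, mus, vars_):
    """Return (predicted category, per-category probability values)."""
    scores = [pi * npdf(x, mu, var) for pi, mu, var in zip(pis, mus, vars_)]
    total = sum(scores)
    probs = [s / total for s in scores]           # normalized gamma values
    return probs.index(max(probs)), probs

# A sample near the first component is assigned to category 0.
label, probs = classify(1.2, pis=[0.5, 0.5], mus=[0.0, 10.0], vars_=[1.0, 1.0])
```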
In one embodiment of the present invention, the text classification program may be divided into one or more modules, the one or more modules being stored in the memory and executed by the processor to realize the text classification. A module in the present invention refers to a series of computer program instruction segments capable of completing a specific function. Fig. 2 is a module diagram of the text classification program of the present invention; as shown in Fig. 2, the program comprises a training set acquisition module 1, a model construction module 2, an initialization module 3, a model training module 4 and a classification module 5. The functions or operation steps realized by each module are similar to those described above and will not be described in detail here. Illustratively:
the training set acquisition module 1 preprocesses the existing data set texts and constructs the training set, the training set including labeled training samples and unlabeled training samples;
the model construction module 2 constructs the Gaussian mixture model based on the EM algorithm;
the initialization module 3 initializes the parameters of the Gaussian mixture model according to the labeled training samples;
the model training module 4 trains the parameters of the Gaussian mixture model with the EM algorithm to obtain the text classification model; and
the classification module 5 classifies the text to be classified using the text classification model.
Further, the model training module 4 comprises:
a parameter input unit 41, which substitutes the initialized Gaussian mixture model parameters into the EM iterative equations;
a label prediction unit 42, which obtains the predicted values of the unlabeled training samples through the E step of the EM iterative equations, predicts their labels accordingly, introduces the predicted labels into the training set, and updates the training set;
a parameter updating unit 43, which updates the parameters of the Gaussian mixture model through the M step of the EM iterative equations using the updated training set, completing one iteration; and
a judging unit 44, which judges whether the training of the Gaussian mixture model satisfies a termination condition; if the termination condition is satisfied, the text classification model is output, and if not, the E step and M step of the EM iterative equations are cycled to continue training the parameters of the Gaussian mixture model, wherein the termination condition comprises a first termination condition and/or a second termination condition: the first termination condition is that the number of iterations is greater than a set maximum number of iterations, and the second termination condition is that the difference between the predicted values obtained by the E step in two adjacent iterations is less than a set target value.
It should be noted that, in this document, the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article or method. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, device, article or method that includes that element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be realized by means of software plus the necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, can be embodied in the form of a software product stored in a storage medium (such as a ROM/RAM, magnetic disk or optical disc) as described above, including several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present invention.
The above is only a preferred embodiment of the present invention and does not limit the scope of the present invention; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, applied directly or indirectly in other related technical fields, is likewise included within the scope of the present invention.

Claims (10)

1. A text classification method based on a Gaussian mixture model and the EM algorithm, applied to an electronic device, characterized in that it comprises the following steps:
step S1, preprocessing existing data set texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
step S2, constructing a Gaussian mixture model based on the EM algorithm;
step S3, initializing the parameters of the Gaussian mixture model according to the labeled training samples;
step S4, training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model; and
step S5, classifying a text to be classified using the text classification model.
2. The text classification method according to claim 1, characterized in that the step S4 comprises:
S41, substituting the initialized Gaussian mixture model parameters into EM iterative equations;
S42, obtaining the predicted values of the unlabeled training samples through the E step of the EM iterative equations, predicting their labels accordingly, introducing the predicted labels into the training set, and updating the training set;
S43, updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations using the updated training set, completing one iteration; and
S44, judging whether the training of the Gaussian mixture model satisfies a termination condition; if the termination condition is satisfied, outputting the text classification model, and if not, returning to step S42 to continue training the parameters of the Gaussian mixture model, wherein the termination condition comprises a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations is greater than a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two adjacent iterations is less than a set target value.
3. The text classification method according to claim 1, characterized in that in the step S2, the Gaussian mixture model based on the EM algorithm is given by the following formula:

p(x) = Σi πi·N(x|μi, ∑i), with the sum running over the text categories i = 1, …, m,

wherein i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, π denotes the mixing coefficients of the Gaussian mixture model, μ denotes the mean vectors of the feature vectors, ∑ denotes the covariance matrices, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, N(x|μi, ∑i) denotes the probability that training sample x belongs to the i-th text category under the conditions μi and ∑i, and p denotes the conditional probability of training sample x.
4. The text classification method according to claim 3, characterized in that in step S3, the parameters of the Gaussian mixture model comprise μi, ∑i and πi, and the initialization solves the parameters of the Gaussian mixture model according to the following formulas:

γij = 1 if the label of the j-th labeled training sample is the i-th category, and γij = 0 otherwise;
μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / l;

with the sums running over the labeled training samples j = 1, …, l, wherein j denotes the index of a training sample, l denotes the number of labeled training samples, n denotes the number of training samples, i denotes the index of the text category to which a training sample belongs, x denotes the feature vector of a training sample, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
5. The text classification method according to claim 2, characterized in that the E step of the EM iterative equations calculates the predicted values of the unlabeled training samples according to the following formula:

γij = πi·N(xj|μi, ∑i) / Σk πk·N(xj|μk, ∑k), with the sum running over the text categories k = 1, …, m,

wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, xj denotes the feature vector of the j-th training sample, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, N(xj|μi, ∑i) denotes the probability that the j-th training sample belongs to the i-th text category under the conditions μi and ∑i, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
6. The text classification method according to claim 5, characterized in that the M step of the EM iterative equations updates the parameters of the Gaussian mixture model according to the following formulas:

μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / n;

with the sums running over all training samples j = 1, …, n, wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the number of training samples, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
7. The text classification method according to claim 1, characterized in that classifying the text to be classified using the text classification model comprises:
preprocessing the data text to be classified and converting the data text into word vectors according to the word vector library;
obtaining the feature vector corresponding to the data text from the word vectors, as the input of the text classification model;
inputting the feature vector corresponding to the data text into the text classification model; and
outputting, by the text classification model, the probability values that the data text to be classified belongs to each text category, the text category corresponding to the largest probability value being the text category to which the data text to be classified belongs.
8. An electronic device, characterized in that the electronic device comprises: a processor; and a memory, the memory including a text classification program, the processor executing the text classification program to realize the following steps:
preprocessing existing data set texts and constructing a training set, the training set including labeled training samples and unlabeled training samples;
constructing a Gaussian mixture model based on the EM algorithm;
initializing the parameters of the Gaussian mixture model according to the labeled training samples;
training the parameters of the Gaussian mixture model with the EM algorithm to obtain a text classification model; and
classifying a text to be classified using the text classification model.
9. The electronic device according to claim 8, characterized in that training the parameters of the Gaussian mixture model with the EM algorithm comprises:
substituting the initialized Gaussian mixture model parameters into EM iterative equations;
obtaining the predicted values of the unlabeled training samples through the E step of the EM iterative equations, predicting their labels accordingly, introducing the predicted labels into the training set, and updating the training set;
updating the parameters of the Gaussian mixture model through the M step of the EM iterative equations using the updated training set, completing one iteration; and
judging whether the training of the Gaussian mixture model satisfies a termination condition; if the termination condition is satisfied, outputting the text classification model, and if not, continuing to train the parameters of the Gaussian mixture model, wherein the termination condition comprises a first termination condition and/or a second termination condition, the first termination condition being that the number of iterations is greater than a set maximum number of iterations, and the second termination condition being that the difference between the predicted values obtained by the E step in two adjacent iterations is less than a set target value.
10. The electronic device according to claim 9, characterized in that the E step of the EM iterative equations calculates the predicted values of the unlabeled training samples according to the following formula:

γij = πi·N(xj|μi, ∑i) / Σk πk·N(xj|μk, ∑k), with the sum running over the text categories k = 1, …, m,

wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, m denotes the total number of text categories, xj denotes the feature vector of the j-th training sample, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, N(xj|μi, ∑i) denotes the probability that the j-th training sample belongs to the i-th text category under the conditions μi and ∑i, and γij denotes the probability value that the j-th training sample belongs to the i-th text category;

and that the M step of the EM iterative equations updates the parameters of the Gaussian mixture model according to the following formulas:

μi = Σj γij·xj / Σj γij;
∑i = Σj γij·(xj − μi)(xj − μi)ᵀ / Σj γij;
πi = (Σj γij) / n;

with the sums running over all training samples j = 1, …, n, wherein i denotes the index of the text category to which a training sample belongs, j denotes the index of a training sample, n denotes the number of training samples, μi denotes the mean vector of the feature vectors of the i-th class of training samples, ∑i denotes the covariance matrix of the feature vectors of the i-th class of training samples, πi denotes the mixing coefficient of the i-th class of training samples of the Gaussian mixture model, xj denotes the feature vector of the j-th training sample, and γij denotes the probability value that the j-th training sample belongs to the i-th text category.
CN201811159037.3A 2018-09-30 2018-09-30 File classification method and electronic device based on gauss hybrid models and EM algorithm Withdrawn CN109492093A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811159037.3A CN109492093A (en) 2018-09-30 2018-09-30 File classification method and electronic device based on gauss hybrid models and EM algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811159037.3A CN109492093A (en) 2018-09-30 2018-09-30 File classification method and electronic device based on gauss hybrid models and EM algorithm

Publications (1)

Publication Number Publication Date
CN109492093A true CN109492093A (en) 2019-03-19

Family

ID=65690068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811159037.3A Withdrawn CN109492093A (en) 2018-09-30 2018-09-30 File classification method and electronic device based on gauss hybrid models and EM algorithm

Country Status (1)

Country Link
CN (1) CN109492093A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110400610A (en) * 2019-06-19 2019-11-01 西安电子科技大学 Small sample clinical data classification method and system based on multichannel random forest
CN110400610B (en) * 2019-06-19 2022-04-15 西安电子科技大学 Small sample clinical data classification method and system based on multichannel random forest
CN110457467A (en) * 2019-07-02 2019-11-15 厦门美域中央信息科技有限公司 A kind of information technology file classification method based on gauss hybrid models
CN110363359A (en) * 2019-07-23 2019-10-22 中国联合网络通信集团有限公司 A kind of occupation prediction technique and system
WO2021042556A1 (en) * 2019-09-03 2021-03-11 平安科技(深圳)有限公司 Classification model training method, apparatus and device, and computer-readable storage medium
CN110705592A (en) * 2019-09-03 2020-01-17 平安科技(深圳)有限公司 Classification model training method, device, equipment and computer readable storage medium
CN110705592B (en) * 2019-09-03 2024-05-14 平安科技(深圳)有限公司 Classification model training method, device, equipment and computer readable storage medium
CN111475648A (en) * 2020-03-30 2020-07-31 东软集团股份有限公司 Text classification model generation method, text classification method, device and equipment
CN111475648B (en) * 2020-03-30 2023-11-14 东软集团股份有限公司 Text classification model generation method, text classification device and equipment
CN112100377A (en) * 2020-09-14 2020-12-18 腾讯科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN112100377B (en) * 2020-09-14 2024-03-29 腾讯科技(深圳)有限公司 Text classification method, apparatus, computer device and storage medium
CN112115268A (en) * 2020-09-28 2020-12-22 支付宝(杭州)信息技术有限公司 Training method and device and classification method and device based on feature encoder
CN112115268B (en) * 2020-09-28 2024-04-09 支付宝(杭州)信息技术有限公司 Training method and device based on feature encoder, and classifying method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190319