CN110196908A - Data classification method, device, computer apparatus and storage medium - Google Patents
Data classification method, device, computer apparatus and storage medium
- Publication number
- CN110196908A CN110196908A CN201910310574.1A CN201910310574A CN110196908A CN 110196908 A CN110196908 A CN 110196908A CN 201910310574 A CN201910310574 A CN 201910310574A CN 110196908 A CN110196908 A CN 110196908A
- Authority
- CN
- China
- Prior art keywords
- data
- data set
- sorted
- training
- discrimination model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 101
- 238000009434 installation Methods 0.000 title claims abstract description 24
- 238000003860 storage Methods 0.000 title claims abstract description 18
- 238000012549 training Methods 0.000 claims abstract description 90
- 238000002372 labelling Methods 0.000 claims abstract description 86
- 230000006870 function Effects 0.000 claims description 116
- 238000010801 machine learning Methods 0.000 claims description 36
- 238000004590 computer program Methods 0.000 claims description 22
- 230000008451 emotion Effects 0.000 claims description 18
- 239000011159 matrix material Substances 0.000 claims description 14
- 230000008447 perception Effects 0.000 claims description 7
- 230000004069 differentiation Effects 0.000 claims description 6
- 238000005070 sampling Methods 0.000 description 16
- 238000004422 calculation algorithm Methods 0.000 description 10
- 238000012935 Averaging Methods 0.000 description 8
- 238000012937 correction Methods 0.000 description 6
- 238000012545 processing Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 4
- 238000007781 pre-processing Methods 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000013480 data collection Methods 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 238000010606 normalization Methods 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000013499 data model Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
- 238000013526 transfer learning Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a data classification method, a data classification device, a computer apparatus, and a storage medium. The method includes: obtaining a data set to be labeled; labeling the data set with labelling functions to obtain initial labels of the data set; computing the pairwise correlations of the labelling functions from the initial labels and constructing a generative model of the labelling functions from those pairwise correlations; estimating probabilistic labels of the data set with the generative model; training a discriminative model on the probabilistic labels to obtain a trained discriminative model; and inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified. The present invention improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
Description
Technical field
The present invention relates to the field of machine learning, and in particular to a data classification method, a data classification device, a computer apparatus, and a computer storage medium.
Background technique
With the rapid development of artificial intelligence, machine learning techniques (especially deep learning) have been applied in many industries. As a result, the labeling of training data has increasingly become the main bottleneck in the widespread deployment of machine learning systems.
Traditional manual labeling is time-consuming, labor-intensive, and expensive, while existing data enhancement methods such as semi-supervised learning, active learning, and transfer learning cannot generate training data quickly and at scale.
How to devise a suitable scheme that reduces the workload of manual labeling and improves the labeling efficiency of training data is therefore a technical problem that needs to be addressed.
Summary of the invention
In view of the foregoing, it is necessary to propose a data classification method, a data classification device, a computer apparatus, and a computer storage medium that improve the labeling efficiency of training data and classify data quickly and accurately.
The first aspect of the application provides a data classification method applied to a machine learning system. The method comprises:
obtaining a data set to be labeled {x_i | i = 1, 2, …, m};
labeling the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
computing the pairwise correlations of the labelling functions from the initial labels, and constructing a generative model of the labelling functions from the pairwise correlations;
estimating probabilistic labels of the data set with the generative model;
training a discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In another possible implementation, in the generative model, Λ denotes the initial label matrix formed by the initial labels, Y denotes the true label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, …, m, are the pairwise correlation factors of the labelling functions for each data item in the data set, and w is the parameter of the generative model to be estimated, with w ∈ R^{2n+|C|}.
In another possible implementation, the pairwise correlation is defined in terms of the indicator II{Λ_{i,j} = Λ_{i,k}}, which takes one value when the condition in braces holds and another value when it does not.
In another possible implementation, the data set to be labeled is an image set and the data to be classified is an image to be classified; or
the data set to be labeled is a text set and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set and the data to be classified is speech to be classified.
In another possible implementation, inputting the data to be classified into the trained discriminative model to obtain the category of the data to be classified comprises:
inputting the image to be classified into the trained discriminative model to obtain the user, object, or face attribute corresponding to the image to be classified;
inputting the text to be classified into the trained discriminative model to obtain the sentiment tendency, subject matter, or technical field corresponding to the text to be classified;
inputting the speech to be classified into the trained discriminative model to obtain the user, age bracket, or emotion corresponding to the speech to be classified.
In another possible implementation, training the discriminative model of the machine learning system on the probabilistic labels comprises:
training the discriminative model on the probabilistic labels by minimizing a noise-aware variant of the loss function of the discriminative model.
In another possible implementation, before labeling the data set with the labelling functions λ_j (j = 1, 2, …, n), the method further comprises:
filling missing values in the data set; and/or
correcting outliers in the data set.
The second aspect of the application provides a data classification device applied to a machine learning system. The device comprises:
an obtaining module, configured to obtain a data set to be labeled {x_i | i = 1, 2, …, m};
a labeling module, configured to label the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
a construction module, configured to compute the pairwise correlations of the labelling functions from the initial labels and to construct a generative model of the labelling functions from the pairwise correlations;
an estimation module, configured to estimate probabilistic labels of the data set with the generative model;
a training module, configured to train a discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
a classification module, configured to input data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In another possible implementation, in the generative model, Λ denotes the initial label matrix formed by the initial labels, Y denotes the true label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, …, m, are the pairwise correlation factors of the labelling functions for each data item in the data set, and w is the parameter of the generative model to be estimated, with w ∈ R^{2n+|C|}.
In another possible implementation, the pairwise correlation is defined in terms of the indicator II{Λ_{i,j} = Λ_{i,k}}, which takes one value when the condition in braces holds and another value when it does not.
In another possible implementation, the data set to be labeled is an image set and the data to be classified is an image to be classified; or
the data set to be labeled is a text set and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set and the data to be classified is speech to be classified.
In another possible implementation, inputting the data to be classified into the trained discriminative model to obtain the category of the data to be classified comprises:
inputting the image to be classified into the trained discriminative model to obtain the user, object, or face attribute corresponding to the image to be classified;
inputting the text to be classified into the trained discriminative model to obtain the sentiment tendency, subject matter, or technical field corresponding to the text to be classified;
inputting the speech to be classified into the trained discriminative model to obtain the user, age bracket, or emotion corresponding to the speech to be classified.
In another possible implementation, training the discriminative model of the machine learning system on the probabilistic labels comprises:
training the discriminative model on the probabilistic labels by minimizing a noise-aware variant of the loss function of the discriminative model.
In another possible implementation, the device further comprises:
a preprocessing module, configured to fill missing values in the data set and/or correct outliers in the data set.
The third aspect of the application provides a computer apparatus. The computer apparatus comprises a processor, and the processor implements the data classification method when executing a computer program stored in a memory.
The fourth aspect of the application provides a computer storage medium on which a computer program is stored, and the computer program implements the data classification method when executed by a processor.
The present invention obtains a data set to be labeled {x_i | i = 1, 2, …, m}; labels the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n; computes the pairwise correlations of the labelling functions from the initial labels and constructs a generative model of the labelling functions from the pairwise correlations; estimates probabilistic labels of the data set with the generative model; trains the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model; and inputs data to be classified into the trained discriminative model to obtain the category of the data to be classified. The present invention can quickly generate the training data required by the discriminative model of a machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
Brief description of the drawings
Fig. 1 is a flowchart of a data classification method provided by an embodiment of the present invention.
Fig. 2 is a structural diagram of a data classification device provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer apparatus provided by an embodiment of the present invention.
Specific embodiment
To better understand the objects, features, and advantages of the present invention, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, in the absence of conflict, the embodiments of the application and the features in the embodiments can be combined with each other.
In the following description, numerous specific details are set forth in order to facilitate a full understanding of the present invention. The described embodiments are only a part of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention. The terms used in the specification of the present invention are only for the purpose of describing specific embodiments and are not intended to limit the present invention.
Preferably, the data classification method of the present invention is applied in one or more computer apparatuses. A computer apparatus is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer apparatus may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The computer apparatus can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice-control device, or the like.
Embodiment one
Fig. 1 is a flowchart of the data classification method provided by Embodiment 1 of the present invention. The data classification method is applied to a computer apparatus.
The data classification method of the present invention is applied to a machine learning system: it generates training data, trains the discriminative model of the machine learning system on the generated training data, and classifies data to be classified using the trained discriminative model. The data classification method can quickly generate the training data required by the discriminative model of the machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
As shown in Figure 1, the data classification method includes:
Step 101, obtaining a data set to be labeled {x_i | i = 1, 2, …, m}.
The data set to be labeled includes multiple data items that need to be labeled. After each data item in the data set is labeled, the label of the data item is obtained and used to train the discriminative model.
In an application scenario where the data classification method is applied to image classification, the data set to be labeled may be an image set. For example, the data set to be labeled includes images of different users, and the user corresponding to each image is to be labeled. As another example, the data set to be labeled includes images of different objects (such as a pen, a ball, or a book), and the object corresponding to each image is to be labeled. As yet another example, the data set to be labeled includes images of different face attributes (such as race, gender, age, or expression), and the face attribute corresponding to each image is to be labeled.
In an application scenario where the data classification method is applied to text classification, the data set to be labeled may be a text set. For example, the data set to be labeled includes texts of different sentiment tendencies, and the sentiment tendency corresponding to each text is to be labeled. As another example, the data set to be labeled includes texts of different subject matters, and the subject matter corresponding to each text is to be labeled. As yet another example, the data set to be labeled includes texts of different technical fields (such as physics, chemistry, or machinery), and the technical field corresponding to each text is to be labeled.
In an application scenario where the data classification method is applied to speech classification, the data set to be labeled may be a speech set. For example, the data set to be labeled includes speech of multiple different users, and the user corresponding to each speech sample is to be labeled. As another example, the data set to be labeled includes speech of users of multiple different age brackets, and the age bracket corresponding to each speech sample is to be labeled. As yet another example, the data set to be labeled includes speech of multiple different emotions, and the emotion corresponding to each speech sample is to be labeled.
The data set to be labeled may consist of data collected in real time. For example, person images may be captured in real time, and the captured person images are used as the data set to be labeled.
Alternatively, the data set to be labeled may be obtained from a preset data source. For example, data may be read from a preset database, in which a large amount of data (such as images) has been stored in advance, to form the data set to be labeled.
Alternatively, the data set to be labeled may be received from user input. For example, multiple images input by a user may be received and used as the data set to be labeled.
Step 102, labeling the data set {x_i | i = 1, 2, …, m} with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), where i = 1, 2, …, m and j = 1, 2, …, n.
A labelling function is a function that expresses a mapping from data to labels: it receives a data item and outputs a label for that data item. A labelling function is a black-box function and can be expressed as λ: X → Y ∪ {∅}, where λ denotes the labelling function, X denotes the data, Y denotes the initial label corresponding to X, and ∅ indicates that the labelling function abstains.
Compared with manual labeling of training data, labelling functions make it possible to generate the initial labels from various weak supervision sources (such as heuristic information or external knowledge bases). For example, suppose an image contains two persons, A and B, and their relationship is to be labeled. If it is known that A is the father of D and that B is the mother of D, then according to the heuristic rule "if X is the father of Z and Y is the mother of Z, then X and Y are husband and wife", the annotation result (i.e. the initial label) that A and B are husband and wife is obtained.
Labelling functions are not required to be accurate. That is, the initial labels obtained from labelling functions are unreliable. Unreliability may include incorrect labels, multiple labels, insufficient labels, partial labels, and so on.
Multiple labelling functions may be defined in advance as needed, for example 6 labelling functions.
Different labelling functions are allowed to conflict on the same data item. For example, labelling function 1 may label a data item as "husband and wife" while labelling function 2 labels it as "brother and sister".
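As an illustration only (not part of the claimed method), labelling functions of this kind can be written as short Python functions. The label constants, the data fields, and the helper knowledge_base below are hypothetical; they merely show how two heuristic labelling functions can abstain or conflict on the same data item.

```python
# Hypothetical sketch of labelling functions in the style described above.
ABSTAIN, COUPLE, SIBLINGS = -1, 1, 0

# Illustrative external knowledge source: (person, child) -> relation.
knowledge_base = {("A", "D"): "father", ("B", "D"): "mother"}

def lf_shared_child(x):
    """Heuristic: if person1 is the father and person2 the mother of the same
    child, label the pair as a couple; otherwise abstain."""
    children_a = {c for (p, c), r in knowledge_base.items() if p == x["person1"] and r == "father"}
    children_b = {c for (p, c), r in knowledge_base.items() if p == x["person2"] and r == "mother"}
    return COUPLE if children_a & children_b else ABSTAIN

def lf_same_surname(x):
    """Noisy heuristic: a shared surname suggests siblings; may conflict with lf_shared_child."""
    return SIBLINGS if x["surname1"] == x["surname2"] else ABSTAIN

x = {"person1": "A", "person2": "B", "surname1": "Li", "surname2": "Li"}
print(lf_shared_child(x), lf_same_surname(x))  # conflicting initial labels: 1 0
```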
Step 103, computing the pairwise correlations of the labelling functions λ_j (j = 1, 2, …, n) from the initial labels, and constructing a generative model of the labelling functions λ_j (j = 1, 2, …, n) from the pairwise correlations.
The pairwise correlation of the labelling functions λ_j (j = 1, 2, …, n) refers to the dependency between two labelling functions. The method models the statistical dependencies between labelling functions in order to improve the estimation performance. For example, if two labelling functions express similar heuristic information, the generative model can include this dependency and thereby avoid "double counting". Pairwise correlation is the most common form of dependency, so the set C of labelling function pairs (j, k) selected to be modeled as correlated is used.
In this embodiment, the pairwise correlation of the labelling functions λ_j (j = 1, 2, …, n) is expressed with the indicator II{Λ_{i,j} = Λ_{i,k}}, where Λ is the initial label matrix formed by the initial labels, Λ_{i,j} = λ_j(x_i); Y is the true label matrix, Y = (y_1, y_2, …, y_m); and C is the set of labelling function pairs (j, k). II{Λ_{i,j} = Λ_{i,k}} takes one value when the condition in braces holds and another value when it does not. In this embodiment, the value is 1 when the condition in braces holds and 0 when it does not.
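As a reading aid, the correlation factor implied by the definitions above can presumably be written as follows; this is a reconstruction from the surrounding symbols, not a verbatim formula from the patent.

```latex
% Presumed form of the pairwise correlation factor for data item x_i and
% a correlated labelling function pair (j, k) in C.
\phi^{\mathrm{Corr}}_{i,(j,k)}(\Lambda, y_i) \;=\; \mathrm{II}\{\Lambda_{i,j} = \Lambda_{i,k}\},
\qquad (j,k) \in C .
```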
A generative model of the labelling functions λ_j (j = 1, 2, …, n) can then be constructed from the pairwise correlations.
The core operation of the method is to model and integrate the noisy signals provided by the set of labelling functions: each labelling function is modeled as a noisy "voter" that makes errors correlated with those of the other labelling functions.
For each data item x_i in the data set {x_i | i = 1, 2, …, m}, the initial label vector is Λ_i = (Λ_{i,1}, Λ_{i,2}, …, Λ_{i,n}). In this embodiment, in the generative model constructed from the pairwise correlations, Z_w is a normalization constant, φ_i(Λ, y_i) is the vector of pairwise correlation factors of the labelling functions λ_j (j = 1, 2, …, n) for each data item in the data set, and w is the parameter of the generative model to be estimated, with w ∈ R^{2n+|C|}.
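Given that Z_w is a normalization constant and w has 2n + |C| components, the generative model presumably takes the standard exponential-family (data-programming) form sketched below; this is a reconstruction from the defined symbols, not a verbatim formula from the patent.

```latex
% Presumed form of the generative model over the initial label matrix Lambda
% and the true labels Y.
p_w(\Lambda, Y) \;=\; Z_w^{-1}\,
\exp\!\Big( \sum_{i=1}^{m} w^{\top} \phi_i(\Lambda, y_i) \Big).
% The 2n + |C| components of w presumably weight n labeling-propensity
% factors, n accuracy factors, and |C| correlation factors.
```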
Step 104, estimating the probabilistic labels of the data set {x_i | i = 1, 2, …, m} with the generative model.
To learn the generative model without access to the true labels, the negative log marginal likelihood can be minimized given the observed initial label matrix Λ.
The estimated value ŵ of the parameter w of the generative model can be obtained by optimizing this objective with interleaved stochastic gradient descent steps and Gibbs sampling steps.
Once the estimate ŵ is determined, the generative model is determined; inputting the initial label matrix Λ into the generative model then yields the probabilistic labels of the data set {x_i | i = 1, 2, …, m}.
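Putting the last few paragraphs into symbols, the training objective and the resulting probabilistic labels presumably read as follows (a reconstruction under the same assumptions as the generative model above):

```latex
% Presumed training objective (negative log marginal likelihood over the
% unobserved true labels Y) and probabilistic labels.
\hat{w} \;=\; \arg\min_{w}\; -\log \sum_{Y} p_w(\Lambda, Y),
\qquad
\tilde{Y} \;=\; p_{\hat{w}}(Y \mid \Lambda).
```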
Steps 103 and 104 amount to denoising the initial labels Λ_{i,j} = λ_j(x_i) with the generative model to obtain the probabilistic labels of the data set {x_i | i = 1, 2, …, m}. The data set together with its probabilistic labels is the training data of the machine learning system.
Step 105, training the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model.
Training the discriminative model of the machine learning system on the probabilistic labels means using the data set with the probabilistic labels as training samples for the discriminative model.
When the discriminative model is trained on the probabilistic labels Ỹ, the ultimate goal is to train a discriminative model that generalizes beyond the information expressed by the labelling functions. The discriminative model h_θ can be trained on the probabilistic labels Ỹ by minimizing a noise-aware variant of its loss function l(h_θ(x_i), y), i.e. the loss expected with respect to Ỹ.
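In symbols, the noise-aware objective described above presumably reads as follows (reconstructed from the definitions of l, h_θ, and Ỹ; not a verbatim formula from the patent):

```latex
% Presumed noise-aware training objective for the discriminative model.
\hat{\theta} \;=\; \arg\min_{\theta}\; \frac{1}{m} \sum_{i=1}^{m}
\mathbb{E}_{y \sim \tilde{Y}_i}\big[\, l\big(h_\theta(x_i), y\big) \,\big].
```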
When the discriminative model is trained, the parameters of the discriminative model are adjusted so that this noise-aware loss reaches its minimum. The training process may use the RMSprop algorithm, an improved stochastic gradient descent algorithm. RMSprop is well known in the art and is not described in detail here.
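A minimal sketch of such noise-aware training with RMSprop, assuming a PyTorch-style setup and a binary task; the model architecture, tensor names, and random stand-in data are illustrative assumptions, not part of the patent.

```python
import torch
import torch.nn as nn

# Illustrative assumption: x is an (m, d) feature tensor and y_prob holds the
# probabilistic labels p(y = 1 | Lambda) produced by the generative model.
m, d = 1000, 16
x = torch.randn(m, d)
y_prob = torch.rand(m)  # stand-in for the estimated probabilistic labels

h_theta = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.RMSprop(h_theta.parameters(), lr=1e-3)

for epoch in range(10):
    optimizer.zero_grad()
    logits = h_theta(x).squeeze(-1)
    # Noise-aware loss: expected binary cross-entropy under the probabilistic labels.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, y_prob)
    loss.backward()
    optimizer.step()
```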
Step 106, inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In an application scenario where the data classification method is applied to image classification, an image to be classified is input into the trained discriminative model to obtain the category of the image to be classified, for example the user, the object, or the face attribute corresponding to the image to be classified.
In an application scenario where the data classification method is applied to text classification, a text to be classified is input into the trained discriminative model to obtain the category of the text to be classified, for example its sentiment tendency (such as a positive or negative sentiment tendency), its subject matter, or its technical field.
In an application scenario where the data classification method is applied to speech classification, speech to be classified is input into the trained discriminative model to obtain the category of the speech to be classified, for example the user, the age bracket, or the emotion corresponding to the speech to be classified.
The data classification method of Embodiment 1 obtains a data set to be labeled {x_i | i = 1, 2, …, m}; labels the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n; computes the pairwise correlations of the labelling functions from the initial labels and constructs a generative model of the labelling functions from the pairwise correlations; estimates probabilistic labels of the data set with the generative model; trains the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model; and inputs data to be classified into the trained discriminative model to obtain the category of the data to be classified. This embodiment can quickly generate the training data required by the discriminative model of a machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
In another embodiment, before labeling the data set {x_i | i = 1, 2, …, m} with the labelling functions λ_j, the method may further include: preprocessing the data set {x_i | i = 1, 2, …, m}.
Preprocessing the data set {x_i | i = 1, 2, …, m} may include filling missing values in the data set {x_i | i = 1, 2, …, m}.
A K-nearest-neighbor algorithm may be used: the K data items nearest to the data item with a missing value are determined (for example, according to Euclidean distance), and the missing value is estimated as a weighted average of the values of those K data items.
Alternatively, a prediction model may be used to predict the missing value; if the missing value is numeric, it may be filled with the mean, and if it is non-numeric, it may be filled with the mode.
Alternatively, missing values may be substituted by the mean. Preferably, because mean substitution rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the filling value obtained by mean substitution by a preset sampling factor, and using the resulting value as the final filling value. The preset sampling factor is a sampling factor set in advance and is greater than 1.
The missing values may also be filled by other methods, for example by regression fitting or by interpolation.
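A minimal sketch of the K-nearest-neighbor filling described above, using scikit-learn's KNNImputer; the feature matrix, K = 3, and the distance weighting are illustrative assumptions, not part of the patent.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative feature matrix with a missing value (np.nan).
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 1.9, 3.0],
    [0.9, 2.1, 2.8],
    [5.0, 5.2, 6.1],
])

# K-nearest-neighbor filling: each missing entry is replaced by a
# distance-weighted average over the K nearest rows, as described above.
imputer = KNNImputer(n_neighbors=3, weights="distance")
X_filled = imputer.fit_transform(X)
print(X_filled)
```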
Preprocessing the data set {x_i | i = 1, 2, …, m} may also include correcting outliers in the data set {x_i | i = 1, 2, …, m}. An outlier is a value that deviates significantly from the other data.
The method for correcting outliers may be the same as the method for filling missing values. For example, a K-nearest-neighbor algorithm may be used: the K data items nearest to the data item with an outlier are determined (for example, according to Euclidean distance), and the correction value for the outlier is estimated as a weighted average of the values of those K data items. Alternatively, a prediction model may be used to predict the correction value; if the outlier is numeric, it may be corrected with the mean, and if it is non-numeric, it may be corrected with the mode.
Alternatively, outliers may be replaced by the mean. Preferably, because mean replacement rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the correction value obtained by mean replacement by a preset sampling factor, and using the resulting value as the final correction value. The preset sampling factor is a sampling factor set in advance and is greater than 1.
The outliers may also be corrected by other methods, for example by regression fitting or by interpolation.
The method for correcting outliers may also differ from the method for filling missing values.
Preprocessing the data set may also include directly discarding data items with missing values and/or data items with outliers. Directly discarding such data items ensures that the data set is clean.
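A minimal sketch of the mean-replacement-with-sampling-factor correction described above; the z-score rule used to flag outliers and the factor 1.05 are illustrative assumptions, not part of the patent.

```python
import numpy as np

def correct_outliers_mean(values, z_thresh=3.0, sampling_factor=1.05):
    """Replace outliers (|z-score| > z_thresh) with the mean of the remaining
    values multiplied by a preset sampling factor (> 1), as described above."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    mask = np.abs(z) > z_thresh
    inlier_mean = values[~mask].mean()
    corrected = values.copy()
    corrected[mask] = inlier_mean * sampling_factor
    return corrected

print(correct_outliers_mean([1.0, 1.2, 0.9, 1.1, 50.0], z_thresh=1.5))
```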
Embodiment two
Fig. 2 is a structural diagram of the data classification device provided by Embodiment 2 of the present invention. The data classification device 20 is applied to a machine learning system and is used to generate training data, train the discriminative model of the machine learning system on the generated training data, and classify data to be classified using the trained discriminative model. The data classification device 20 can quickly generate the training data required by the discriminative model of the machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
As shown in Fig. 2, the data classification device 20 may include an obtaining module 201, a labeling module 202, a construction module 203, an estimation module 204, a training module 205, and a classification module 206.
The obtaining module 201 is configured to obtain a data set to be labeled {x_i | i = 1, 2, …, m}.
The data set to be labeled includes multiple data items that need to be labeled. After each data item in the data set is labeled, the label of the data item is obtained and used to train the discriminative model.
In an application scenario where the data classification device is applied to image classification, the data set to be labeled may be an image set. For example, the data set to be labeled includes images of different users, and the user corresponding to each image is to be labeled. As another example, the data set to be labeled includes images of different objects (such as a pen, a ball, or a book), and the object corresponding to each image is to be labeled. As yet another example, the data set to be labeled includes images of different face attributes (such as race, gender, age, or expression), and the face attribute corresponding to each image is to be labeled.
In an application scenario where the data classification device is applied to text classification, the data set to be labeled may be a text set. For example, the data set to be labeled includes texts of different sentiment tendencies, and the sentiment tendency corresponding to each text is to be labeled. As another example, the data set to be labeled includes texts of different subject matters, and the subject matter corresponding to each text is to be labeled. As yet another example, the data set to be labeled includes texts of different technical fields (such as physics, chemistry, or machinery), and the technical field corresponding to each text is to be labeled.
In an application scenario where the data classification device is applied to speech classification, the data set to be labeled may be a speech set. For example, the data set to be labeled includes speech of multiple different users, and the user corresponding to each speech sample is to be labeled. As another example, the data set to be labeled includes speech of users of multiple different age brackets, and the age bracket corresponding to each speech sample is to be labeled. As yet another example, the data set to be labeled includes speech of multiple different emotions, and the emotion corresponding to each speech sample is to be labeled.
The data set to be labeled may consist of data collected in real time. For example, person images may be captured in real time, and the captured person images are used as the data set to be labeled.
Alternatively, the data set to be labeled may be obtained from a preset data source. For example, data may be read from a preset database, in which a large amount of data (such as images) has been stored in advance, to form the data set to be labeled.
Alternatively, the data set to be labeled may be received from user input. For example, multiple images input by a user may be received and used as the data set to be labeled.
The labeling module 202 is configured to label the data set {x_i | i = 1, 2, …, m} with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), where i = 1, 2, …, m and j = 1, 2, …, n.
A labelling function is a function that expresses a mapping from data to labels: it receives a data item and outputs a label for that data item. A labelling function is a black-box function and can be expressed as λ: X → Y ∪ {∅}, where λ denotes the labelling function, X denotes the data, Y denotes the initial label corresponding to X, and ∅ indicates that the labelling function abstains.
Compared with manual labeling of training data, labelling functions make it possible to generate the initial labels from various weak supervision sources (such as heuristic information or external knowledge bases). For example, suppose an image contains two persons, A and B, and their relationship is to be labeled. If it is known that A is the father of D and that B is the mother of D, then according to the heuristic rule "if X is the father of Z and Y is the mother of Z, then X and Y are husband and wife", the annotation result (i.e. the initial label) that A and B are husband and wife is obtained.
Labelling functions are not required to be accurate. That is, the initial labels obtained from labelling functions are unreliable. Unreliability may include incorrect labels, multiple labels, insufficient labels, partial labels, and so on.
Multiple labelling functions may be defined in advance as needed, for example 6 labelling functions.
Different labelling functions are allowed to conflict on the same data item. For example, labelling function 1 may label a data item as "husband and wife" while labelling function 2 labels it as "brother and sister".
The construction module 203 is configured to compute the pairwise correlations of the labelling functions λ_j (j = 1, 2, …, n) from the initial labels, and to construct a generative model of the labelling functions λ_j (j = 1, 2, …, n) from the pairwise correlations.
The pairwise correlation of the labelling functions λ_j (j = 1, 2, …, n) refers to the dependency between two labelling functions. The method models the statistical dependencies between labelling functions in order to improve the estimation performance. For example, if two labelling functions express similar heuristic information, the generative model can include this dependency and thereby avoid "double counting". Pairwise correlation is the most common form of dependency, so the set C of labelling function pairs (j, k) selected to be modeled as correlated is used.
In this embodiment, the pairwise correlation of the labelling functions λ_j (j = 1, 2, …, n) is expressed with the indicator II{Λ_{i,j} = Λ_{i,k}}, where Λ is the initial label matrix formed by the initial labels, Λ_{i,j} = λ_j(x_i); Y is the true label matrix, Y = (y_1, y_2, …, y_m); and C is the set of labelling function pairs (j, k). II{Λ_{i,j} = Λ_{i,k}} takes one value when the condition in braces holds and another value when it does not. In this embodiment, the value is 1 when the condition in braces holds and 0 when it does not.
A generative model of the labelling functions λ_j (j = 1, 2, …, n) can then be constructed from the pairwise correlations.
The core operation of the method is to model and integrate the noisy signals provided by the set of labelling functions: each labelling function is modeled as a noisy "voter" that makes errors correlated with those of the other labelling functions.
For each data item x_i in the data set {x_i | i = 1, 2, …, m}, the initial label vector is Λ_i = (Λ_{i,1}, Λ_{i,2}, …, Λ_{i,n}). In this embodiment, in the generative model constructed from the pairwise correlations, Z_w is a normalization constant, φ_i(Λ, y_i) is the vector of pairwise correlation factors of the labelling functions λ_j (j = 1, 2, …, n) for each data item in the data set, and w is the parameter of the generative model to be estimated, with w ∈ R^{2n+|C|}.
The estimation module 204 is configured to estimate the probabilistic labels of the data set {x_i | i = 1, 2, …, m} with the generative model.
To learn the generative model without access to the true labels, the negative log marginal likelihood can be minimized given the observed initial label matrix Λ.
The estimated value ŵ of the parameter w of the generative model can be obtained by optimizing this objective with interleaved stochastic gradient descent steps and Gibbs sampling steps.
Once the estimate ŵ is determined, the generative model is determined; inputting the initial label matrix Λ into the generative model then yields the probabilistic labels of the data set {x_i | i = 1, 2, …, m}.
The construction module 203 and the estimation module 204 thus denoise the initial labels Λ_{i,j} = λ_j(x_i) with the generative model to obtain the probabilistic labels of the data set {x_i | i = 1, 2, …, m}. The data set together with its probabilistic labels is the training data of the machine learning system.
The training module 205 is configured to train the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model.
Training the discriminative model of the machine learning system on the probabilistic labels means using the data set with the probabilistic labels as training samples for the discriminative model.
When the discriminative model is trained on the probabilistic labels Ỹ, the ultimate goal is to train a discriminative model that generalizes beyond the information expressed by the labelling functions. The discriminative model h_θ can be trained on the probabilistic labels Ỹ by minimizing a noise-aware variant of its loss function l(h_θ(x_i), y), i.e. the loss expected with respect to Ỹ.
When the discriminative model is trained, the parameters of the discriminative model are adjusted so that this noise-aware loss reaches its minimum. The training process may use the RMSprop algorithm, an improved stochastic gradient descent algorithm. RMSprop is well known in the art and is not described in detail here.
The classification module 206 is configured to input data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In an application scenario where the data classification device is applied to image classification, an image to be classified is input into the trained discriminative model to obtain the category of the image to be classified, for example the user, the object, or the face attribute corresponding to the image to be classified.
In an application scenario where the data classification device is applied to text classification, a text to be classified is input into the trained discriminative model to obtain the category of the text to be classified, for example its sentiment tendency (such as a positive or negative sentiment tendency), its subject matter, or its technical field.
In an application scenario where the data classification device is applied to speech classification, speech to be classified is input into the trained discriminative model to obtain the category of the speech to be classified, for example the user, the age bracket, or the emotion corresponding to the speech to be classified.
The data classification device 20 of Embodiment 2 obtains a data set to be labeled {x_i | i = 1, 2, …, m}; labels the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n; computes the pairwise correlations of the labelling functions from the initial labels and constructs a generative model of the labelling functions from the pairwise correlations; estimates probabilistic labels of the data set with the generative model; trains the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model; and inputs data to be classified into the trained discriminative model to obtain the category of the data to be classified. This embodiment can quickly generate the training data required by the discriminative model of a machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
In another embodiment, the data classification device 20 may further include a preprocessing module, configured to preprocess the data set {x_i | i = 1, 2, …, m}.
Preprocessing the data set {x_i | i = 1, 2, …, m} may include filling missing values in the data set {x_i | i = 1, 2, …, m}.
A K-nearest-neighbor algorithm may be used: the K data items nearest to the data item with a missing value are determined (for example, according to Euclidean distance), and the missing value is estimated as a weighted average of the values of those K data items.
Alternatively, a prediction model may be used to predict the missing value; if the missing value is numeric, it may be filled with the mean, and if it is non-numeric, it may be filled with the mode.
Alternatively, missing values may be substituted by the mean. Preferably, because mean substitution rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the filling value obtained by mean substitution by a preset sampling factor, and using the resulting value as the final filling value. The preset sampling factor is a sampling factor set in advance and is greater than 1.
The missing values may also be filled by other methods, for example by regression fitting or by interpolation.
Preprocessing the data set {x_i | i = 1, 2, …, m} may also include correcting outliers in the data set {x_i | i = 1, 2, …, m}. An outlier is a value that deviates significantly from the other data.
The method for correcting outliers may be the same as the method for filling missing values. For example, a K-nearest-neighbor algorithm may be used: the K data items nearest to the data item with an outlier are determined (for example, according to Euclidean distance), and the correction value for the outlier is estimated as a weighted average of the values of those K data items. Alternatively, a prediction model may be used to predict the correction value; if the outlier is numeric, it may be corrected with the mean, and if it is non-numeric, it may be corrected with the mode.
Alternatively, outliers may be replaced by the mean. Preferably, because mean replacement rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the correction value obtained by mean replacement by a preset sampling factor, and using the resulting value as the final correction value. The preset sampling factor is a sampling factor set in advance and is greater than 1.
The outliers may also be corrected by other methods, for example by regression fitting or by interpolation.
The method for correcting outliers may also differ from the method for filling missing values.
Preprocessing the data set may also include directly discarding data items with missing values and/or data items with outliers. Directly discarding such data items ensures that the data set is clean.
Embodiment three
This embodiment provides a computer storage medium in which a computer program is stored. When the computer program is executed by a processor, the steps in the above data classification method embodiment are realized, for example steps 101-106 shown in Fig. 1:
Step 101, obtaining a data set to be labeled {x_i | i = 1, 2, …, m};
Step 102, labeling the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
Step 103, computing the pairwise correlations of the labelling functions from the initial labels, and constructing a generative model of the labelling functions from the pairwise correlations;
Step 104, estimating probabilistic labels of the data set with the generative model;
Step 105, training the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
Step 106, inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
Alternatively, when the computer program is executed by a processor, the functions of the modules in the above device embodiment are realized, for example modules 201-206 in Fig. 2:
the obtaining module 201, configured to obtain a data set to be labeled {x_i | i = 1, 2, …, m};
the labeling module 202, configured to label the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
the construction module 203, configured to compute the pairwise correlations of the labelling functions from the initial labels and to construct a generative model of the labelling functions from the pairwise correlations;
the estimation module 204, configured to estimate probabilistic labels of the data set with the generative model;
the training module 205, configured to train the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
the classification module 206, configured to input data to be classified into the trained discriminative model to obtain the category of the data to be classified.
Example IV
Fig. 3 is a schematic diagram of the computer apparatus provided by Embodiment 4 of the present invention. The computer apparatus 30 includes a memory 301, a processor 302, and a computer program 303 (for example, a data classification program) that is stored in the memory 301 and can be run on the processor 302. When executing the computer program 303, the processor 302 realizes the steps in the above data classification method embodiment, for example steps 101-106 shown in Fig. 1:
Step 101, obtaining a data set to be labeled {x_i | i = 1, 2, …, m};
Step 102, labeling the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
Step 103, computing the pairwise correlations of the labelling functions from the initial labels, and constructing a generative model of the labelling functions from the pairwise correlations;
Step 104, estimating probabilistic labels of the data set with the generative model;
Step 105, training the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
Step 106, inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
Alternatively, when the computer program is executed by the processor, the functions of the modules in the above device embodiment are realized, for example modules 201-206 in Fig. 2:
the obtaining module 201, configured to obtain a data set to be labeled {x_i | i = 1, 2, …, m};
the labeling module 202, configured to label the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
the construction module 203, configured to compute the pairwise correlations of the labelling functions from the initial labels and to construct a generative model of the labelling functions from the pairwise correlations;
the estimation module 204, configured to estimate probabilistic labels of the data set with the generative model;
the training module 205, configured to train the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
the classification module 206, configured to input data to be classified into the trained discriminative model to obtain the category of the data to be classified.
Illustratively, the computer program 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments describe the execution process of the computer program 303 in the computer apparatus 30. For example, the computer program 303 may be divided into the obtaining module 201, the labeling module 202, the construction module 203, the estimation module 204, the training module 205, and the classification module 206 in Fig. 2; see Embodiment 2 for the specific functions of the modules.
The computer installation 30 can be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that the schematic Fig. 3 is only an example of the computer installation 30 and does not constitute a limitation on the computer installation 30; it may include more or fewer components than illustrated, combine certain components, or use different components. For example, the computer installation 30 may further include input/output devices, network access devices, buses, and the like.
The processor 302 can be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor can be a microprocessor, or the processor 302 can be any conventional processor. The processor 302 is the control center of the computer installation 30 and connects the various parts of the entire computer installation 30 through various interfaces and lines.
The memory 301 can be used to store the computer program 303. The processor 302 implements the various functions of the computer installation 30 by running or executing the computer programs or modules stored in the memory 301 and by calling the data stored in the memory 301. The memory 301 can mainly include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area can store data created according to the use of the computer installation 30 (such as audio data, a phone book, etc.). In addition, the memory 301 can include a high-speed random access memory and can also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules of the computer installation 30 are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer storage medium. Based on this understanding, all or part of the processes in the above method embodiments of the present invention can also be completed by instructing the relevant hardware through a computer program. The computer program can be stored in a computer storage medium, and when the computer program is executed by a processor, the steps of the above method embodiments can be implemented. The computer program includes computer program code, which can be in source code form, object code form, an executable file, certain intermediate forms, or the like. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device, and method can be implemented in other ways. For example, the device embodiments described above are merely exemplary; for instance, the division into modules is only a logical function division, and there may be other division manners in actual implementation.
The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules; they can be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention can be integrated into one processing module, or each module can exist alone physically, or two or more modules can be integrated into one module. The above integrated module can be implemented in the form of hardware, or in the form of hardware plus software function modules.
The above integrated module implemented in the form of a software function module can be stored in a computer storage medium. The above software function module is stored in a storage medium and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device, etc.) or a processor (processor) to execute part of the steps of the methods of the embodiments of the present invention.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments and can be implemented in other specific forms without departing from the spirit or essential attributes of the present invention. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; all changes that fall within the meaning and scope of the equivalent elements of the claims are therefore intended to be included in the present invention. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices stated in the system claims can also be implemented by one module or device through software or hardware. Words such as "first" and "second" are used to indicate names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the present invention can be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present invention.
Claims (10)
1. A data classification method applied to a machine learning system, characterized in that the method includes:
obtaining a data set to be labeled {x_i | i = 1, 2, ..., m};
labeling the data set with labelling functions λ_j, j = 1, 2, ..., n, to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
calculating pairwise correlations of the labelling functions from the initial labels, and constructing a generative model of the labelling functions from the pairwise correlations;
estimating probabilistic labels of the data set according to the generative model;
training a discriminative model of the machine learning system according to the probabilistic labels to obtain a trained discriminative model;
inputting data to be classified into the trained discriminative model to obtain a category of the data to be classified.
2. The method according to claim 1, characterized in that the generative model is:
where Λ denotes the initial label matrix formed by the initial labels, Y denotes the true label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, ..., m, are the pairwise correlation factors of the labelling functions for each data item in the data set, and w is the parameter of the generative model to be determined, w ∈ R^(2n+|C|).
3. The method according to claim 2, characterized in that the pairwise correlations are:
where the indicator term denotes the value taken according to whether the condition in the braces { } holds or does not hold.
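Since the formulas are referenced here only through the variable names above, the following is a plausible reconstruction based on the standard data-programming formulation; the indicator factors are an assumption consistent with w ∈ R^(2n+|C|), not the application's own expressions.

```latex
% Assumed form of the generative model of claim 2 and the pairwise factors of claim 3.
\[
  p_w(\Lambda, Y) \;=\; Z_w^{-1} \exp\!\Big( \sum_{i=1}^{m} w^{\top} \phi_i(\Lambda, y_i) \Big),
  \qquad w \in \mathbb{R}^{\,2n + |C|},
\]
\[
  \phi_{i,j}^{\mathrm{Lab}}(\Lambda, y_i) = \mathbf{1}\{\Lambda_{i,j} \neq 0\}, \qquad
  \phi_{i,j}^{\mathrm{Acc}}(\Lambda, y_i) = \mathbf{1}\{\Lambda_{i,j} = y_i\}, \qquad
  \phi_{i,j,k}^{\mathrm{Corr}}(\Lambda, y_i) = \mathbf{1}\{\Lambda_{i,j} = \Lambda_{i,k}\}, \ (j,k) \in C,
\]
```

Here 1{·} equals 1 when the condition in braces holds and 0 otherwise; under this assumption, the n labelling-propensity factors, n accuracy factors, and |C| correlation factors account for the 2n + |C| entries of w.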
4. The method according to claim 1, characterized in that:
the data set to be labeled is an image set, and the data to be classified is an image to be classified; or
the data set to be labeled is a text set, and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set, and the data to be classified is speech to be classified.
5. The method according to claim 4, characterized in that inputting the data to be classified into the trained discriminative model to obtain the category of the data to be classified includes:
inputting the image to be classified into the trained discriminative model to obtain a user, object, or face attribute corresponding to the image to be classified;
inputting the text to be classified into the trained discriminative model to obtain an emotional tendency, subject matter, or technical field corresponding to the text to be classified;
inputting the speech to be classified into the trained discriminative model to obtain a user, age group, or emotion corresponding to the speech to be classified.
6. The method according to claim 1, characterized in that training the discriminative model of the machine learning system according to the probabilistic labels includes:
training the discriminative model on the probabilistic labels by minimizing a noise-aware variant of the loss function of the discriminative model.
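One common reading of such noise-aware training, offered here as an assumption rather than as the claimed formulation, is to minimize the loss in expectation over the probabilistic labels:

```latex
% Assumed noise-aware training objective over the probabilistic labels.
\[
  \hat{\theta} \;=\; \arg\min_{\theta} \; \frac{1}{m} \sum_{i=1}^{m}
    \mathbb{E}_{y \sim \tilde{p}_{\hat{w}}(\,\cdot \mid \Lambda_i)}
    \big[\, \ell\big(f_{\theta}(x_i),\, y\big) \,\big]
\]
```

where f_θ is the discriminative model, ℓ its loss function, and p̃_ŵ(y | Λ_i) the probabilistic label of x_i produced by the generative model.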
7. The method according to any one of claims 1-6, characterized in that before labeling the data set with the labelling functions λ_j, j = 1, 2, ..., n, the method further includes:
filling missing values in the data set; and/or
correcting outliers in the data set.
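A minimal sketch of this preprocessing, assuming column-wise mean/mode filling and 3-sigma clipping (the strategies and names below are illustrative assumptions, not prescribed by the claim):

```python
# Hypothetical preprocessing: fill missing values, then correct (clip) outliers.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, z_thresh: float = 3.0) -> pd.DataFrame:
    df = df.copy()
    for col in df.select_dtypes(include=[np.number]).columns:
        df[col] = df[col].fillna(df[col].mean())           # fill missing values
        mean, std = df[col].mean(), df[col].std()
        lower, upper = mean - z_thresh * std, mean + z_thresh * std
        df[col] = df[col].clip(lower, upper)               # correct outliers
    for col in df.select_dtypes(exclude=[np.number]).columns:
        mode = df[col].mode()
        if not mode.empty:
            df[col] = df[col].fillna(mode.iloc[0])         # fill missing categories
    return df
```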
8. A data classification device applied to a machine learning system, characterized in that the device includes:
an acquisition module, for obtaining a data set to be labeled {x_i | i = 1, 2, ..., m};
a labeling module, for labeling the data set with labelling functions λ_j, j = 1, 2, ..., n, to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
a construction module, for calculating pairwise correlations of the labelling functions from the initial labels and constructing a generative model of the labelling functions from the pairwise correlations;
an estimation module, for estimating probabilistic labels of the data set according to the generative model;
a training module, for training a discriminative model of the machine learning system according to the probabilistic labels to obtain a trained discriminative model;
a classification module, for inputting data to be classified into the trained discriminative model to obtain a category of the data to be classified.
9. A computer installation, characterized in that the computer installation includes a processor, and the processor is configured to execute a computer program stored in a memory to implement the data classification method according to any one of claims 1-7.
10. A computer storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the data classification method according to any one of claims 1-7 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910310574.1A CN110196908A (en) | 2019-04-17 | 2019-04-17 | Data classification method, device, computer installation and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110196908A true CN110196908A (en) | 2019-09-03 |
Family
ID=67752025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910310574.1A Pending CN110196908A (en) | 2019-04-17 | 2019-04-17 | Data classification method, device, computer installation and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110196908A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104463202A (en) * | 2014-11-28 | 2015-03-25 | 苏州大学 | Multi-class image semi-supervised classifying method and system |
CN107220600A (en) * | 2017-05-17 | 2017-09-29 | 清华大学深圳研究生院 | A kind of Picture Generation Method and generation confrontation network based on deep learning |
US20190080164A1 (en) * | 2017-09-14 | 2019-03-14 | Chevron U.S.A. Inc. | Classification of character strings using machine-learning |
CN108229543A (en) * | 2017-12-22 | 2018-06-29 | 中国科学院深圳先进技术研究院 | Image classification design methods and device |
Non-Patent Citations (2)
Title |
---|
KAI FAN et al.: "Learning a generative classifier from label proportions", NEUROCOMPUTING, pages 47-55 *
蒋俊钊; 程良伦; 李全杰: "Multi-label classification algorithm of convolutional neural networks based on label correlation" (基于标签相关性的卷积神经网络多标签分类算法), 工业控制计算机 (Industrial Control Computer), no. 07, pages 108-109 *
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112488141B (en) * | 2019-09-12 | 2023-04-07 | 中移(苏州)软件技术有限公司 | Method and device for determining application range of Internet of things card and computer readable storage medium |
CN112488141A (en) * | 2019-09-12 | 2021-03-12 | 中移(苏州)软件技术有限公司 | Method and device for determining application range of Internet of things card and computer readable storage medium |
CN112529024A (en) * | 2019-09-17 | 2021-03-19 | 株式会社理光 | Sample data generation method and device and computer readable storage medium |
CN110648325B (en) * | 2019-09-29 | 2022-05-17 | 北部湾大学 | Nixing pottery big data analysis genuine product checking system |
CN110648325A (en) * | 2019-09-29 | 2020-01-03 | 北部湾大学 | Nixing pottery big data analysis genuine product checking system |
CN112825144A (en) * | 2019-11-20 | 2021-05-21 | 深圳云天励飞技术有限公司 | Picture labeling method and device, electronic equipment and storage medium |
CN112825144B (en) * | 2019-11-20 | 2024-06-07 | 深圳云天励飞技术有限公司 | Picture marking method and device, electronic equipment and storage medium |
CN111291823A (en) * | 2020-02-24 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Fusion method and device of classification models, electronic equipment and storage medium |
CN111291823B (en) * | 2020-02-24 | 2023-08-18 | 腾讯科技(深圳)有限公司 | Fusion method and device of classification model, electronic equipment and storage medium |
CN111582360A (en) * | 2020-05-06 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for labeling data |
CN111582360B (en) * | 2020-05-06 | 2023-08-15 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for labeling data |
CN111860572B (en) * | 2020-06-04 | 2024-01-26 | 北京百度网讯科技有限公司 | Data set distillation method, device, electronic equipment and storage medium |
CN111860572A (en) * | 2020-06-04 | 2020-10-30 | 北京百度网讯科技有限公司 | Data set distillation method, device, electronic equipment and storage medium |
CN112102062A (en) * | 2020-07-24 | 2020-12-18 | 北京淇瑀信息科技有限公司 | Risk assessment method and device based on weak supervised learning and electronic equipment |
CN112115267A (en) * | 2020-09-28 | 2020-12-22 | 平安科技(深圳)有限公司 | Training method, device and equipment of text classification model and storage medium |
CN112115267B (en) * | 2020-09-28 | 2023-07-07 | 平安科技(深圳)有限公司 | Training method, device, equipment and storage medium of text classification model |
CN111967541A (en) * | 2020-10-21 | 2020-11-20 | 上海冰鉴信息科技有限公司 | Data classification method and device based on multi-platform samples |
CN111967541B (en) * | 2020-10-21 | 2021-01-05 | 上海冰鉴信息科技有限公司 | Data classification method and device based on multi-platform samples |
CN112199502B (en) * | 2020-10-26 | 2024-03-15 | 网易(杭州)网络有限公司 | Verse generation method and device based on emotion, electronic equipment and storage medium |
CN112199502A (en) * | 2020-10-26 | 2021-01-08 | 网易(杭州)网络有限公司 | Emotion-based poetry sentence generation method and device, electronic equipment and storage medium |
CN112651447B (en) * | 2020-12-29 | 2023-09-26 | 广东电网有限责任公司电力调度控制中心 | Ontology-based resource classification labeling method and system |
CN112651447A (en) * | 2020-12-29 | 2021-04-13 | 广东电网有限责任公司电力调度控制中心 | Resource classification labeling method and system based on ontology |
CN112925958A (en) * | 2021-02-05 | 2021-06-08 | 深圳力维智联技术有限公司 | Multi-source heterogeneous data adaptation method, device, equipment and readable storage medium |
WO2022174496A1 (en) * | 2021-02-20 | 2022-08-25 | 平安科技(深圳)有限公司 | Data annotation method and apparatus based on generative model, and device and storage medium |
CN112860919A (en) * | 2021-02-20 | 2021-05-28 | 平安科技(深圳)有限公司 | Data labeling method, device and equipment based on generative model and storage medium |
CN113590677A (en) * | 2021-07-14 | 2021-11-02 | 上海淇玥信息技术有限公司 | Data processing method and device and electronic equipment |
CN113344916A (en) * | 2021-07-21 | 2021-09-03 | 上海媒智科技有限公司 | Method, system, terminal, medium and application for acquiring machine learning model capability |
CN113761925B (en) * | 2021-07-23 | 2022-10-28 | 中国科学院自动化研究所 | Named entity identification method, device and equipment based on noise perception mechanism |
CN113761925A (en) * | 2021-07-23 | 2021-12-07 | 中国科学院自动化研究所 | Named entity identification method, device and equipment based on noise perception mechanism |
CN113836118B (en) * | 2021-11-24 | 2022-03-08 | 亿海蓝(北京)数据技术股份公司 | Ship static data supplementing method and device, electronic equipment and readable storage medium |
CN113836118A (en) * | 2021-11-24 | 2021-12-24 | 亿海蓝(北京)数据技术股份公司 | Ship static data supplementing method and device, electronic equipment and readable storage medium |
CN114064973B (en) * | 2022-01-11 | 2022-05-03 | 人民网科技(北京)有限公司 | Video news classification model establishing method, classification method, device and equipment |
CN114064973A (en) * | 2022-01-11 | 2022-02-18 | 人民网科技(北京)有限公司 | Video news classification model establishing method, classification method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110196908A (en) | Data classification method, device, computer installation and storage medium | |
CN113822494B (en) | Risk prediction method, device, equipment and storage medium | |
CN110019770A (en) | The method and apparatus of train classification models | |
CN111324696B (en) | Entity extraction method, entity extraction model training method, device and equipment | |
WO2021073390A1 (en) | Data screening method and apparatus, device and computer-readable storage medium | |
CN109599187A (en) | A kind of online interrogation point examines method, server, terminal, equipment and medium | |
CN109902672A (en) | Image labeling method and device, storage medium, computer equipment | |
CN110442859A (en) | Method, device and equipment for generating labeled corpus and storage medium | |
CN110458600A (en) | Portrait model training method, device, computer equipment and storage medium | |
CN107807958A (en) | A kind of article list personalized recommendation method, electronic equipment and storage medium | |
WO2023029507A1 (en) | Data analysis-based service distribution method and apparatus, device, and storage medium | |
CN110807086A (en) | Text data labeling method and device, storage medium and electronic equipment | |
CA3149895A1 (en) | Machine learning system for summarizing tax documents with non-structured portions | |
CN110008365A (en) | A kind of image processing method, device, equipment and readable storage medium storing program for executing | |
JP7347179B2 (en) | Methods, devices and computer programs for extracting web page content | |
CN112287656A (en) | Text comparison method, device, equipment and storage medium | |
CN104077408B (en) | Extensive across media data distributed semi content of supervision method for identifying and classifying and device | |
CN108735198A (en) | Phoneme synthesizing method, device based on medical conditions data and electronic equipment | |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment | |
CN112101488B (en) | Training method and device for machine learning model and storage medium | |
CN113837307A (en) | Data similarity calculation method and device, readable medium and electronic equipment | |
CN108629381A (en) | Crowd's screening technique based on big data and terminal device | |
CN109657710B (en) | Data screening method and device, server and storage medium | |
CN116978087A (en) | Model updating method, device, equipment, storage medium and program product | |
WO2021135330A1 (en) | Image sample selection method and related apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |