CN110196908A - Data classification method, device, computer apparatus and storage medium - Google Patents
Data classification method, device, computer apparatus and storage medium
- Publication number
- CN110196908A CN110196908A CN201910310574.1A CN201910310574A CN110196908A CN 110196908 A CN110196908 A CN 110196908A CN 201910310574 A CN201910310574 A CN 201910310574A CN 110196908 A CN110196908 A CN 110196908A
- Authority
- CN
- China
- Prior art keywords
- data
- data set
- sorted
- training
- discrimination model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 101
- 238000009434 installation Methods 0.000 title claims abstract description 24
- 238000003860 storage Methods 0.000 title claims abstract description 18
- 238000012549 training Methods 0.000 claims abstract description 90
- 238000002372 labelling Methods 0.000 claims abstract description 86
- 230000006870 function Effects 0.000 claims description 116
- 238000010801 machine learning Methods 0.000 claims description 36
- 238000004590 computer program Methods 0.000 claims description 22
- 230000008451 emotion Effects 0.000 claims description 18
- 239000011159 matrix material Substances 0.000 claims description 14
- 230000008447 perception Effects 0.000 claims description 7
- 230000004069 differentiation Effects 0.000 claims description 6
- 238000005070 sampling Methods 0.000 description 16
- 238000004422 calculation algorithm Methods 0.000 description 10
- 238000012935 Averaging Methods 0.000 description 8
- 238000012937 correction Methods 0.000 description 6
- 238000012545 processing Methods 0.000 description 6
- 238000010586 diagram Methods 0.000 description 4
- 238000007781 pre-processing Methods 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 3
- 230000008569 process Effects 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000013480 data collection Methods 0.000 description 2
- 238000013507 mapping Methods 0.000 description 2
- 238000010606 normalization Methods 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 238000006467 substitution reaction Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 238000013499 data model Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
- 230000003993 interaction Effects 0.000 description 1
- 238000004519 manufacturing process Methods 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
- 238000013526 transfer learning Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/55—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a data classification method, a data classification device, a computer apparatus, and a storage medium. The method includes: obtaining a data set to be labeled; labeling the data set with labelling functions to obtain initial labels of the data set; computing the pairwise correlations of the labelling functions from the initial labels and constructing a generative model of the labelling functions from those pairwise correlations; estimating probabilistic labels of the data set with the generative model; training a discriminative model on the probabilistic labels to obtain a trained discriminative model; and inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified. The present invention improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
Description
Technical field
The present invention relates to the field of machine learning, and in particular to a data classification method, a data classification device, a computer apparatus, and a computer storage medium.
Background technique
With the rapid development of artificial intelligence, machine learning techniques (especially deep learning) have been applied in many industries. As a result, the labeling of training data has increasingly become the main bottleneck in the widespread deployment of machine learning systems.
Traditional manual labeling is time-consuming, labor-intensive, and expensive, while existing data enhancement methods such as semi-supervised learning, active learning, and transfer learning cannot generate training data quickly and at scale.
How to devise a suitable scheme that reduces the workload of manual labeling and improves the labeling efficiency of training data is therefore a technical problem that needs to be addressed.
Summary of the invention
In view of the foregoing, it is necessary to propose a data classification method, a data classification device, a computer apparatus, and a computer storage medium that improve the labeling efficiency of training data and classify data quickly and accurately.
The first aspect of the application provides a data classification method applied to a machine learning system. The method comprises:
obtaining a data set to be labeled {x_i | i = 1, 2, …, m};
labeling the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
computing the pairwise correlations of the labelling functions from the initial labels, and constructing a generative model of the labelling functions from the pairwise correlations;
estimating probabilistic labels of the data set with the generative model;
training a discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In another possible implementation, in the generative model, Λ denotes the initial label matrix formed by the initial labels, Y denotes the true label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, …, m, are the pairwise correlation factors of the labelling functions for each data item in the data set, and w is the parameter of the generative model to be estimated, with w ∈ R^{2n+|C|}.
In another possible implementation, the pairwise correlation is defined in terms of the indicator II{Λ_{i,j} = Λ_{i,k}}, which takes one value when the condition in braces holds and another value when it does not.
In another possible implementation, the data set to be labeled is an image set and the data to be classified is an image to be classified; or
the data set to be labeled is a text set and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set and the data to be classified is speech to be classified.
In another possible implementation, inputting the data to be classified into the trained discriminative model to obtain the category of the data to be classified comprises:
inputting the image to be classified into the trained discriminative model to obtain the user, object, or face attribute corresponding to the image to be classified;
inputting the text to be classified into the trained discriminative model to obtain the sentiment tendency, subject matter, or technical field corresponding to the text to be classified;
inputting the speech to be classified into the trained discriminative model to obtain the user, age bracket, or emotion corresponding to the speech to be classified.
In another possible implementation, training the discriminative model of the machine learning system on the probabilistic labels comprises:
training the discriminative model on the probabilistic labels by minimizing a noise-aware variant of the loss function of the discriminative model.
In another possible implementation, before labeling the data set with the labelling functions λ_j (j = 1, 2, …, n), the method further comprises:
filling missing values in the data set; and/or
correcting outliers in the data set.
The second aspect of the application provides a data classification device applied to a machine learning system. The device comprises:
an obtaining module, configured to obtain a data set to be labeled {x_i | i = 1, 2, …, m};
a labeling module, configured to label the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
a construction module, configured to compute the pairwise correlations of the labelling functions from the initial labels and to construct a generative model of the labelling functions from the pairwise correlations;
an estimation module, configured to estimate probabilistic labels of the data set with the generative model;
a training module, configured to train a discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
a classification module, configured to input data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In another possible implementation, in the generative model, Λ denotes the initial label matrix formed by the initial labels, Y denotes the true label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, …, m, are the pairwise correlation factors of the labelling functions for each data item in the data set, and w is the parameter of the generative model to be estimated, with w ∈ R^{2n+|C|}.
In another possible implementation, the pairwise correlation is defined in terms of the indicator II{Λ_{i,j} = Λ_{i,k}}, which takes one value when the condition in braces holds and another value when it does not.
In another possible implementation, the data set to be labeled is an image set and the data to be classified is an image to be classified; or
the data set to be labeled is a text set and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set and the data to be classified is speech to be classified.
In another possible implementation, inputting the data to be classified into the trained discriminative model to obtain the category of the data to be classified comprises:
inputting the image to be classified into the trained discriminative model to obtain the user, object, or face attribute corresponding to the image to be classified;
inputting the text to be classified into the trained discriminative model to obtain the sentiment tendency, subject matter, or technical field corresponding to the text to be classified;
inputting the speech to be classified into the trained discriminative model to obtain the user, age bracket, or emotion corresponding to the speech to be classified.
In another possible implementation, training the discriminative model of the machine learning system on the probabilistic labels comprises:
training the discriminative model on the probabilistic labels by minimizing a noise-aware variant of the loss function of the discriminative model.
In another possible implementation, the device further comprises:
a preprocessing module, configured to fill missing values in the data set and/or correct outliers in the data set.
The third aspect of the application provides a computer apparatus. The computer apparatus comprises a processor, and the processor implements the data classification method when executing a computer program stored in a memory.
The fourth aspect of the application provides a computer storage medium on which a computer program is stored, and the computer program implements the data classification method when executed by a processor.
The present invention obtains a data set to be labeled {x_i | i = 1, 2, …, m}; labels the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n; computes the pairwise correlations of the labelling functions from the initial labels and constructs a generative model of the labelling functions from the pairwise correlations; estimates probabilistic labels of the data set with the generative model; trains the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model; and inputs data to be classified into the trained discriminative model to obtain the category of the data to be classified. The present invention can quickly generate the training data required by the discriminative model of a machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
Brief description of the drawings
Fig. 1 is a flowchart of a data classification method provided by an embodiment of the present invention.
Fig. 2 is a structural diagram of a data classification device provided by an embodiment of the present invention.
Fig. 3 is a schematic diagram of a computer apparatus provided by an embodiment of the present invention.
Specific embodiment
To better understand the objects, features, and advantages of the present invention, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, in the absence of conflict, the embodiments of the application and the features in the embodiments can be combined with each other.
In the following description, numerous specific details are set forth in order to facilitate a full understanding of the present invention. The described embodiments are only a part of the embodiments of the present invention, rather than all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of the present invention.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention. The terms used in the specification of the present invention are only for the purpose of describing specific embodiments and are not intended to limit the present invention.
Preferably, the data classification method of the present invention is applied in one or more computer apparatuses. A computer apparatus is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. Its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.
The computer apparatus may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The computer apparatus can interact with a user through a keyboard, a mouse, a remote controller, a touch pad, a voice-control device, or the like.
Embodiment one
Fig. 1 is a flowchart of the data classification method provided by Embodiment 1 of the present invention. The data classification method is applied to a computer apparatus.
The data classification method of the present invention is applied to a machine learning system: it generates training data, trains the discriminative model of the machine learning system on the generated training data, and classifies data to be classified using the trained discriminative model. The data classification method can quickly generate the training data required by the discriminative model of the machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
As shown in Figure 1, the data classification method includes:
Step 101, obtaining a data set to be labeled {x_i | i = 1, 2, …, m}.
The data set to be labeled includes multiple data items that need to be labeled. After each data item in the data set is labeled, the label of the data item is obtained and used to train the discriminative model.
In an application scenario where the data classification method is applied to image classification, the data set to be labeled may be an image set. For example, the data set to be labeled includes images of different users, and the user corresponding to each image is to be labeled. As another example, the data set to be labeled includes images of different objects (such as a pen, a ball, or a book), and the object corresponding to each image is to be labeled. As yet another example, the data set to be labeled includes images of different face attributes (such as race, gender, age, or expression), and the face attribute corresponding to each image is to be labeled.
In an application scenario where the data classification method is applied to text classification, the data set to be labeled may be a text set. For example, the data set to be labeled includes texts of different sentiment tendencies, and the sentiment tendency corresponding to each text is to be labeled. As another example, the data set to be labeled includes texts of different subject matters, and the subject matter corresponding to each text is to be labeled. As yet another example, the data set to be labeled includes texts of different technical fields (such as physics, chemistry, or machinery), and the technical field corresponding to each text is to be labeled.
In an application scenario where the data classification method is applied to speech classification, the data set to be labeled may be a speech set. For example, the data set to be labeled includes speech of multiple different users, and the user corresponding to each speech sample is to be labeled. As another example, the data set to be labeled includes speech of users of multiple different age brackets, and the age bracket corresponding to each speech sample is to be labeled. As yet another example, the data set to be labeled includes speech of multiple different emotions, and the emotion corresponding to each speech sample is to be labeled.
The data set to be labeled may consist of data collected in real time. For example, person images may be captured in real time, and the captured person images are used as the data set to be labeled.
Alternatively, the data set to be labeled may be obtained from a preset data source. For example, data may be read from a preset database, in which a large amount of data (such as images) has been stored in advance, to form the data set to be labeled.
Alternatively, the data set to be labeled may be received from user input. For example, multiple images input by a user may be received and used as the data set to be labeled.
Step 102, labeling the data set {x_i | i = 1, 2, …, m} with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), where i = 1, 2, …, m and j = 1, 2, …, n.
A labelling function is a function that expresses a mapping from data to labels: it receives a data item and outputs a label for that data item. A labelling function is a black-box function and can be expressed as λ: X → Y ∪ {∅}, where λ denotes the labelling function, X denotes the data, Y denotes the initial label corresponding to X, and ∅ indicates that the labelling function abstains.
Compared with manual labeling of training data, labelling functions make it possible to generate the initial labels from various weak supervision sources (such as heuristic information or external knowledge bases). For example, suppose an image contains two persons, A and B, and their relationship is to be labeled. If it is known that A is the father of D and that B is the mother of D, then according to the heuristic rule "if X is the father of Z and Y is the mother of Z, then X and Y are husband and wife", the annotation result (i.e. the initial label) that A and B are husband and wife is obtained.
Labelling functions are not required to be accurate. That is, the initial labels obtained from labelling functions are unreliable. Unreliability may include incorrect labels, multiple labels, insufficient labels, partial labels, and so on.
Multiple labelling functions may be defined in advance as needed, for example 6 labelling functions.
Different labelling functions are allowed to conflict on the same data item. For example, labelling function 1 may label a data item as "husband and wife" while labelling function 2 labels it as "brother and sister".
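As an illustration only (not part of the claimed method), labelling functions of this kind can be written as short Python functions. The label constants, the data fields, and the helper knowledge_base below are hypothetical; they merely show how two heuristic labelling functions can abstain or conflict on the same data item.

```python
# Hypothetical sketch of labelling functions in the style described above.
ABSTAIN, COUPLE, SIBLINGS = -1, 1, 0

# Illustrative external knowledge source: (person, child) -> relation.
knowledge_base = {("A", "D"): "father", ("B", "D"): "mother"}

def lf_shared_child(x):
    """Heuristic: if person1 is the father and person2 the mother of the same
    child, label the pair as a couple; otherwise abstain."""
    children_a = {c for (p, c), r in knowledge_base.items() if p == x["person1"] and r == "father"}
    children_b = {c for (p, c), r in knowledge_base.items() if p == x["person2"] and r == "mother"}
    return COUPLE if children_a & children_b else ABSTAIN

def lf_same_surname(x):
    """Noisy heuristic: a shared surname suggests siblings; may conflict with lf_shared_child."""
    return SIBLINGS if x["surname1"] == x["surname2"] else ABSTAIN

x = {"person1": "A", "person2": "B", "surname1": "Li", "surname2": "Li"}
print(lf_shared_child(x), lf_same_surname(x))  # conflicting initial labels: 1 0
```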
Step 103, computing the pairwise correlations of the labelling functions λ_j (j = 1, 2, …, n) from the initial labels, and constructing a generative model of the labelling functions λ_j (j = 1, 2, …, n) from the pairwise correlations.
The pairwise correlation of the labelling functions λ_j (j = 1, 2, …, n) refers to the dependency between two labelling functions. The method models the statistical dependencies between labelling functions in order to improve the estimation performance. For example, if two labelling functions express similar heuristic information, the generative model can include this dependency and thereby avoid "double counting". Pairwise correlation is the most common form of dependency, so the set C of labelling function pairs (j, k) selected to be modeled as correlated is used.
In this embodiment, the pairwise correlation of the labelling functions λ_j (j = 1, 2, …, n) is expressed with the indicator II{Λ_{i,j} = Λ_{i,k}}, where Λ is the initial label matrix formed by the initial labels, Λ_{i,j} = λ_j(x_i); Y is the true label matrix, Y = (y_1, y_2, …, y_m); and C is the set of labelling function pairs (j, k). II{Λ_{i,j} = Λ_{i,k}} takes one value when the condition in braces holds and another value when it does not. In this embodiment, the value is 1 when the condition in braces holds and 0 when it does not.
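As a reading aid, the correlation factor implied by the definitions above can presumably be written as follows; this is a reconstruction from the surrounding symbols, not a verbatim formula from the patent.

```latex
% Presumed form of the pairwise correlation factor for data item x_i and
% a correlated labelling function pair (j, k) in C.
\phi^{\mathrm{Corr}}_{i,(j,k)}(\Lambda, y_i) \;=\; \mathrm{II}\{\Lambda_{i,j} = \Lambda_{i,k}\},
\qquad (j,k) \in C .
```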
A generative model of the labelling functions λ_j (j = 1, 2, …, n) can then be constructed from the pairwise correlations.
The core operation of the method is to model and integrate the noisy signals provided by the set of labelling functions: each labelling function is modeled as a noisy "voter" that makes errors correlated with those of the other labelling functions.
For each data item x_i in the data set {x_i | i = 1, 2, …, m}, the initial label vector is Λ_i = (Λ_{i,1}, Λ_{i,2}, …, Λ_{i,n}). In this embodiment, in the generative model constructed from the pairwise correlations, Z_w is a normalization constant, φ_i(Λ, y_i) is the vector of pairwise correlation factors of the labelling functions λ_j (j = 1, 2, …, n) for each data item in the data set, and w is the parameter of the generative model to be estimated, with w ∈ R^{2n+|C|}.
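Given that Z_w is a normalization constant and w has 2n + |C| components, the generative model presumably takes the standard exponential-family (data-programming) form sketched below; this is a reconstruction from the defined symbols, not a verbatim formula from the patent.

```latex
% Presumed form of the generative model over the initial label matrix Lambda
% and the true labels Y.
p_w(\Lambda, Y) \;=\; Z_w^{-1}\,
\exp\!\Big( \sum_{i=1}^{m} w^{\top} \phi_i(\Lambda, y_i) \Big).
% The 2n + |C| components of w presumably weight n labeling-propensity
% factors, n accuracy factors, and |C| correlation factors.
```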
Step 104, estimating the probabilistic labels of the data set {x_i | i = 1, 2, …, m} with the generative model.
To learn the generative model without access to the true labels, the negative log marginal likelihood can be minimized given the observed initial label matrix Λ.
The estimated value ŵ of the parameter w of the generative model can be obtained by optimizing this objective with interleaved stochastic gradient descent steps and Gibbs sampling steps.
Once the estimate ŵ is determined, the generative model is determined; inputting the initial label matrix Λ into the generative model then yields the probabilistic labels of the data set {x_i | i = 1, 2, …, m}.
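Putting the last few paragraphs into symbols, the training objective and the resulting probabilistic labels presumably read as follows (a reconstruction under the same assumptions as the generative model above):

```latex
% Presumed training objective (negative log marginal likelihood over the
% unobserved true labels Y) and probabilistic labels.
\hat{w} \;=\; \arg\min_{w}\; -\log \sum_{Y} p_w(\Lambda, Y),
\qquad
\tilde{Y} \;=\; p_{\hat{w}}(Y \mid \Lambda).
```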
Steps 103 and 104 amount to denoising the initial labels Λ_{i,j} = λ_j(x_i) with the generative model to obtain the probabilistic labels of the data set {x_i | i = 1, 2, …, m}. The data set together with its probabilistic labels is the training data of the machine learning system.
Step 105, training the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model.
Training the discriminative model of the machine learning system on the probabilistic labels means using the data set with the probabilistic labels as training samples for the discriminative model.
When the discriminative model is trained on the probabilistic labels Ỹ, the ultimate goal is to train a discriminative model that generalizes beyond the information expressed by the labelling functions. The discriminative model h_θ can be trained on the probabilistic labels Ỹ by minimizing a noise-aware variant of its loss function l(h_θ(x_i), y), i.e. the loss expected with respect to Ỹ.
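In symbols, the noise-aware objective described above presumably reads as follows (reconstructed from the definitions of l, h_θ, and Ỹ; not a verbatim formula from the patent):

```latex
% Presumed noise-aware training objective for the discriminative model.
\hat{\theta} \;=\; \arg\min_{\theta}\; \frac{1}{m} \sum_{i=1}^{m}
\mathbb{E}_{y \sim \tilde{Y}_i}\big[\, l\big(h_\theta(x_i), y\big) \,\big].
```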
When the discriminative model is trained, the parameters of the discriminative model are adjusted so that this noise-aware loss reaches its minimum. The training process may use the RMSprop algorithm, an improved stochastic gradient descent algorithm. RMSprop is well known in the art and is not described in detail here.
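A minimal sketch of such noise-aware training with RMSprop, assuming a PyTorch-style setup and a binary task; the model architecture, tensor names, and random stand-in data are illustrative assumptions, not part of the patent.

```python
import torch
import torch.nn as nn

# Illustrative assumption: x is an (m, d) feature tensor and y_prob holds the
# probabilistic labels p(y = 1 | Lambda) produced by the generative model.
m, d = 1000, 16
x = torch.randn(m, d)
y_prob = torch.rand(m)  # stand-in for the estimated probabilistic labels

h_theta = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.RMSprop(h_theta.parameters(), lr=1e-3)

for epoch in range(10):
    optimizer.zero_grad()
    logits = h_theta(x).squeeze(-1)
    # Noise-aware loss: expected binary cross-entropy under the probabilistic labels.
    loss = nn.functional.binary_cross_entropy_with_logits(logits, y_prob)
    loss.backward()
    optimizer.step()
```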
Step 106, inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In an application scenario where the data classification method is applied to image classification, an image to be classified is input into the trained discriminative model to obtain the category of the image to be classified, for example the user, the object, or the face attribute corresponding to the image to be classified.
In an application scenario where the data classification method is applied to text classification, a text to be classified is input into the trained discriminative model to obtain the category of the text to be classified, for example its sentiment tendency (such as a positive or negative sentiment tendency), its subject matter, or its technical field.
In an application scenario where the data classification method is applied to speech classification, speech to be classified is input into the trained discriminative model to obtain the category of the speech to be classified, for example the user, the age bracket, or the emotion corresponding to the speech to be classified.
The data classification method of Embodiment 1 obtains a data set to be labeled {x_i | i = 1, 2, …, m}; labels the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n; computes the pairwise correlations of the labelling functions from the initial labels and constructs a generative model of the labelling functions from the pairwise correlations; estimates probabilistic labels of the data set with the generative model; trains the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model; and inputs data to be classified into the trained discriminative model to obtain the category of the data to be classified. This embodiment can quickly generate the training data required by the discriminative model of a machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
In another embodiment, before labeling the data set {x_i | i = 1, 2, …, m} with the labelling functions λ_j, the method may further include: preprocessing the data set {x_i | i = 1, 2, …, m}.
Preprocessing the data set {x_i | i = 1, 2, …, m} may include filling missing values in the data set {x_i | i = 1, 2, …, m}.
A K-nearest-neighbor algorithm may be used: the K data items nearest to the data item with a missing value are determined (for example, according to Euclidean distance), and the missing value is estimated as a weighted average of the values of those K data items.
Alternatively, a prediction model may be used to predict the missing value; if the missing value is numeric, it may be filled with the mean, and if it is non-numeric, it may be filled with the mode.
Alternatively, missing values may be substituted by the mean. Preferably, because mean substitution rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the filling value obtained by mean substitution by a preset sampling factor, and using the resulting value as the final filling value. The preset sampling factor is a sampling factor set in advance and is greater than 1.
The missing values may also be filled by other methods, for example by regression fitting or by interpolation.
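A minimal sketch of the K-nearest-neighbor filling described above, using scikit-learn's KNNImputer; the feature matrix, K = 3, and the distance weighting are illustrative assumptions, not part of the patent.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Illustrative feature matrix with a missing value (np.nan).
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 1.9, 3.0],
    [0.9, 2.1, 2.8],
    [5.0, 5.2, 6.1],
])

# K-nearest-neighbor filling: each missing entry is replaced by a
# distance-weighted average over the K nearest rows, as described above.
imputer = KNNImputer(n_neighbors=3, weights="distance")
X_filled = imputer.fit_transform(X)
print(X_filled)
```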
Preprocessing the data set {x_i | i = 1, 2, …, m} may also include correcting outliers in the data set {x_i | i = 1, 2, …, m}. An outlier is a value that deviates significantly from the other data.
The method for correcting outliers may be the same as the method for filling missing values. For example, a K-nearest-neighbor algorithm may be used: the K data items nearest to the data item with an outlier are determined (for example, according to Euclidean distance), and the correction value for the outlier is estimated as a weighted average of the values of those K data items. Alternatively, a prediction model may be used to predict the correction value; if the outlier is numeric, it may be corrected with the mean, and if it is non-numeric, it may be corrected with the mode.
Alternatively, outliers may be replaced by the mean. Preferably, because mean replacement rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the correction value obtained by mean replacement by a preset sampling factor, and using the resulting value as the final correction value. The preset sampling factor is a sampling factor set in advance and is greater than 1.
The outliers may also be corrected by other methods, for example by regression fitting or by interpolation.
The method for correcting outliers may also differ from the method for filling missing values.
Preprocessing the data set may also include directly discarding data items with missing values and/or data items with outliers. Directly discarding such data items ensures that the data set is clean.
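A minimal sketch of the mean-replacement-with-sampling-factor correction described above; the z-score rule used to flag outliers and the factor 1.05 are illustrative assumptions, not part of the patent.

```python
import numpy as np

def correct_outliers_mean(values, z_thresh=3.0, sampling_factor=1.05):
    """Replace outliers (|z-score| > z_thresh) with the mean of the remaining
    values multiplied by a preset sampling factor (> 1), as described above."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    mask = np.abs(z) > z_thresh
    inlier_mean = values[~mask].mean()
    corrected = values.copy()
    corrected[mask] = inlier_mean * sampling_factor
    return corrected

print(correct_outliers_mean([1.0, 1.2, 0.9, 1.1, 50.0], z_thresh=1.5))
```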
Embodiment two
Fig. 2 is a structural diagram of the data classification device provided by Embodiment 2 of the present invention. The data classification device 20 is applied to a machine learning system and is used to generate training data, train the discriminative model of the machine learning system on the generated training data, and classify data to be classified using the trained discriminative model. The data classification device 20 can quickly generate the training data required by the discriminative model of the machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
As shown in Fig. 2, the data classification device 20 may include an obtaining module 201, a labeling module 202, a construction module 203, an estimation module 204, a training module 205, and a classification module 206.
The obtaining module 201 is configured to obtain a data set to be labeled {x_i | i = 1, 2, …, m}.
The data set to be labeled includes multiple data items that need to be labeled. After each data item in the data set is labeled, the label of the data item is obtained and used to train the discriminative model.
In an application scenario where the data classification device is applied to image classification, the data set to be labeled may be an image set. For example, the data set to be labeled includes images of different users, and the user corresponding to each image is to be labeled. As another example, the data set to be labeled includes images of different objects (such as a pen, a ball, or a book), and the object corresponding to each image is to be labeled. As yet another example, the data set to be labeled includes images of different face attributes (such as race, gender, age, or expression), and the face attribute corresponding to each image is to be labeled.
In an application scenario where the data classification device is applied to text classification, the data set to be labeled may be a text set. For example, the data set to be labeled includes texts of different sentiment tendencies, and the sentiment tendency corresponding to each text is to be labeled. As another example, the data set to be labeled includes texts of different subject matters, and the subject matter corresponding to each text is to be labeled. As yet another example, the data set to be labeled includes texts of different technical fields (such as physics, chemistry, or machinery), and the technical field corresponding to each text is to be labeled.
In an application scenario where the data classification device is applied to speech classification, the data set to be labeled may be a speech set. For example, the data set to be labeled includes speech of multiple different users, and the user corresponding to each speech sample is to be labeled. As another example, the data set to be labeled includes speech of users of multiple different age brackets, and the age bracket corresponding to each speech sample is to be labeled. As yet another example, the data set to be labeled includes speech of multiple different emotions, and the emotion corresponding to each speech sample is to be labeled.
The data set to be labeled may consist of data collected in real time. For example, person images may be captured in real time, and the captured person images are used as the data set to be labeled.
Alternatively, the data set to be labeled may be obtained from a preset data source. For example, data may be read from a preset database, in which a large amount of data (such as images) has been stored in advance, to form the data set to be labeled.
Alternatively, the data set to be labeled may be received from user input. For example, multiple images input by a user may be received and used as the data set to be labeled.
The labeling module 202 is configured to label the data set {x_i | i = 1, 2, …, m} with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), where i = 1, 2, …, m and j = 1, 2, …, n.
A labelling function is a function that expresses a mapping from data to labels: it receives a data item and outputs a label for that data item. A labelling function is a black-box function and can be expressed as λ: X → Y ∪ {∅}, where λ denotes the labelling function, X denotes the data, Y denotes the initial label corresponding to X, and ∅ indicates that the labelling function abstains.
Compared with manual labeling of training data, labelling functions make it possible to generate the initial labels from various weak supervision sources (such as heuristic information or external knowledge bases). For example, suppose an image contains two persons, A and B, and their relationship is to be labeled. If it is known that A is the father of D and that B is the mother of D, then according to the heuristic rule "if X is the father of Z and Y is the mother of Z, then X and Y are husband and wife", the annotation result (i.e. the initial label) that A and B are husband and wife is obtained.
Labelling functions are not required to be accurate. That is, the initial labels obtained from labelling functions are unreliable. Unreliability may include incorrect labels, multiple labels, insufficient labels, partial labels, and so on.
Multiple labelling functions may be defined in advance as needed, for example 6 labelling functions.
Different labelling functions are allowed to conflict on the same data item. For example, labelling function 1 may label a data item as "husband and wife" while labelling function 2 labels it as "brother and sister".
The construction module 203 is configured to compute the pairwise correlations of the labelling functions λ_j (j = 1, 2, …, n) from the initial labels, and to construct a generative model of the labelling functions λ_j (j = 1, 2, …, n) from the pairwise correlations.
The pairwise correlation of the labelling functions λ_j (j = 1, 2, …, n) refers to the dependency between two labelling functions. The method models the statistical dependencies between labelling functions in order to improve the estimation performance. For example, if two labelling functions express similar heuristic information, the generative model can include this dependency and thereby avoid "double counting". Pairwise correlation is the most common form of dependency, so the set C of labelling function pairs (j, k) selected to be modeled as correlated is used.
In this embodiment, the pairwise correlation of the labelling functions λ_j (j = 1, 2, …, n) is expressed with the indicator II{Λ_{i,j} = Λ_{i,k}}, where Λ is the initial label matrix formed by the initial labels, Λ_{i,j} = λ_j(x_i); Y is the true label matrix, Y = (y_1, y_2, …, y_m); and C is the set of labelling function pairs (j, k). II{Λ_{i,j} = Λ_{i,k}} takes one value when the condition in braces holds and another value when it does not. In this embodiment, the value is 1 when the condition in braces holds and 0 when it does not.
A generative model of the labelling functions λ_j (j = 1, 2, …, n) can then be constructed from the pairwise correlations.
The core operation of the method is to model and integrate the noisy signals provided by the set of labelling functions: each labelling function is modeled as a noisy "voter" that makes errors correlated with those of the other labelling functions.
For each data item x_i in the data set {x_i | i = 1, 2, …, m}, the initial label vector is Λ_i = (Λ_{i,1}, Λ_{i,2}, …, Λ_{i,n}). In this embodiment, in the generative model constructed from the pairwise correlations, Z_w is a normalization constant, φ_i(Λ, y_i) is the vector of pairwise correlation factors of the labelling functions λ_j (j = 1, 2, …, n) for each data item in the data set, and w is the parameter of the generative model to be estimated, with w ∈ R^{2n+|C|}.
The estimation module 204 is configured to estimate the probabilistic labels of the data set {x_i | i = 1, 2, …, m} with the generative model.
To learn the generative model without access to the true labels, the negative log marginal likelihood can be minimized given the observed initial label matrix Λ.
The estimated value ŵ of the parameter w of the generative model can be obtained by optimizing this objective with interleaved stochastic gradient descent steps and Gibbs sampling steps.
Once the estimate ŵ is determined, the generative model is determined; inputting the initial label matrix Λ into the generative model then yields the probabilistic labels of the data set {x_i | i = 1, 2, …, m}.
The construction module 203 and the estimation module 204 thus denoise the initial labels Λ_{i,j} = λ_j(x_i) with the generative model to obtain the probabilistic labels of the data set {x_i | i = 1, 2, …, m}. The data set together with its probabilistic labels is the training data of the machine learning system.
The training module 205 is configured to train the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model.
Training the discriminative model of the machine learning system on the probabilistic labels means using the data set with the probabilistic labels as training samples for the discriminative model.
When the discriminative model is trained on the probabilistic labels Ỹ, the ultimate goal is to train a discriminative model that generalizes beyond the information expressed by the labelling functions. The discriminative model h_θ can be trained on the probabilistic labels Ỹ by minimizing a noise-aware variant of its loss function l(h_θ(x_i), y), i.e. the loss expected with respect to Ỹ.
When the discriminative model is trained, the parameters of the discriminative model are adjusted so that this noise-aware loss reaches its minimum. The training process may use the RMSprop algorithm, an improved stochastic gradient descent algorithm. RMSprop is well known in the art and is not described in detail here.
The classification module 206 is configured to input data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In an application scenario where the data classification device is applied to image classification, an image to be classified is input into the trained discriminative model to obtain the category of the image to be classified, for example the user, the object, or the face attribute corresponding to the image to be classified.
In an application scenario where the data classification device is applied to text classification, a text to be classified is input into the trained discriminative model to obtain the category of the text to be classified, for example its sentiment tendency (such as a positive or negative sentiment tendency), its subject matter, or its technical field.
In an application scenario where the data classification device is applied to speech classification, speech to be classified is input into the trained discriminative model to obtain the category of the speech to be classified, for example the user, the age bracket, or the emotion corresponding to the speech to be classified.
The data classification device 20 of Embodiment 2 obtains a data set to be labeled {x_i | i = 1, 2, …, m}; labels the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n; computes the pairwise correlations of the labelling functions from the initial labels and constructs a generative model of the labelling functions from the pairwise correlations; estimates probabilistic labels of the data set with the generative model; trains the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model; and inputs data to be classified into the trained discriminative model to obtain the category of the data to be classified. This embodiment can quickly generate the training data required by the discriminative model of a machine learning system, solving the technical problems that manually labeled training data is hard to obtain, takes a long time to label, and cannot be guaranteed to be accurate. It reduces the workload of manual labeling, improves the labeling efficiency and accuracy of training data, allows the discriminative model to be trained quickly on that training data, and uses the discriminative model to classify data quickly and accurately.
In another embodiment, the data classification device 20 may further include a preprocessing module, configured to preprocess the data set {x_i | i = 1, 2, …, m}.
Preprocessing the data set {x_i | i = 1, 2, …, m} may include filling missing values in the data set {x_i | i = 1, 2, …, m}.
A K-nearest-neighbor algorithm may be used: the K data items nearest to the data item with a missing value are determined (for example, according to Euclidean distance), and the missing value is estimated as a weighted average of the values of those K data items.
Alternatively, a prediction model may be used to predict the missing value; if the missing value is numeric, it may be filled with the mean, and if it is non-numeric, it may be filled with the mode.
Alternatively, missing values may be substituted by the mean. Preferably, because mean substitution rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the filling value obtained by mean substitution by a preset sampling factor, and using the resulting value as the final filling value. The preset sampling factor is a sampling factor set in advance and is greater than 1.
The missing values may also be filled by other methods, for example by regression fitting or by interpolation.
Preprocessing the data set {x_i | i = 1, 2, …, m} may also include correcting outliers in the data set {x_i | i = 1, 2, …, m}. An outlier is a value that deviates significantly from the other data.
The method for correcting outliers may be the same as the method for filling missing values. For example, a K-nearest-neighbor algorithm may be used: the K data items nearest to the data item with an outlier are determined (for example, according to Euclidean distance), and the correction value for the outlier is estimated as a weighted average of the values of those K data items. Alternatively, a prediction model may be used to predict the correction value; if the outlier is numeric, it may be corrected with the mean, and if it is non-numeric, it may be corrected with the mode.
Alternatively, outliers may be replaced by the mean. Preferably, because mean replacement rests on the assumption that values are missing completely at random and shrinks the variance and standard deviation of the data, the method may further include: multiplying the correction value obtained by mean replacement by a preset sampling factor, and using the resulting value as the final correction value. The preset sampling factor is a sampling factor set in advance and is greater than 1.
The outliers may also be corrected by other methods, for example by regression fitting or by interpolation.
The method for correcting outliers may also differ from the method for filling missing values.
Preprocessing the data set may also include directly discarding data items with missing values and/or data items with outliers. Directly discarding such data items ensures that the data set is clean.
Embodiment three
This embodiment provides a computer storage medium in which a computer program is stored. When the computer program is executed by a processor, the steps in the above data classification method embodiment are realized, for example steps 101-106 shown in Fig. 1:
Step 101, obtaining a data set to be labeled {x_i | i = 1, 2, …, m};
Step 102, labeling the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
Step 103, computing the pairwise correlations of the labelling functions from the initial labels, and constructing a generative model of the labelling functions from the pairwise correlations;
Step 104, estimating probabilistic labels of the data set with the generative model;
Step 105, training the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
Step 106, inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
Alternatively, when the computer program is executed by a processor, the functions of the modules in the above device embodiment are realized, for example modules 201-206 in Fig. 2:
the obtaining module 201, configured to obtain a data set to be labeled {x_i | i = 1, 2, …, m};
the labeling module 202, configured to label the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
the construction module 203, configured to compute the pairwise correlations of the labelling functions from the initial labels and to construct a generative model of the labelling functions from the pairwise correlations;
the estimation module 204, configured to estimate probabilistic labels of the data set with the generative model;
the training module 205, configured to train the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
the classification module 206, configured to input data to be classified into the trained discriminative model to obtain the category of the data to be classified.
Example IV
Fig. 3 is a schematic diagram of the computer apparatus provided by Embodiment 4 of the present invention. The computer apparatus 30 includes a memory 301, a processor 302, and a computer program 303 (for example, a data classification program) that is stored in the memory 301 and can be run on the processor 302. When executing the computer program 303, the processor 302 realizes the steps in the above data classification method embodiment, for example steps 101-106 shown in Fig. 1:
Step 101, obtaining a data set to be labeled {x_i | i = 1, 2, …, m};
Step 102, labeling the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
Step 103, computing the pairwise correlations of the labelling functions from the initial labels, and constructing a generative model of the labelling functions from the pairwise correlations;
Step 104, estimating probabilistic labels of the data set with the generative model;
Step 105, training the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
Step 106, inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
Alternatively, when the computer program is executed by the processor, the functions of the modules in the above device embodiment are realized, for example modules 201-206 in Fig. 2:
the obtaining module 201, configured to obtain a data set to be labeled {x_i | i = 1, 2, …, m};
the labeling module 202, configured to label the data set with labelling functions λ_j (j = 1, 2, …, n) to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n;
the construction module 203, configured to compute the pairwise correlations of the labelling functions from the initial labels and to construct a generative model of the labelling functions from the pairwise correlations;
the estimation module 204, configured to estimate probabilistic labels of the data set with the generative model;
the training module 205, configured to train the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
the classification module 206, configured to input data to be classified into the trained discriminative model to obtain the category of the data to be classified.
Illustratively, the computer program 303 may be divided into one or more modules, and the one or more modules are stored in the memory 301 and executed by the processor 302 to complete the method. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments describe the execution process of the computer program 303 in the computer apparatus 30. For example, the computer program 303 may be divided into the obtaining module 201, the labeling module 202, the construction module 203, the estimation module 204, the training module 205, and the classification module 206 in Fig. 2; see Embodiment 2 for the specific functions of the modules.
The computer installation 30 can be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. Those skilled in the art will understand that the schematic Fig. 3 is only an example of the computer installation 30 and does not constitute a limitation on the computer installation 30; it may include more or fewer components than illustrated, combine certain components, or use different components. For example, the computer installation 30 may further include input/output devices, network access devices, buses, and the like.
The processor 302 can be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor can be a microprocessor, or the processor 302 can be any conventional processor. The processor 302 is the control center of the computer installation 30 and connects the various parts of the entire computer installation 30 through various interfaces and lines.
The memory 301 can be used to store the computer program 303. The processor 302 implements the various functions of the computer installation 30 by running or executing the computer programs or modules stored in the memory 301 and by calling the data stored in the memory 301. The memory 301 can mainly include a program storage area and a data storage area, where the program storage area can store the operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area can store data created according to the use of the computer installation 30 (such as audio data, a phone book, etc.). In addition, the memory 301 can include a high-speed random access memory and can also include a non-volatile memory, such as a hard disk, an internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
If the integrated modules of the computer installation 30 are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer storage medium. Based on this understanding, all or part of the processes in the above method embodiments of the present invention can also be completed by instructing the relevant hardware through a computer program. The computer program can be stored in a computer storage medium, and when the computer program is executed by a processor, the steps of the above method embodiments can be implemented. The computer program includes computer program code, which can be in source code form, object code form, an executable file, certain intermediate forms, or the like. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash disk, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device, and method can be implemented in other ways. For example, the device embodiments described above are merely exemplary; for instance, the division into modules is only a logical function division, and there may be other division manners in actual implementation.
The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical modules; they can be located in one place or distributed over multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention can be integrated into one processing module, or each module can exist alone physically, or two or more modules can be integrated into one module. The above integrated module can be implemented in the form of hardware, or in the form of hardware plus software function modules.
The above integrated module implemented in the form of a software function module can be stored in a computer storage medium. The above software function module is stored in a storage medium and includes several instructions for causing a computer device (which can be a personal computer, a server, a network device, etc.) or a processor (processor) to execute part of the steps of the methods of the embodiments of the present invention.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments and can be implemented in other specific forms without departing from the spirit or essential attributes of the present invention. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; all changes that fall within the meaning and scope of the equivalent elements of the claims are therefore intended to be included in the present invention. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices stated in the system claims can also be implemented by one module or device through software or hardware. Words such as "first" and "second" are used to indicate names and do not indicate any particular order.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that the technical solution of the present invention can be modified or equivalently replaced without departing from the spirit and scope of the technical solution of the present invention.
Claims (10)
1. A data classification method applied to a machine learning system, characterized in that the method includes:
obtaining a data set to be labeled {x_i | i = 1, 2, ..., m};
labeling the data set with labelling functions λ_j, j = 1, 2, ..., n, to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
calculating pairwise correlations of the labelling functions from the initial labels, and constructing a generative model of the labelling functions from the pairwise correlations;
estimating probabilistic labels of the data set according to the generative model;
training a discriminative model of the machine learning system according to the probabilistic labels to obtain a trained discriminative model;
inputting data to be classified into the trained discriminative model to obtain a category of the data to be classified.
2. The method according to claim 1, characterized in that the generative model is:
where Λ denotes the initial label matrix formed by the initial labels, Y denotes the true label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, ..., m, are the pairwise correlation factors of the labelling functions for each data item in the data set, and w is the parameter of the generative model to be determined, w ∈ R^(2n+|C|).
3. The method according to claim 2, characterized in that the pairwise correlations are:
where the indicator term denotes the value taken according to whether the condition in the braces { } holds or does not hold.
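Since the formulas are referenced here only through the variable names above, the following is a plausible reconstruction based on the standard data-programming formulation; the indicator factors are an assumption consistent with w ∈ R^(2n+|C|), not the application's own expressions.

```latex
% Assumed form of the generative model of claim 2 and the pairwise factors of claim 3.
\[
  p_w(\Lambda, Y) \;=\; Z_w^{-1} \exp\!\Big( \sum_{i=1}^{m} w^{\top} \phi_i(\Lambda, y_i) \Big),
  \qquad w \in \mathbb{R}^{\,2n + |C|},
\]
\[
  \phi_{i,j}^{\mathrm{Lab}}(\Lambda, y_i) = \mathbf{1}\{\Lambda_{i,j} \neq 0\}, \qquad
  \phi_{i,j}^{\mathrm{Acc}}(\Lambda, y_i) = \mathbf{1}\{\Lambda_{i,j} = y_i\}, \qquad
  \phi_{i,j,k}^{\mathrm{Corr}}(\Lambda, y_i) = \mathbf{1}\{\Lambda_{i,j} = \Lambda_{i,k}\}, \ (j,k) \in C,
\]
```

Here 1{·} equals 1 when the condition in braces holds and 0 otherwise; under this assumption, the n labelling-propensity factors, n accuracy factors, and |C| correlation factors account for the 2n + |C| entries of w.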
4. The method according to claim 1, characterized in that:
the data set to be labeled is an image set, and the data to be classified is an image to be classified; or
the data set to be labeled is a text set, and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set, and the data to be classified is speech to be classified.
5. The method according to claim 4, characterized in that inputting the data to be classified into the trained discriminative model to obtain the category of the data to be classified includes:
inputting the image to be classified into the trained discriminative model to obtain a user, object, or face attribute corresponding to the image to be classified;
inputting the text to be classified into the trained discriminative model to obtain an emotional tendency, subject matter, or technical field corresponding to the text to be classified;
inputting the speech to be classified into the trained discriminative model to obtain a user, age group, or emotion corresponding to the speech to be classified.
6. The method according to claim 1, characterized in that training the discriminative model of the machine learning system according to the probabilistic labels includes:
training the discriminative model on the probabilistic labels by minimizing a noise-aware variant of the loss function of the discriminative model.
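One common reading of such noise-aware training, offered here as an assumption rather than as the claimed formulation, is to minimize the loss in expectation over the probabilistic labels:

```latex
% Assumed noise-aware training objective over the probabilistic labels.
\[
  \hat{\theta} \;=\; \arg\min_{\theta} \; \frac{1}{m} \sum_{i=1}^{m}
    \mathbb{E}_{y \sim \tilde{p}_{\hat{w}}(\,\cdot \mid \Lambda_i)}
    \big[\, \ell\big(f_{\theta}(x_i),\, y\big) \,\big]
\]
```

where f_θ is the discriminative model, ℓ its loss function, and p̃_ŵ(y | Λ_i) the probabilistic label of x_i produced by the generative model.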
7. The method according to any one of claims 1-6, characterized in that before labeling the data set with the labelling functions λ_j, j = 1, 2, ..., n, the method further includes:
filling missing values in the data set; and/or
correcting outliers in the data set.
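A minimal sketch of this preprocessing, assuming column-wise mean/mode filling and 3-sigma clipping (the strategies and names below are illustrative assumptions, not prescribed by the claim):

```python
# Hypothetical preprocessing: fill missing values, then correct (clip) outliers.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame, z_thresh: float = 3.0) -> pd.DataFrame:
    df = df.copy()
    for col in df.select_dtypes(include=[np.number]).columns:
        df[col] = df[col].fillna(df[col].mean())           # fill missing values
        mean, std = df[col].mean(), df[col].std()
        lower, upper = mean - z_thresh * std, mean + z_thresh * std
        df[col] = df[col].clip(lower, upper)               # correct outliers
    for col in df.select_dtypes(exclude=[np.number]).columns:
        mode = df[col].mode()
        if not mode.empty:
            df[col] = df[col].fillna(mode.iloc[0])         # fill missing categories
    return df
```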
8. A data classification device applied to a machine learning system, characterized in that the device includes:
an acquisition module, for obtaining a data set to be labeled {x_i | i = 1, 2, ..., m};
a labeling module, for labeling the data set with labelling functions λ_j, j = 1, 2, ..., n, to obtain initial labels of the data set Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
a construction module, for calculating pairwise correlations of the labelling functions from the initial labels and constructing a generative model of the labelling functions from the pairwise correlations;
an estimation module, for estimating probabilistic labels of the data set according to the generative model;
a training module, for training a discriminative model of the machine learning system according to the probabilistic labels to obtain a trained discriminative model;
a classification module, for inputting data to be classified into the trained discriminative model to obtain a category of the data to be classified.
9. A computer installation, characterized in that the computer installation includes a processor, and the processor is configured to execute a computer program stored in a memory to implement the data classification method according to any one of claims 1-7.
10. A computer storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the data classification method according to any one of claims 1-7 is implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910310574.1A CN110196908A (en) | 2019-04-17 | 2019-04-17 | Data classification method, device, computer installation and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110196908A true CN110196908A (en) | 2019-09-03 |
Family
ID=67752025
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910310574.1A Pending CN110196908A (en) | 2019-04-17 | 2019-04-17 | Data classification method, device, computer installation and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110196908A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104463202A (en) * | 2014-11-28 | 2015-03-25 | 苏州大学 | Multi-class image semi-supervised classifying method and system |
CN107220600A (en) * | 2017-05-17 | 2017-09-29 | 清华大学深圳研究生院 | A kind of Picture Generation Method and generation confrontation network based on deep learning |
US20190080164A1 (en) * | 2017-09-14 | 2019-03-14 | Chevron U.S.A. Inc. | Classification of character strings using machine-learning |
CN108229543A (en) * | 2017-12-22 | 2018-06-29 | 中国科学院深圳先进技术研究院 | Image classification design methods and device |
Non-Patent Citations (2)
Title |
---|
KAI FAN et al.: "Learning a generative classifier from label proportions", NEUROCOMPUTING, pages 47-55 *
蒋俊钊; 程良伦; 李全杰: "Multi-label classification algorithm of convolutional neural networks based on label correlation" (基于标签相关性的卷积神经网络多标签分类算法), 工业控制计算机 (Industrial Control Computer), no. 07, pages 108-109 *
Cited By (33)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112488141B (en) * | 2019-09-12 | 2023-04-07 | 中移(苏州)软件技术有限公司 | Method and device for determining application range of Internet of things card and computer readable storage medium |
CN112488141A (en) * | 2019-09-12 | 2021-03-12 | 中移(苏州)软件技术有限公司 | Method and device for determining application range of Internet of things card and computer readable storage medium |
CN112529024A (en) * | 2019-09-17 | 2021-03-19 | 株式会社理光 | Sample data generation method and device and computer readable storage medium |
CN110648325B (en) * | 2019-09-29 | 2022-05-17 | 北部湾大学 | Nixing pottery big data analysis genuine product checking system |
CN110648325A (en) * | 2019-09-29 | 2020-01-03 | 北部湾大学 | Nixing pottery big data analysis genuine product checking system |
CN112825144A (en) * | 2019-11-20 | 2021-05-21 | 深圳云天励飞技术有限公司 | Picture labeling method and device, electronic equipment and storage medium |
CN112825144B (en) * | 2019-11-20 | 2024-06-07 | 深圳云天励飞技术有限公司 | Picture marking method and device, electronic equipment and storage medium |
CN111291823A (en) * | 2020-02-24 | 2020-06-16 | 腾讯科技(深圳)有限公司 | Fusion method and device of classification models, electronic equipment and storage medium |
CN111291823B (en) * | 2020-02-24 | 2023-08-18 | 腾讯科技(深圳)有限公司 | Fusion method and device of classification model, electronic equipment and storage medium |
CN111582360A (en) * | 2020-05-06 | 2020-08-25 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for labeling data |
CN111582360B (en) * | 2020-05-06 | 2023-08-15 | 北京字节跳动网络技术有限公司 | Method, apparatus, device and medium for labeling data |
CN111860572B (en) * | 2020-06-04 | 2024-01-26 | 北京百度网讯科技有限公司 | Data set distillation method, device, electronic equipment and storage medium |
CN111860572A (en) * | 2020-06-04 | 2020-10-30 | 北京百度网讯科技有限公司 | Data set distillation method, device, electronic equipment and storage medium |
CN112102062A (en) * | 2020-07-24 | 2020-12-18 | 北京淇瑀信息科技有限公司 | Risk assessment method and device based on weak supervised learning and electronic equipment |
CN112115267A (en) * | 2020-09-28 | 2020-12-22 | 平安科技(深圳)有限公司 | Training method, device and equipment of text classification model and storage medium |
CN112115267B (en) * | 2020-09-28 | 2023-07-07 | 平安科技(深圳)有限公司 | Training method, device, equipment and storage medium of text classification model |
CN111967541A (en) * | 2020-10-21 | 2020-11-20 | 上海冰鉴信息科技有限公司 | Data classification method and device based on multi-platform samples |
CN111967541B (en) * | 2020-10-21 | 2021-01-05 | 上海冰鉴信息科技有限公司 | Data classification method and device based on multi-platform samples |
CN112199502B (en) * | 2020-10-26 | 2024-03-15 | 网易(杭州)网络有限公司 | Verse generation method and device based on emotion, electronic equipment and storage medium |
CN112199502A (en) * | 2020-10-26 | 2021-01-08 | 网易(杭州)网络有限公司 | Emotion-based poetry sentence generation method and device, electronic equipment and storage medium |
CN112651447B (en) * | 2020-12-29 | 2023-09-26 | 广东电网有限责任公司电力调度控制中心 | Ontology-based resource classification labeling method and system |
CN112651447A (en) * | 2020-12-29 | 2021-04-13 | 广东电网有限责任公司电力调度控制中心 | Resource classification labeling method and system based on ontology |
CN112925958A (en) * | 2021-02-05 | 2021-06-08 | 深圳力维智联技术有限公司 | Multi-source heterogeneous data adaptation method, device, equipment and readable storage medium |
WO2022174496A1 (en) * | 2021-02-20 | 2022-08-25 | 平安科技(深圳)有限公司 | Data annotation method and apparatus based on generative model, and device and storage medium |
CN112860919A (en) * | 2021-02-20 | 2021-05-28 | 平安科技(深圳)有限公司 | Data labeling method, device and equipment based on generative model and storage medium |
CN113590677A (en) * | 2021-07-14 | 2021-11-02 | 上海淇玥信息技术有限公司 | Data processing method and device and electronic equipment |
CN113344916A (en) * | 2021-07-21 | 2021-09-03 | 上海媒智科技有限公司 | Method, system, terminal, medium and application for acquiring machine learning model capability |
CN113761925B (en) * | 2021-07-23 | 2022-10-28 | 中国科学院自动化研究所 | Named entity identification method, device and equipment based on noise perception mechanism |
CN113761925A (en) * | 2021-07-23 | 2021-12-07 | 中国科学院自动化研究所 | Named entity identification method, device and equipment based on noise perception mechanism |
CN113836118B (en) * | 2021-11-24 | 2022-03-08 | 亿海蓝(北京)数据技术股份公司 | Ship static data supplementing method and device, electronic equipment and readable storage medium |
CN113836118A (en) * | 2021-11-24 | 2021-12-24 | 亿海蓝(北京)数据技术股份公司 | Ship static data supplementing method and device, electronic equipment and readable storage medium |
CN114064973B (en) * | 2022-01-11 | 2022-05-03 | 人民网科技(北京)有限公司 | Video news classification model establishing method, classification method, device and equipment |
CN114064973A (en) * | 2022-01-11 | 2022-02-18 | 人民网科技(北京)有限公司 | Video news classification model establishing method, classification method, device and equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110196908A (en) | Data classification method, device, computer installation and storage medium | |
CN113822494B (en) | Risk prediction method, device, equipment and storage medium | |
CN110019770A (en) | The method and apparatus of train classification models | |
CN111324696B (en) | Entity extraction method, entity extraction model training method, device and equipment | |
WO2021073390A1 (en) | Data screening method and apparatus, device and computer-readable storage medium | |
CN109599187A (en) | A kind of online interrogation point examines method, server, terminal, equipment and medium | |
CN109902672A (en) | Image labeling method and device, storage medium, computer equipment | |
CN110442859A (en) | Method, device and equipment for generating labeled corpus and storage medium | |
CN110458600A (en) | Portrait model training method, device, computer equipment and storage medium | |
CN107807958A (en) | A kind of article list personalized recommendation method, electronic equipment and storage medium | |
WO2023029507A1 (en) | Data analysis-based service distribution method and apparatus, device, and storage medium | |
CN110807086A (en) | Text data labeling method and device, storage medium and electronic equipment | |
CA3149895A1 (en) | Machine learning system for summarizing tax documents with non-structured portions | |
CN110008365A (en) | A kind of image processing method, device, equipment and readable storage medium storing program for executing | |
JP7347179B2 (en) | Methods, devices and computer programs for extracting web page content | |
CN112287656A (en) | Text comparison method, device, equipment and storage medium | |
CN104077408B (en) | Extensive across media data distributed semi content of supervision method for identifying and classifying and device | |
CN108735198A (en) | Phoneme synthesizing method, device based on medical conditions data and electronic equipment | |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment | |
CN112101488B (en) | Training method and device for machine learning model and storage medium | |
CN113837307A (en) | Data similarity calculation method and device, readable medium and electronic equipment | |
CN108629381A (en) | Crowd's screening technique based on big data and terminal device | |
CN109657710B (en) | Data screening method and device, server and storage medium | |
CN116978087A (en) | Model updating method, device, equipment, storage medium and program product | |
WO2021135330A1 (en) | Image sample selection method and related apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |