CN110196908A - Data classification method, device, computer apparatus and storage medium - Google Patents

Data classification method, device, computer apparatus and storage medium

Info

Publication number
CN110196908A
CN110196908A (application CN201910310574.1A)
Authority
CN
China
Prior art keywords
data
data set
sorted
training
discrimination model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910310574.1A
Other languages
Chinese (zh)
Inventor
刘康龙
徐国强
邱寒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
Original Assignee
OneConnect Smart Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Smart Technology Co Ltd filed Critical OneConnect Smart Technology Co Ltd
Priority to CN201910310574.1A priority Critical patent/CN110196908A/en
Publication of CN110196908A publication Critical patent/CN110196908A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a data classification method, device, computer apparatus and storage medium. The method includes: obtaining a data set to be labeled; labeling the data set with labeling functions to obtain initial labels for the data set; computing the pairwise correlations of the labeling functions from the initial labels and constructing a generative model of the labeling functions from those correlations; estimating probabilistic labels for the data set with the generative model; training a discriminative model on the probabilistic labels to obtain a trained discriminative model; and inputting data to be classified into the trained discriminative model to obtain its category. The invention improves the efficiency and accuracy of training-data annotation, allows a discriminative model to be trained quickly on that data, and uses the trained model to classify data quickly and accurately.

Description

Data classification method, device, computer apparatus and storage medium
Technical field
The present invention relates to the field of machine learning, and in particular to a data classification method, device, computer apparatus and computer storage medium.
Background art
With the rapid development of artificial intelligence, machine learning (and deep learning in particular) has been applied across many industries. Labeling training data has increasingly become the biggest bottleneck in deploying machine learning systems at scale.
Traditional manual annotation is time-consuming, labor-intensive and expensive, while existing data-enhancement approaches such as semi-supervised learning, active learning and transfer learning cannot generate training data quickly at scale.
How to devise a scheme that reduces the workload of manual annotation and improves the annotation efficiency of training data is a technical problem that practitioners urgently need to solve.
Summary of the invention
In view of the above, it is necessary to provide a data classification method, device, computer apparatus and computer storage medium that can improve the annotation efficiency of training data and classify data quickly and accurately.
The first aspect of the application provides a data classification method applied to a machine learning system, the method comprising:
obtaining a data set to be labeled {x_i | i = 1, 2, ..., m};
labeling the data set with labeling functions λ_j, j = 1, 2, ..., n, to obtain the initial labels of the data set, Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
computing the pairwise correlations of the labeling functions from the initial labels, and constructing a generative model of the labeling functions from the pairwise correlations;
estimating the probabilistic labels of the data set with the generative model;
training the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In another possible implementation, the generative model is:

p_w(Λ, Y) = Z_w^{-1} exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )

where Λ is the initial-label matrix formed by the initial labels, Y is the true-label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, ..., m, collects the pairwise-correlation factors of the labeling functions for each data point in the data set, and w ∈ R^{2n+|C|} is the parameter of the generative model to be estimated.
In another possible implementation, the pairwise correlation is:

φ^{Corr}_{i,j,k}(Λ, y_i) = 1{Λ_{i,j} = Λ_{i,k}}, (j, k) ∈ C

where 1{Λ_{i,j} = Λ_{i,k}} denotes the indicator that takes one value when the condition in braces holds and another when it does not.
In another possible implementation, the data set to be labeled is an image set and the data to be classified is an image to be classified; or
the data set to be labeled is a text set and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set and the data to be classified is speech to be classified.
In another possible implementation, inputting the data to be classified into the trained discriminative model to obtain its category includes:
inputting the image to be classified into the trained discriminative model to obtain the user, object or facial attribute corresponding to the image;
inputting the text to be classified into the trained discriminative model to obtain the sentiment, subject matter or technical field corresponding to the text;
inputting the speech to be classified into the trained discriminative model to obtain the user, age group or emotion corresponding to the speech.
In another possible implementation, training the discriminative model of the machine learning system on the probabilistic labels includes:
training the discriminative model on the probabilistic labels by minimizing the noise-aware variant of the discriminative model's loss function.
In another possible implementation, before labeling the data set with the labeling functions λ_j, j = 1, 2, ..., n, the method further includes:
filling the missing values in the data set; and/or
correcting the outliers in the data set.
The second aspect of the application provides a data classification device applied to a machine learning system, the device including:
an obtaining module for obtaining a data set to be labeled {x_i | i = 1, 2, ..., m};
a labeling module for labeling the data set with labeling functions λ_j, j = 1, 2, ..., n, to obtain the initial labels of the data set, Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n;
a construction module for computing the pairwise correlations of the labeling functions from the initial labels and constructing a generative model of the labeling functions from the pairwise correlations;
an estimation module for estimating the probabilistic labels of the data set with the generative model;
a training module for training the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model;
a classification module for inputting data to be classified into the trained discriminative model to obtain the category of the data to be classified.
In another possible implementation, the generative model is:

p_w(Λ, Y) = Z_w^{-1} exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )

where Λ is the initial-label matrix formed by the initial labels, Y is the true-label matrix, Z_w is a normalization constant, φ_i(Λ, y_i), i = 1, 2, ..., m, collects the pairwise-correlation factors of the labeling functions for each data point in the data set, and w ∈ R^{2n+|C|} is the parameter of the generative model to be estimated.
In another possible implementation, the pairwise correlation is:

φ^{Corr}_{i,j,k}(Λ, y_i) = 1{Λ_{i,j} = Λ_{i,k}}, (j, k) ∈ C

where 1{Λ_{i,j} = Λ_{i,k}} denotes the indicator that takes one value when the condition in braces holds and another when it does not.
In another possible implementation, the data set to be labeled is an image set and the data to be classified is an image to be classified; or
the data set to be labeled is a text set and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set and the data to be classified is speech to be classified.
In another possible implementation, inputting the data to be classified into the trained discriminative model to obtain its category includes:
inputting the image to be classified into the trained discriminative model to obtain the user, object or facial attribute corresponding to the image;
inputting the text to be classified into the trained discriminative model to obtain the sentiment, subject matter or technical field corresponding to the text;
inputting the speech to be classified into the trained discriminative model to obtain the user, age group or emotion corresponding to the speech.
In another possible implementation, training the discriminative model of the machine learning system on the probabilistic labels includes:
training the discriminative model on the probabilistic labels by minimizing the noise-aware variant of the discriminative model's loss function.
In another possible implementation, the device further includes:
a preprocessing module for filling the missing values in the data set and/or correcting the outliers in the data set.
The third aspect of the application provides a computer apparatus comprising a processor, the processor implementing the data classification method when executing a computer program stored in a memory.
The fourth aspect of the application provides a computer storage medium storing a computer program, the computer program implementing the data classification method when executed by a processor.
The present invention obtains a data set to be labeled {x_i | i = 1, 2, ..., m}; labels the data set with labeling functions λ_j, j = 1, 2, ..., n, to obtain its initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n; computes the pairwise correlations of the labeling functions from the initial labels and constructs a generative model of the labeling functions from the pairwise correlations; estimates probabilistic labels for the data set with the generative model; trains the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model; and inputs data to be classified into the trained model to obtain its category. The invention can quickly generate the training data required by the discriminative model of a machine learning system, addressing the difficulty, slowness and unreliable accuracy of manual annotation. It reduces the workload of manual annotation, improves the efficiency and accuracy of training-data labeling, allows a discriminative model to be trained quickly on that data, and achieves fast and accurate data classification with the trained model.
Brief description of the drawings
Fig. 1 is the flow chart of the data classification method provided by an embodiment of the present invention.
Fig. 2 is the structure chart of the data classification device provided by an embodiment of the present invention.
Fig. 3 is the schematic diagram of the computer apparatus provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, features and advantages of the present invention clearer, the invention is described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the application and the features within them may be combined with one another.
Numerous specific details are set forth in the following description to facilitate a full understanding of the invention. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art without creative effort, based on the embodiments of the present invention, fall within the scope of protection of the invention.
Unless otherwise defined, all technical and scientific terms used herein have the meaning commonly understood by those skilled in the technical field of the invention. The terms used in the specification are intended only to describe specific embodiments, not to limit the invention.
Preferably, the data classification method of the invention is applied in one or more computer apparatuses. A computer apparatus is a device capable of performing numerical computation and/or information processing automatically according to preset or stored instructions; its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs) and embedded devices.
The computer apparatus may be a desktop computer, a notebook, a palmtop computer, a cloud server or a similar computing device. It can interact with users through a keyboard, mouse, remote control, touch pad, voice-control device or other means.
Embodiment one
Fig. 1 is the flow chart of the data classification method provided by embodiment one of the present invention. The data classification method is applied to a computer apparatus.
The data classification method of the invention is applied to a machine learning system: it generates training data, trains the discriminative model of the machine learning system with that training data, and classifies data to be classified with the trained discriminative model. The method can quickly generate the training data required by the discriminative model, addressing the difficulty, slowness and unreliable accuracy of manual annotation; it reduces the workload of manual annotation, improves the efficiency and accuracy of training-data labeling, allows the discriminative model to be trained quickly, and achieves fast and accurate data classification with the trained model.
As shown in Figure 1, the data classification method includes:
Step 101: obtain a data set to be labeled {x_i | i = 1, 2, ..., m}.
The data set to be labeled contains multiple data points that need annotation. After each data point in the set is labeled, its label is obtained and used to train the discriminative model.
When the data classification method is applied to image classification, the data set to be labeled may be an image set. For example, the set may contain images of different users, to be labeled with the user each image corresponds to; images of different objects (such as pens, balls, books), to be labeled with the object in each image; or images with different facial attributes (such as race, sex, age, expression), to be labeled with the facial attribute of each image.
When the data classification method is applied to text classification, the data set to be labeled may be a text set. For example, it may contain texts with different sentiments, to be labeled with each text's sentiment; texts on different subjects, to be labeled with each text's subject matter; or texts from different technical fields (such as physics, chemistry, machinery), to be labeled with each text's technical field.
When the data classification method is applied to speech classification, the data set to be labeled may be a speech set. For example, it may contain the voices of multiple different users, to be labeled with the user of each utterance; voices of users in different age groups, to be labeled with each utterance's age group; or utterances with different emotions, to be labeled with each utterance's emotion.
The data set to be labeled may be built from data collected in real time. For example, images of people may be captured in real time and used as the data set to be labeled.
Alternatively, the data set to be labeled may be obtained from a preset data source, for example from a preset database in which a large amount of data, such as images, is stored in advance.
Alternatively, the data set to be labeled may be received from user input, for example multiple images entered by a user.
Step 102: label the data set {x_i | i = 1, 2, ..., m} with labeling functions λ_j, j = 1, 2, ..., n, to obtain the initial labels of the data set, Λ_{i,j} = λ_j(x_i), where i = 1, 2, ..., m and j = 1, 2, ..., n.
A labeling function expresses a mapping from data to labels: it receives a data point and outputs a label for it. A labeling function is a black box and can be written λ: X → Y ∪ {∅}, where X denotes the data, Y denotes the corresponding initial label, and ∅ denotes that the labeling function abstains.
Compared with manual annotation, labeling functions allow the initial labels to be generated from various weak-supervision sources (heuristics, external knowledge bases, and so on). For example, suppose an image contains two people, A and B, whose relationship is to be labeled, and it is known that A is the father of D and B is the mother of D. The heuristic "if A is the father of C and B is the mother of C, then A and B are a couple" then yields the annotation result (the initial label) that A and B are a couple.
Labeling functions are not required to be accurate; that is, the initial labels they produce are unreliable. Unreliability may take the form of incorrect labels, multiple labels, insufficient labels, partial labels, and so on.
Multiple labeling functions may be defined in advance as needed, for example six of them.
Different labeling functions are allowed to conflict on the same data point. For example, labeling function 1 may label a pair of people as a couple while labeling function 2 labels them as siblings.
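The labeling-function interface described above can be sketched in Python. This is an illustrative sketch only: the label values, heuristic rules and function names are hypothetical, not taken from the patent, and abstention (∅) is modeled as None.

```python
# Hypothetical labeling functions for the couple/sibling example above.
# ABSTAIN stands for the empty label (the function declines to vote).
ABSTAIN = None
COUPLE, SIBLING = 1, 2

def lf_shared_child(x):
    # Heuristic from the text: the father and the mother of the same child are a couple.
    if x.get("father_of") and x.get("father_of") == x.get("mother_of"):
        return COUPLE
    return ABSTAIN

def lf_same_surname(x):
    # A weaker, possibly conflicting heuristic: a shared surname suggests siblings.
    if x.get("same_surname"):
        return SIBLING
    return ABSTAIN

LABELING_FUNCTIONS = [lf_shared_child, lf_same_surname]

def label_matrix(data):
    # Lambda[i][j] = lambda_j(x_i): one row per data point, one column per function.
    return [[lf(x) for lf in LABELING_FUNCTIONS] for x in data]
```

Note that the two functions are allowed to disagree on the same data point; resolving such conflicts is precisely the job of the generative model built in the following steps.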
Step 103: compute the pairwise correlations of the labeling functions λ_j, j = 1, 2, ..., n, from the initial labels, and construct a generative model of the labeling functions from the pairwise correlations.
The pairwise correlation of the labeling functions λ_j, j = 1, 2, ..., n, refers to the dependencies between pairs of labeling functions. This method models the statistical dependencies between labeling functions to improve the quality of the estimated labels. For example, if two labeling functions express similar heuristics, the generative model can include this dependency and avoid "double counting". Pairwise correlations are the most common kind, so a set C of labeling-function pairs (j, k) is selected and modeled as correlated.
In this embodiment, the pairwise correlation of the labeling functions λ_j, j = 1, 2, ..., n, can be expressed as:

φ^{Corr}_{i,j,k}(Λ, y_i) = 1{Λ_{i,j} = Λ_{i,k}}, (j, k) ∈ C

where Λ is the initial-label matrix formed by the initial labels, Λ_{i,j} = λ_j(x_i); Y = (y_1, y_2, ..., y_m) is the true-label matrix; C is the set of labeling-function pairs (j, k); and 1{Λ_{i,j} = Λ_{i,k}} is the indicator of the condition in braces. In this embodiment, it equals 1 when the condition in braces holds and 0 when it does not.
The generative model of the labeling functions λ_j, j = 1, 2, ..., n, can then be constructed from the pairwise correlations.
The core operation of the method is to model and integrate the noisy signals provided by the set of labeling functions, modeling each labeling function as a noisy "voter" whose errors may be correlated with those of the other labeling functions.
For each data point x_i in the data set {x_i | i = 1, 2, ..., m}, the initial-label vector is Λ_i = (Λ_{i,1}, Λ_{i,2}, ..., Λ_{i,n}). In this embodiment, the generative model constructed from the pairwise correlations is:

p_w(Λ, Y) = Z_w^{-1} exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )

where Z_w is a normalization constant, φ_i(Λ, y_i) collects the pairwise-correlation factors of the labeling functions λ_j, j = 1, 2, ..., n, for each data point in the data set, and w ∈ R^{2n+|C|} is the parameter of the generative model to be estimated.
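As a rough illustration of the pairwise-correlation statistic, the indicator 1{Λ_{i,j} = Λ_{i,k}} can be tabulated empirically over the initial-label matrix. This is a simplified stand-in, not the patent's procedure: the generative model above learns correlation weights jointly with the rest of its parameters, while the sketch below merely measures how often two labeling functions agree, which is one plausible way to choose the correlated set C.

```python
from itertools import combinations

def pairwise_agreement(label_matrix, abstain=None):
    """Empirical rate of the indicator 1{Lambda_{i,j} == Lambda_{i,k}} for every
    labeling-function pair (j, k), counted over rows where neither abstains."""
    m, n = len(label_matrix), len(label_matrix[0])
    rates = {}
    for j, k in combinations(range(n), 2):
        votes = [(label_matrix[i][j], label_matrix[i][k]) for i in range(m)
                 if label_matrix[i][j] is not abstain
                 and label_matrix[i][k] is not abstain]
        if votes:
            rates[(j, k)] = sum(a == b for a, b in votes) / len(votes)
    return rates

def select_correlated(rates, threshold=0.8):
    # C: the set of labeling-function pairs modeled as dependent in the generative model.
    return {pair for pair, rate in rates.items() if rate >= threshold}
```

The threshold is an illustrative choice; any pair whose agreement is high enough is treated as expressing a similar heuristic and added to C.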
Step 104: estimate the probabilistic labels Ŷ of the data set {x_i | i = 1, 2, ..., m} with the generative model.
In this embodiment, the generative model is p_w(Λ, Y) as given above.
To learn the model without access to the true labels, the negative log marginal likelihood given the observed initial-label matrix Λ can be minimized:

ŵ = argmin_w − log Σ_Y p_w(Λ, Y)

The estimate ŵ of the generative model's parameter w can be obtained by interleaving stochastic gradient descent steps with Gibbs sampling to optimize this objective.
The estimate ŵ determines the generative model; once it is determined, inputting the initial-label matrix Λ into the generative model yields the probabilistic labels Ŷ of the data set {x_i | i = 1, 2, ..., m}.
Steps 103-104 use the generative model to denoise the initial labels Λ_{i,j} = λ_j(x_i), producing the probabilistic labels of the data set. The data set together with its probabilistic labels is the training data of the machine learning system.
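To make the denoising step concrete, the sketch below computes probabilistic labels with a closed-form naive-Bayes weighting of the votes. This is an illustrative simplification, not the patent's procedure: it assumes each labeling function's accuracy is already known, whereas the generative model above estimates its parameters by interleaved stochastic gradient descent and Gibbs sampling.

```python
import math

def estimate_probabilistic_labels(label_matrix, accuracies, classes=(1, 2), abstain=None):
    """Soft labels from noisy votes, treating each labeling function as an
    independent voter with known accuracy. accuracies[j] is the assumed
    accuracy of labeling function j; abstaining functions contribute nothing."""
    probs = []
    for row in label_matrix:
        log_scores = []
        for c in classes:
            log_p = 0.0
            for j, vote in enumerate(row):
                if vote is abstain:
                    continue
                a = accuracies[j]
                # Correct vote with prob a; wrong votes spread over the other classes.
                log_p += math.log(a if vote == c else (1 - a) / (len(classes) - 1))
            log_scores.append(log_p)
        weights = [math.exp(s) for s in log_scores]
        total = sum(weights)
        probs.append([w / total for w in weights])  # normalized: one distribution per row
    return probs
```

Each row of the output is a probability distribution over the classes, which is exactly the form of training signal the discriminative model consumes in the next step.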
Step 105: train the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model.
Training the discriminative model of the machine learning system on the probabilistic labels means using the data set with its probabilistic labels as training samples for the discriminative model.
The ultimate goal of training on the probabilistic labels Ŷ is a discriminative model that generalizes beyond the information expressed by the labeling functions. The discriminative model h_θ can be trained on the probabilistic labels Ŷ by minimizing the noise-aware variant of its loss function l(h_θ(x_i), y), i.e. the expected loss with respect to Ŷ:

θ̂ = argmin_θ (1/m) Σ_{i=1}^{m} E_{y∼Ŷ}[ l(h_θ(x_i), y) ]

During training, the parameters of the discriminative model are adjusted to minimize this noise-aware objective. Training may use the RMSprop algorithm, an improved stochastic gradient descent method that is well known in the art and is not described further here.
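A minimal noise-aware training loop might look as follows, assuming a binary task and a logistic-regression discriminative model; both assumptions are illustrative. Plain batch gradient descent stands in for the RMSprop optimizer mentioned above, and the learning rate and epoch count are arbitrary choices.

```python
import math

def train_noise_aware(xs, soft_labels, epochs=1000, lr=1.0):
    """Binary logistic regression trained on probabilistic labels by minimizing
    the expected (noise-aware) cross-entropy E_{y~p_i}[ l(h_theta(x_i), y) ].
    soft_labels[i] is P(y = 1 | x_i) as estimated by the generative model."""
    d = len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for x, p in zip(xs, soft_labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            h = 1.0 / (1.0 + math.exp(-z))
            err = h - p  # gradient of the expected cross-entropy w.r.t. z
            for k in range(d):
                gw[k] += err * x[k]
            gb += err
        w = [wi - lr * gwi / len(xs) for wi, gwi in zip(w, gw)]
        b -= lr * gb / len(xs)
    return w, b

def predict(w, b, x):
    # Hard class decision of the trained discriminative model.
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0
```

The only change from ordinary supervised training is that the hard target y_i is replaced by the soft probability p_i, so the model never needs a manually labeled training set.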
Step 106: input the data to be classified into the trained discriminative model to obtain the category of the data to be classified.
When the data classification method is applied to image classification, the image to be classified is input into the trained discriminative model to obtain its category, for example the user, object or facial attribute corresponding to the image.
When the data classification method is applied to text classification, the text to be classified is input into the trained discriminative model to obtain its category, for example its sentiment (positive or negative), its subject matter or its technical field.
When the data classification method is applied to speech classification, the speech to be classified is input into the trained discriminative model to obtain its category, for example the user, age group or emotion corresponding to the speech.
The data classification method of embodiment one obtains a data set to be labeled {x_i | i = 1, 2, ..., m}; labels it with labeling functions λ_j, j = 1, 2, ..., n, to obtain the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, ..., m, j = 1, 2, ..., n; computes the pairwise correlations of the labeling functions from the initial labels and constructs a generative model of the labeling functions from the pairwise correlations; estimates probabilistic labels for the data set with the generative model; trains the discriminative model of the machine learning system on the probabilistic labels to obtain a trained discriminative model; and inputs data to be classified into the trained model to obtain its category. This embodiment can quickly generate the training data required by the discriminative model of a machine learning system, addressing the difficulty, slowness and unreliable accuracy of manual annotation; it reduces the workload of manual annotation, improves the efficiency and accuracy of training-data labeling, allows the discriminative model to be trained quickly on that data, and achieves fast and accurate data classification with the trained model.
In another embodiment, before labeling the data set {x_i | i = 1, 2, ..., m} with the labeling functions λ_j, the method may further include preprocessing the data set.
Preprocessing the data set {x_i | i = 1, 2, ..., m} may include filling the missing values in the data set.
A K-nearest-neighbor algorithm can be used: determine the K data points closest to the one with the missing value (for example by Euclidean distance), and estimate the missing value as the weighted average of the K neighbors' values.
Alternatively, a prediction model can be used to predict the missing value: if the missing value is numeric, it can be filled with the mean; if it is non-numeric, it can be filled with the mode.
Alternatively, the missing value can be substituted by the mean. Because mean substitution rests on the assumption that values are missing completely at random, it shrinks the variance and standard deviation of the data; preferably, therefore, the fill value obtained by mean substitution is multiplied by a preset sampling factor greater than 1, and the product is taken as the final fill value.
Other methods can also be used to fill the missing values, for example regression fitting or interpolation.
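A minimal sketch of the K-nearest-neighbor filling strategy described above, assuming purely numeric rows with None marking a missing value; the neighbor count and the inverse-distance weighting are illustrative choices. Distances are computed over the observed coordinates only.

```python
import math

def fill_missing_knn(rows, k=2):
    """Fill None entries with the inverse-distance weighted average of the
    same feature in the K nearest complete rows (Euclidean distance over
    the coordinates that are observed in the incomplete row)."""
    complete = [r for r in rows if None not in r]
    filled = []
    for r in rows:
        if None not in r:
            filled.append(list(r))
            continue
        obs = [i for i, v in enumerate(r) if v is not None]
        nearest = sorted(
            (math.dist([r[i] for i in obs], [c[i] for i in obs]), c)
            for c in complete
        )[:k]
        new = list(r)
        for i, v in enumerate(new):
            if v is None:
                wsum = sum(1.0 / (d + 1e-9) for d, _ in nearest)
                new[i] = sum(c[i] / (d + 1e-9) for d, c in nearest) / wsum
        filled.append(new)
    return filled
```

The small constant added to each distance avoids division by zero when an identical complete row exists.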
Preprocessing the data set {x_i | i = 1, 2, ..., m} may also include correcting the outliers in the data set, an outlier being a value that deviates substantially from the rest of the data.
Outliers may be corrected with the same techniques used for filling missing values. For example, a K-nearest-neighbor algorithm can determine the K data points closest to the one with the outlier (for example by Euclidean distance) and estimate the correction as the weighted average of the K neighbors' values. Alternatively, a prediction model can be used to predict the correction: a numeric outlier can be corrected with the mean, a non-numeric one with the mode.
Alternatively, the outlier can be replaced by the mean. As with missing values, mean replacement rests on the completely-at-random assumption and shrinks the variance and standard deviation of the data; preferably, therefore, the correction obtained by mean replacement is multiplied by a preset sampling factor greater than 1, and the product is taken as the final correction.
Other methods can also be used to correct the outliers, for example regression fitting or interpolation.
The method used to correct outliers may also differ from the one used to fill missing values.
Pre-processing to the data set can also include that directly discarding has the data of missing values and/or has exceptional value Data.Directly the data for having missing values and/or the data for having exceptional value are abandoned, it is ensured that data set it is clean.
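Two of the pre-processing variants above can be sketched briefly: mean substitution with the preset sampling factor (greater than 1) that compensates for the variance shrinkage, and the simpler alternative of discarding incomplete rows. This is a hedged illustration; the factor value 1.05 and the helper names are assumptions, not values from the patent.

```python
import numpy as np

def mean_fill_scaled(column, factor=1.05):
    """Mean-substitute missing values, then multiply each filling value by a
    preset sampling factor (> 1) to offset the variance shrinkage that plain
    mean substitution causes under the missing-completely-at-random assumption."""
    col = column.astype(float).copy()
    miss = np.isnan(col)
    col[miss] = np.nanmean(column) * factor
    return col

def drop_incomplete(data):
    """Alternative pre-processing: directly discard rows with missing values."""
    return data[~np.isnan(data).any(axis=1)]

col = np.array([2.0, 4.0, np.nan, 6.0])
filled = mean_fill_scaled(col, factor=1.05)   # filling value ≈ 4.0 * 1.05 = 4.2
```

The scaled filling value sits slightly above the plain column mean, which is the intended correction for the shrunken spread.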
Embodiment 2
Fig. 2 is a structural diagram of a data classification device provided by Embodiment 2 of the present invention. The data classification device 20 is applied to a machine learning system, and is used to generate training data, train a discrimination model of the machine learning system with the training data, and classify data to be classified with the trained discrimination model. The data classification device 20 can quickly generate the training data required by the discrimination model of the machine learning system, solving the technical problems that manually annotated training data is hard to obtain, slow to annotate, and of unguaranteed accuracy; it reduces the workload of manual annotation, improves the annotation efficiency and accuracy of the training data, allows the discrimination model to be trained quickly on the training data, and achieves fast and accurate data classification with the discrimination model.
As shown in Fig. 2, the data classification device 20 may include an obtaining module 201, a labeling module 202, a building module 203, an estimating module 204, a training module 205, and a classification module 206.
Obtaining module 201, used to obtain a data set {x_i | i = 1, 2, …, m} to be annotated.

The data set to be annotated includes multiple data that need to be annotated. Annotating each data in the data set yields a label for that data, which is used to train the discrimination model.
In application scenarios where the data classification method is applied to image classification, the data set to be annotated may be an image set. For example, the data set to be annotated includes images of different users, and the user corresponding to each image is to be annotated. As another example, the data set to be annotated includes images of different objects (such as pens, balls, books), and the object corresponding to each image is to be annotated. As another example, the data set to be annotated includes images of different face attributes (such as race, sex, age, expression), and the face attribute corresponding to each image is to be annotated.

In application scenarios where the data classification method is applied to text classification, the data set to be annotated may be a text set. For example, the data set to be annotated includes texts of different sentiment tendencies, and the sentiment tendency corresponding to each text is to be annotated. As another example, the data set to be annotated includes texts of different subject matters, and the subject matter corresponding to each text is to be annotated. As another example, the data set to be annotated includes texts of different technical fields (such as physics, chemistry, machinery), and the technical field corresponding to each text is to be annotated.

In application scenarios where the data classification method is applied to speech classification, the data set to be annotated may be a speech set. For example, the data set to be annotated includes the speech of multiple different users, and the user corresponding to each utterance is to be annotated. As another example, the data set to be annotated includes the speech of users of multiple different age groups, and the age group corresponding to each utterance is to be annotated. As another example, the data set to be annotated includes speech of multiple different emotions, and the emotion corresponding to each utterance is to be annotated.
The data set to be annotated may be composed of data collected in real time. For example, images of people may be captured in real time, and each captured image taken as part of the data set to be annotated.

Alternatively, the data set to be annotated may be obtained from a preset data source. For example, data may be obtained from a preset database to constitute the data set to be annotated; a large amount of data, such as images, may be stored in advance in the preset database.

Alternatively, the data set to be annotated may be received as user input. For example, multiple images input by a user may be received, and the multiple images taken as the data set to be annotated.
Labeling module 202, used to annotate the data set {x_i | i = 1, 2, …, m} through labelling functions λ_j, obtaining the initial labels Λ_{i,j} = λ_j(x_i) of the data set {x_i | i = 1, 2, …, m}, where i = 1, 2, …, m and j = 1, 2, …, n.

A labelling function is a function expressing a mapping relationship between data and labels: it receives a data point and outputs a label for it. A labelling function is a black-box function, which can be expressed as λ: X → Y ∪ {∅}, where λ denotes the labelling function, X denotes the data, Y denotes the initial label corresponding to X, and ∅ denotes that the labelling function abstains.
Compared with manually annotating training data, labelling functions make it possible to generate the initial labels from various weak-supervision sources (such as heuristic information, external knowledge bases, and so on). For example, suppose an image contains two people, A and B, whose relationship is to be annotated, and it is known that A is the father of D and B is the mother of D. According to the heuristic "if A is the father of C and B is the mother of C, then A and B are a married couple", the annotation result (i.e. the initial label) is obtained that A and B are a married couple.

Labelling functions carry no accuracy requirement. That is, the initial labels obtained from labelling functions are unreliable. Unreliability may include incorrect annotation, multiple conflicting annotations, insufficient annotation, partial annotation, and so on.
Multiple labelling functions may be defined in advance as needed, for example 6 labelling functions.
Different labelling functions are allowed to conflict in their annotation results for the same data. For example, labelling function 1 annotates a data point as a married couple, while labelling function 2 annotates the same data point as siblings.
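The labelling functions described above can be sketched in code. This is a hedged illustration, not the patent's implementation: the dictionary fields (`father_of`, `mother_of`, `same_surname`) and the label encodings are hypothetical, and the second heuristic is deliberately written so the two functions can conflict on the same data point.

```python
# Each labelling function maps a data point to a label or abstains (None).
COUPLE, SIBLINGS, ABSTAIN = 1, 2, None

def lf_shared_child(x):
    """Heuristic: two people who are the father and the mother of the same
    child are annotated as a married couple."""
    if x.get("father_of") and x["father_of"] == x.get("mother_of"):
        return COUPLE
    return ABSTAIN

def lf_same_surname(x):
    """Heuristic: two people sharing a surname are annotated as siblings.
    This may conflict with lf_shared_child on the same data point."""
    return SIBLINGS if x.get("same_surname") else ABSTAIN

x = {"father_of": "D", "mother_of": "D", "same_surname": True}
labels = [lf(x) for lf in (lf_shared_child, lf_same_surname)]
# The two labelling functions conflict on this data point.
```

Conflicts like this are expected; resolving them is exactly the job of the generation model built in the following steps.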
Building module 203, used to calculate the pairwise correlations of the labelling functions λ_j, j = 1, 2, …, n according to the initial labels, and to build a generation model of the labelling functions λ_j, j = 1, 2, …, n according to the pairwise correlations.

The pairwise correlation of the labelling functions λ_j, j = 1, 2, …, n refers to the dependence between two labelling functions. The method models the statistical dependence between labelling functions in order to improve estimation performance. For example, if two labelling functions express similar heuristic information, this dependence can be included in the generation model to avoid the problem of double counting. Pairwise correlation is the most common form of dependence, so a set C of labelling-function pairs (j, k) is selected and modeled as correlated.
In this embodiment, the pairwise correlation of the labelling functions λ_j, j = 1, 2, …, n can be expressed as:

φ^Corr_{i,j,k}(Λ, Y) = 𝟙{Λ_{i,j} = Λ_{i,k}}, (j, k) ∈ C

where Λ is the initial label matrix constituted by the initial labels, Λ_{i,j} = λ_j(x_i); Y is the true label matrix, Y = (y_1, y_2, …, y_m); C is the set of labelling-function pairs (j, k); and 𝟙{Λ_{i,j} = Λ_{i,k}} denotes the value taken according to whether the condition within the braces { } holds. In this embodiment, the value is 1 when the condition within the braces holds, and 0 when it does not.
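The pairwise-correlation indicator can be computed directly from the initial label matrix. A minimal sketch, assuming labels are encoded as small integers with 0 for abstention (an encoding choice of this example, not of the patent); for simplicity, two simultaneous abstentions also count as agreement here.

```python
import numpy as np

# Initial label matrix Λ: m = 4 data points, n = 3 labelling functions
# (0 denotes abstention in this toy encoding).
L = np.array([[1, 1, 2],
              [2, 2, 0],
              [1, 2, 1],
              [1, 1, 1]])

def pairwise_agreement(L, j, k):
    """phi^Corr for the pair (j, k): the indicator 1{Λ_ij = Λ_ik}
    evaluated for every data point i (1 if the condition holds, else 0)."""
    return (L[:, j] == L[:, k]).astype(int)

agree_01 = pairwise_agreement(L, 0, 1)   # where labelling functions 0 and 1 agree
```

Summing such indicators over the selected pairs (j, k) ∈ C supplies the correlation factors that enter the generation model below.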
The generation model of the labelling functions λ_j, j = 1, 2, …, n can be built according to the pairwise correlations.

The core operation of this method is to model and integrate the noise signals provided by the set of labelling functions: each labelling function is modeled as a noisy "voter" that may make errors correlated with those of the other labelling functions.
For each data point x_i in the data set {x_i | i = 1, 2, …, m}, the initial label vector is Λ_i = (Λ_{i,1}, Λ_{i,2}, …, Λ_{i,n}). In this embodiment, the generation model built according to the pairwise correlations is:

p_w(Λ, Y) = Z_w^{-1} exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )

where Z_w is a normalizing constant, φ_i(Λ, y_i) collects, for each data point in the data set, the factors of the labelling functions λ_j, j = 1, 2, …, n, including the pairwise correlations, and w is the parameter of the generation model to be determined, w ∈ R^{2n+|C|}.
Estimating module 204, used to estimate the probability labels Ŷ of the data set {x_i | i = 1, 2, …, m} according to the generation model.
In this embodiment, the generation model is:

p_w(Λ, Y) = Z_w^{-1} exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )

To learn this model without access to the true labels, the negative log marginal likelihood can be minimized given the observed initial label matrix Λ:

ŵ = argmin_w − log Σ_Y p_w(Λ, Y)

The estimate ŵ of the parameter w of the generation model can be obtained by optimizing this objective with interleaved stochastic gradient descent steps and Gibbs sampling.

Once the estimate ŵ is determined, the generation model is determined; inputting the initial label matrix Λ into the generation model then yields the probability labels Ŷ = p_ŵ(Y | Λ) of the data set {x_i | i = 1, 2, …, m}.
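The posterior step above — turning the initial label matrix Λ into probability labels — can be illustrated under a simplifying assumption. This sketch omits the correlation factors and treats the labelling functions as conditionally independent voters, each with a learned accuracy `acc[j]` playing the role of a fitted parameter of the generation model; abstentions (0) contribute no evidence. It is a toy stand-in for the full model, not the patent's estimator.

```python
import numpy as np

def probability_labels(L, acc, classes=(1, 2)):
    """Posterior P(y | Λ_i) for each data point under a simplified generative
    model: labelling function j votes correctly with probability acc[j],
    abstentions (0) carry no evidence, correlation factors are omitted."""
    m, n = L.shape
    probs = np.zeros((m, len(classes)))
    for ci, y in enumerate(classes):
        p = np.ones(m)
        for j in range(n):
            voted = L[:, j] != 0
            agree = L[:, j] == y
            # Likelihood of each labelling function's vote given true label y.
            p = np.where(voted & agree, p * acc[j],
                np.where(voted, p * (1 - acc[j]), p))
        probs[:, ci] = p
    return probs / probs.sum(axis=1, keepdims=True)   # normalize per data point

L = np.array([[1, 1, 2],    # two accurate LFs say class 1, a weak LF says 2
              [2, 2, 0]])   # two LFs say class 2, the third abstains
Y_hat = probability_labels(L, acc=[0.8, 0.7, 0.6])
```

Each row of `Y_hat` sums to 1; the more accurate labelling functions dominate the posterior, which is exactly the denoising effect the generation model is built to provide.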
In effect, the building module 203 and the estimating module 204 use the generation model to denoise the initial labels Λ_{i,j} = λ_j(x_i), obtaining the probability labels of the data set {x_i | i = 1, 2, …, m}. The data set together with its probability labels constitutes the obtained training data of the machine learning system.
Training module 205, used to train the discrimination model of the machine learning system according to the probability labels, obtaining the trained discrimination model.

Training the discrimination model of the machine learning system according to the probability labels means taking the data set with the probability labels as training samples for the discrimination model.

When training the discrimination model according to the probability labels Ŷ, the final goal is to train a discrimination model that goes beyond the information expressed by the labelling functions. The discrimination model h_θ can be trained on the probability labels Ŷ by minimizing the noise-aware variant of its loss function l(h_θ(x_i), y), i.e. the loss expected with respect to Ŷ:

θ̂ = argmin_θ (1/m) Σ_{i=1}^{m} E_{y∼Ŷ}[ l(h_θ(x_i), y) ]
When training the discrimination model, the parameters of the discrimination model are adjusted to minimize the noise-aware variant of the loss. The training process may use the RMSprop algorithm, an improved stochastic gradient descent algorithm. RMSprop is well known in the art and is not described further here.
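Noise-aware training on probability labels can be sketched as follows. This is a minimal illustration under stated assumptions — a binary logistic-regression discriminator, toy features, and hand-picked hyper-parameters, none of which come from the patent — minimizing the expected cross-entropy with respect to the probability labels, with a hand-rolled RMSprop update as the optimizer.

```python
import numpy as np

def train_noise_aware(X, Y_hat, epochs=300, lr=0.05, beta=0.9, eps=1e-8):
    """Train a logistic-regression discriminator h_theta on probability labels
    Y_hat in [0, 1] by minimizing the expected cross-entropy
    E_{y~Y_hat}[l(h_theta(x), y)], using an RMSprop update rule."""
    w = np.zeros(X.shape[1])
    s = np.zeros_like(w)                     # running mean of squared gradients
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))     # model probabilities
        grad = X.T @ (p - Y_hat) / len(X)    # gradient of the expected loss
        s = beta * s + (1 - beta) * grad ** 2
        w -= lr * grad / (np.sqrt(s) + eps)  # RMSprop step
    return w

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
Y_hat = np.array([0.9, 0.85, 0.15, 0.1])     # probability of the positive class
w = train_noise_aware(X, Y_hat)
preds = 1.0 / (1.0 + np.exp(-X @ w))
```

Because the targets are probabilities rather than hard labels, the gradient `p - Y_hat` pulls the discriminator toward the soft labels, so confidently labelled points influence training more than uncertain ones.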
Classification module 206, used to input data to be classified into the trained discrimination model, obtaining the category of the data to be classified.
In application scenarios where the data classification method is applied to image classification, an image to be classified is input into the trained discrimination model to obtain the category of the image to be classified. For example, the user corresponding to the image is obtained; as another example, the object corresponding to the image; as another example, the face attribute corresponding to the image.

In application scenarios where the data classification method is applied to text classification, a text to be classified is input into the trained discrimination model to obtain the category of the text to be classified. For example, the sentiment tendency of the text (such as positive or negative sentiment) is obtained; as another example, the subject matter corresponding to the text; as another example, the technical field of the text.

In application scenarios where the data classification method is applied to speech classification, speech to be classified is input into the trained discrimination model to obtain the category of the speech to be classified. For example, the user corresponding to the speech is obtained; as another example, the age group corresponding to the speech; as another example, the emotion of the speech.
The data classification device 20 of Embodiment 2 obtains a data set {x_i | i = 1, 2, …, m} to be annotated; annotates the data set through labelling functions λ_j, j = 1, 2, …, n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n, of the data set; calculates the pairwise correlations of the labelling functions according to the initial labels, and builds a generation model of the labelling functions according to the pairwise correlations; estimates the probability labels of the data set according to the generation model; trains the discrimination model of the machine learning system according to the probability labels, obtaining the trained discrimination model; and inputs data to be classified into the trained discrimination model, obtaining the category of the data to be classified. This embodiment can quickly generate the training data required by the discrimination model of the machine learning system, solving the technical problems that manually annotated training data is hard to obtain, slow to annotate, and of unguaranteed accuracy; it reduces the workload of manual annotation, improves the annotation efficiency and accuracy of the training data, allows the discrimination model to be trained quickly on the training data, and achieves fast and accurate data classification with the discrimination model.
In another embodiment, the data classification device 20 may also include a pre-processing module, used to pre-process the data set {x_i | i = 1, 2, …, m}.

Pre-processing the data set {x_i | i = 1, 2, …, m} may include filling missing values in the data set {x_i | i = 1, 2, …, m}.
A K-nearest-neighbour algorithm may be used: determine the K data points nearest to the data having the missing value (for example, the K nearest data points may be determined according to Euclidean distance), and estimate the missing value as a weighted average of the values of those K data points.

Alternatively, the missing value may be predicted with a prediction model: if the missing value is of a numeric type, it may be filled with the mean; if it is of a non-numeric type, it may be filled with the mode.

Alternatively, the missing value may be substituted by the averaging method. Preferably, because substitution by the averaging method rests on the assumption that values are missing completely at random, it shrinks the variance and standard deviation of the data. The method may therefore further include multiplying the filling value obtained by mean substitution by a preset sampling factor, and taking the resulting value as the final filling value. The preset sampling factor is a sampling factor set in advance, and the sampling factor is greater than 1.

The missing value may also be filled by other methods, for example by regression fitting or by interpolation.
Pre-processing the data set {x_i | i = 1, 2, …, m} may also include correcting outliers in the data set {x_i | i = 1, 2, …, m}. An outlier is a value that deviates markedly from the other data.

The method of correcting an outlier may be the same as the method of filling a missing value. For example, a K-nearest-neighbour algorithm may be used: determine the K data points nearest to the data having the outlier (for example, according to Euclidean distance), and estimate the correction value of the outlier as a weighted average of the values of those K data points. Alternatively, a prediction model may be used to predict the correction value: if the outlier is of a numeric type, it may be corrected with the mean; if it is of a non-numeric type, it may be corrected with the mode.

Alternatively, the outlier may be replaced by the averaging method. Preferably, because replacement by the averaging method rests on the assumption that values are missing completely at random, it shrinks the variance and standard deviation of the data. The method may therefore further include multiplying the correction value obtained by mean replacement by a preset sampling factor, and taking the resulting value as the final correction value. The preset sampling factor is a sampling factor set in advance, and the sampling factor is greater than 1.

The outlier may also be corrected by other methods, for example by regression fitting or by interpolation.

The method of correcting outliers may also differ from the method of filling missing values.

Pre-processing the data set may also include directly discarding data having missing values and/or data having outliers. Directly discarding such data ensures the cleanliness of the data set.
Embodiment 3
This embodiment provides a computer storage medium in which a computer program is stored. When the computer program is executed by a processor, it implements the steps in the above data classification method embodiment, such as steps 101-106 shown in Fig. 1:
Step 101, obtain a data set {x_i | i = 1, 2, …, m} to be annotated;

Step 102, annotate the data set through labelling functions λ_j, j = 1, 2, …, n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n, of the data set;

Step 103, calculate the pairwise correlations of the labelling functions according to the initial labels, and build a generation model of the labelling functions according to the pairwise correlations;

Step 104, estimate the probability labels of the data set according to the generation model;

Step 105, train the discrimination model of the machine learning system according to the probability labels, obtaining the trained discrimination model;

Step 106, input data to be classified into the trained discrimination model, obtaining the category of the data to be classified.
Alternatively, when the computer program is executed by a processor, it implements the functions of the modules in the above device embodiment, such as modules 201-206 in Fig. 2:

Obtaining module 201, used to obtain a data set {x_i | i = 1, 2, …, m} to be annotated;

Labeling module 202, used to annotate the data set through labelling functions λ_j, j = 1, 2, …, n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n, of the data set;

Building module 203, used to calculate the pairwise correlations of the labelling functions according to the initial labels, and to build a generation model of the labelling functions according to the pairwise correlations;

Estimating module 204, used to estimate the probability labels of the data set according to the generation model;

Training module 205, used to train the discrimination model of the machine learning system according to the probability labels, obtaining the trained discrimination model;

Classification module 206, used to input data to be classified into the trained discrimination model, obtaining the category of the data to be classified.
Embodiment 4
Fig. 3 is a schematic diagram of a computer device provided by Embodiment 4 of the present invention. The computer device 30 includes a memory 301, a processor 302, and a computer program 303, such as a data classification program, stored in the memory 301 and runnable on the processor 302. When the processor 302 executes the computer program 303, it implements the steps in the above data classification method embodiment, such as steps 101-106 shown in Fig. 1:
Step 101, obtain a data set {x_i | i = 1, 2, …, m} to be annotated;

Step 102, annotate the data set through labelling functions λ_j, j = 1, 2, …, n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n, of the data set;

Step 103, calculate the pairwise correlations of the labelling functions according to the initial labels, and build a generation model of the labelling functions according to the pairwise correlations;

Step 104, estimate the probability labels of the data set according to the generation model;

Step 105, train the discrimination model of the machine learning system according to the probability labels, obtaining the trained discrimination model;

Step 106, input data to be classified into the trained discrimination model, obtaining the category of the data to be classified.
Alternatively, when the computer program is executed by a processor, it implements the functions of the modules in the above device embodiment, such as modules 201-206 in Fig. 2:

Obtaining module 201, used to obtain a data set {x_i | i = 1, 2, …, m} to be annotated;

Labeling module 202, used to annotate the data set through labelling functions λ_j, j = 1, 2, …, n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n, of the data set;

Building module 203, used to calculate the pairwise correlations of the labelling functions according to the initial labels, and to build a generation model of the labelling functions according to the pairwise correlations;

Estimating module 204, used to estimate the probability labels of the data set according to the generation model;

Training module 205, used to train the discrimination model of the machine learning system according to the probability labels, obtaining the trained discrimination model;

Classification module 206, used to input data to be classified into the trained discrimination model, obtaining the category of the data to be classified.
Illustratively, the computer program 303 may be divided into one or more modules, which are stored in the memory 301 and executed by the processor 302 to complete this method. The one or more modules may be a series of computer program instruction segments capable of completing specific functions, the instruction segments being used to describe the execution process of the computer program 303 in the computer device 30. For example, the computer program 303 may be divided into the obtaining module 201, labeling module 202, building module 203, estimating module 204, training module 205, and classification module 206 in Fig. 2; for the specific functions of each module, refer to Embodiment 2.
The computer device 30 may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. Those skilled in the art will understand that the schematic Fig. 3 is only an example of the computer device 30 and does not constitute a limitation of the computer device 30; it may include more or fewer components than illustrated, combine certain components, or use different components. For example, the computer device 30 may also include input/output devices, network access devices, buses, and so on.
The processor 302 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor 302 may be any conventional processor. The processor 302 is the control center of the computer device 30, connecting the various parts of the entire computer device 30 through various interfaces and lines.
The memory 301 may be used to store the computer program 303; the processor 302 implements the various functions of the computer device 30 by running or executing the computer program or modules stored in the memory 301 and calling the data stored in the memory 301. The memory 301 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and the application programs required by at least one function (such as a sound playing function, an image playing function, etc.), and the data storage area may store data created according to the use of the computer device 30 (such as audio data, a phone book, etc.). In addition, the memory 301 may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, memory, plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, at least one disk storage device, flash device, or other volatile solid-state storage component.
If the integrated modules of the computer device 30 are implemented in the form of software function modules and sold or used as independent products, they may be stored in a computer storage medium. Based on this understanding, the present invention may implement all or part of the processes in the methods of the above embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer storage medium, and when executed by a processor, can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, certain intermediate forms, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content contained in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, according to legislation and patent practice, computer-readable media do not include electric carrier signals and telecommunication signals.
In the several embodiments provided by the present invention, it should be understood that the disclosed system, device, and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for instance, the division of the modules is only a division by logical function, and there may be other division manners in actual implementation.

The modules illustrated as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in each embodiment of the present invention may be integrated in one processing module, or each module may exist physically alone, or two or more modules may be integrated in one module. The above integrated module may be implemented in the form of hardware, or in the form of hardware plus software function modules.

The above integrated module implemented in the form of a software function module may be stored in a computer storage medium. The software function module is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute some of the steps of the methods of the embodiments of the present invention.
It is obvious to those skilled in the art that the invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential attributes. Therefore, from whichever point of view, the embodiments are to be regarded as illustrative and not restrictive, and the scope of the present invention is defined by the appended claims rather than by the above description; it is intended that all changes falling within the meaning and scope of equivalents of the claims be included in the present invention. Any reference sign in a claim should not be regarded as limiting the claim concerned. Furthermore, it is clear that the word "comprising" does not exclude other modules or steps, and the singular does not exclude the plural. Multiple modules or devices stated in a system claim may also be implemented by one module or device through software or hardware. Words such as "first" and "second" are used to indicate names and do not indicate any particular order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A data classification method applied to a machine learning system, characterized in that the method comprises:

obtaining a data set {x_i | i = 1, 2, …, m} to be annotated;

annotating the data set through labelling functions λ_j, j = 1, 2, …, n, obtaining the initial labels Λ_{i,j} = λ_j(x_i), i = 1, 2, …, m, j = 1, 2, …, n, of the data set;

calculating the pairwise correlations of the labelling functions according to the initial labels, and building a generation model of the labelling functions according to the pairwise correlations;

estimating the probability labels of the data set according to the generation model;

training a discrimination model of the machine learning system according to the probability labels, obtaining a trained discrimination model;

inputting data to be classified into the trained discrimination model, obtaining a category of the data to be classified.
2. The method according to claim 1, characterized in that the generation model is:

p_w(Λ, Y) = Z_w^{-1} exp( Σ_{i=1}^{m} w^T φ_i(Λ, y_i) )

wherein Λ denotes the initial label matrix constituted by the initial labels, Y denotes the true label matrix, Z_w is a normalizing constant, φ_i(Λ, y_i), i = 1, 2, …, m, are the pairwise correlations of the labelling functions for each data point in the data set, and w is the parameter of the generation model to be determined, w ∈ R^{2n+|C|}.
3. The method according to claim 2, characterized in that the pairwise correlations are defined in terms of indicator functions 1{·}, where 1{·} takes the value 1 when the condition in braces holds and 0 when it does not.
4. The method according to claim 1, characterized in that:
the data set to be labeled is an image set and the data to be classified is an image to be classified; or
the data set to be labeled is a text set and the data to be classified is a text to be classified; or
the data set to be labeled is a speech set and the data to be classified is speech to be classified.
5. The method according to claim 4, characterized in that inputting the data to be classified into the trained discrimination model to obtain the category of the data to be classified comprises:
inputting the image to be classified into the trained discrimination model to obtain a user, object, or face attribute corresponding to the image to be classified;
inputting the text to be classified into the trained discrimination model to obtain a sentiment tendency, subject matter, or technical field corresponding to the text to be classified; or
inputting the speech to be classified into the trained discrimination model to obtain a user, age group, or emotion corresponding to the speech to be classified.
6. The method according to claim 1, characterized in that training the discrimination model of the machine learning system with the probabilistic labels comprises:
training the discrimination model on the probabilistic labels by minimizing a noise-aware variant of the loss function of the discrimination model.
7. The method according to any one of claims 1-6, characterized in that, before labeling the data set with the labeling functions λ_j, j=1,2,…,n, the method further comprises:
filling missing values in the data set; and/or
correcting abnormal values in the data set.
8. A data classification device applied to a machine learning system, characterized in that the device comprises:
an obtaining module for obtaining a data set to be labeled, {x_i | i=1,2,…,m};
a labeling module for labeling the data set with labeling functions λ_j, j=1,2,…,n, to obtain initial labels of the data set, Λ_{i,j} = λ_j(x_i), i=1,2,…,m, j=1,2,…,n;
a construction module for calculating pairwise correlations of the labeling functions from the initial labels and constructing a generative model of the labeling functions from the pairwise correlations;
an estimation module for estimating probabilistic labels of the data set from the generative model;
a training module for training a discrimination model of the machine learning system with the probabilistic labels to obtain a trained discrimination model; and
a classification module for inputting data to be classified into the trained discrimination model to obtain a category of the data to be classified.
9. A computer installation, characterized in that the computer installation comprises a processor, the processor being configured to execute a computer program stored in a memory to implement the data classification method according to any one of claims 1-7.
10. A computer storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the data classification method according to any one of claims 1-7.
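The pipeline of claims 1 and 8 follows the data-programming style of weak supervision: labeling functions vote on each example, the votes form the label matrix Λ, and probabilistic labels are estimated from a model over the votes. The sketch below is illustrative only — the parity task, the three labeling functions, and the simple normalized-vote estimate are hypothetical stand-ins; a faithful implementation would fit the weights w of the generative model of claim 2 rather than count votes.

```python
import numpy as np

ABSTAIN = -1  # convention: a labeling function may decline to vote

# Toy labeling functions over integer "documents" (hypothetical example:
# classify numbers as even=0 / odd=1).
def lf_mod2(x):        # votes on every item by parity
    return x % 2

def lf_small_even(x):  # only votes on small even numbers
    return 0 if (x < 5 and x % 2 == 0) else ABSTAIN

def lf_noisy(x):       # a deliberately noisy voter
    return 1 if x % 3 == 0 else ABSTAIN

def label_matrix(X, lfs):
    """Step 2 of claim 1: Lambda[i, j] = lf_j(x_i), with -1 meaning abstain."""
    return np.array([[lf(x) for lf in lfs] for x in X])

def probabilistic_labels(L, n_classes=2):
    """Rough stand-in for steps 3-4 of claim 1: turn the label matrix into
    per-example class probabilities via normalized vote counts."""
    m = L.shape[0]
    probs = np.full((m, n_classes), 1.0 / n_classes)  # uniform if all abstain
    for i in range(m):
        votes = L[i][L[i] != ABSTAIN]
        if len(votes):
            counts = np.bincount(votes, minlength=n_classes)
            probs[i] = counts / counts.sum()
    return probs

X = np.arange(10)
L = label_matrix(X, [lf_mod2, lf_small_even, lf_noisy])
Y_prob = probabilistic_labels(L)
```

For instance, an item labeled 0 by two functions and 1 by one receives probabilities [2/3, 1/3], while an item on which every function abstains stays at the uniform prior; Y_prob then plays the role of the probabilistic labels used to train the discrimination model.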
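The noise-aware training of claim 6 can be sketched as minimizing the expected cross-entropy of the discrimination model under the probabilistic labels rather than under hard labels. Everything below is an assumed illustration — a two-class softmax-regression "discrimination model" on synthetic one-dimensional data with hand-made soft labels; the patent does not prescribe this particular model or data.

```python
import numpy as np

def noise_aware_loss(logits, y_prob):
    # Expected cross-entropy under probabilistic labels y_prob:
    #   L(w) = -(1/m) * sum_i sum_c y_prob[i, c] * log p_w(c | x_i)
    z = logits - logits.max(axis=1, keepdims=True)           # stable log-softmax
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -np.mean(np.sum(y_prob * log_p, axis=1))

def train_discriminator(X, y_prob, lr=0.5, steps=500):
    # Gradient descent on the noise-aware loss; the gradient has the same
    # form as ordinary softmax regression with y_prob replacing one-hot labels.
    m, d = X.shape
    W = np.zeros((d, y_prob.shape[1]))
    for _ in range(steps):
        z = X @ W
        z -= z.max(axis=1, keepdims=True)
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (p - y_prob) / m
    return W

# Synthetic data: one feature plus a bias column; soft labels follow a sigmoid.
x = np.linspace(-2, 2, 40)
X = np.column_stack([x, np.ones_like(x)])
p1 = 1.0 / (1.0 + np.exp(-2.0 * x))          # soft P(class 1 | x)
y_prob = np.column_stack([1.0 - p1, p1])

W = train_discriminator(X, y_prob)
preds = (X @ W).argmax(axis=1)
```

The trained model recovers the intended decision boundary (negative x maps to class 0, positive x to class 1) even though no example ever carried a hard label, which is the point of claims 5 and 6: the discrimination model is trained entirely from the generative model's probabilistic output.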
CN201910310574.1A 2019-04-17 2019-04-17 Data classification method, device, computer installation and storage medium Pending CN110196908A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910310574.1A CN110196908A (en) 2019-04-17 2019-04-17 Data classification method, device, computer installation and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910310574.1A CN110196908A (en) 2019-04-17 2019-04-17 Data classification method, device, computer installation and storage medium

Publications (1)

Publication Number Publication Date
CN110196908A true CN110196908A (en) 2019-09-03

Family

ID=67752025

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910310574.1A Pending CN110196908A (en) 2019-04-17 2019-04-17 Data classification method, device, computer installation and storage medium

Country Status (1)

Country Link
CN (1) CN110196908A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104463202A (en) * 2014-11-28 2015-03-25 苏州大学 Multi-class image semi-supervised classifying method and system
CN107220600A (en) * 2017-05-17 2017-09-29 清华大学深圳研究生院 A kind of Picture Generation Method and generation confrontation network based on deep learning
CN108229543A (en) * 2017-12-22 2018-06-29 中国科学院深圳先进技术研究院 Image classification design methods and device
US20190080164A1 (en) * 2017-09-14 2019-03-14 Chevron U.S.A. Inc. Classification of character strings using machine-learning


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAI FAN et al.: "Learning a generative classifier from label proportions", Neurocomputing, pages 47-55 *
蒋俊钊; 程良伦; 李全杰: "Multi-label classification algorithm for convolutional neural networks based on label correlation", Industrial Control Computer (工业控制计算机), no. 07, pages 108-109 *

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112488141B (en) * 2019-09-12 2023-04-07 中移(苏州)软件技术有限公司 Method and device for determining application range of Internet of things card and computer readable storage medium
CN112488141A (en) * 2019-09-12 2021-03-12 中移(苏州)软件技术有限公司 Method and device for determining application range of Internet of things card and computer readable storage medium
CN112529024A (en) * 2019-09-17 2021-03-19 株式会社理光 Sample data generation method and device and computer readable storage medium
CN110648325B (en) * 2019-09-29 2022-05-17 北部湾大学 Nixing pottery big data analysis genuine product checking system
CN110648325A (en) * 2019-09-29 2020-01-03 北部湾大学 Nixing pottery big data analysis genuine product checking system
CN112825144A (en) * 2019-11-20 2021-05-21 深圳云天励飞技术有限公司 Picture labeling method and device, electronic equipment and storage medium
CN112825144B (en) * 2019-11-20 2024-06-07 深圳云天励飞技术有限公司 Picture marking method and device, electronic equipment and storage medium
CN111291823A (en) * 2020-02-24 2020-06-16 腾讯科技(深圳)有限公司 Fusion method and device of classification models, electronic equipment and storage medium
CN111291823B (en) * 2020-02-24 2023-08-18 腾讯科技(深圳)有限公司 Fusion method and device of classification model, electronic equipment and storage medium
CN111582360A (en) * 2020-05-06 2020-08-25 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for labeling data
CN111582360B (en) * 2020-05-06 2023-08-15 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for labeling data
CN111860572B (en) * 2020-06-04 2024-01-26 北京百度网讯科技有限公司 Data set distillation method, device, electronic equipment and storage medium
CN111860572A (en) * 2020-06-04 2020-10-30 北京百度网讯科技有限公司 Data set distillation method, device, electronic equipment and storage medium
CN112102062A (en) * 2020-07-24 2020-12-18 北京淇瑀信息科技有限公司 Risk assessment method and device based on weak supervised learning and electronic equipment
CN112115267A (en) * 2020-09-28 2020-12-22 平安科技(深圳)有限公司 Training method, device and equipment of text classification model and storage medium
CN112115267B (en) * 2020-09-28 2023-07-07 平安科技(深圳)有限公司 Training method, device, equipment and storage medium of text classification model
CN111967541A (en) * 2020-10-21 2020-11-20 上海冰鉴信息科技有限公司 Data classification method and device based on multi-platform samples
CN111967541B (en) * 2020-10-21 2021-01-05 上海冰鉴信息科技有限公司 Data classification method and device based on multi-platform samples
CN112199502B (en) * 2020-10-26 2024-03-15 网易(杭州)网络有限公司 Verse generation method and device based on emotion, electronic equipment and storage medium
CN112199502A (en) * 2020-10-26 2021-01-08 网易(杭州)网络有限公司 Emotion-based poetry sentence generation method and device, electronic equipment and storage medium
CN112651447B (en) * 2020-12-29 2023-09-26 广东电网有限责任公司电力调度控制中心 Ontology-based resource classification labeling method and system
CN112651447A (en) * 2020-12-29 2021-04-13 广东电网有限责任公司电力调度控制中心 Resource classification labeling method and system based on ontology
CN112925958A (en) * 2021-02-05 2021-06-08 深圳力维智联技术有限公司 Multi-source heterogeneous data adaptation method, device, equipment and readable storage medium
WO2022174496A1 (en) * 2021-02-20 2022-08-25 平安科技(深圳)有限公司 Data annotation method and apparatus based on generative model, and device and storage medium
CN112860919A (en) * 2021-02-20 2021-05-28 平安科技(深圳)有限公司 Data labeling method, device and equipment based on generative model and storage medium
CN113590677A (en) * 2021-07-14 2021-11-02 上海淇玥信息技术有限公司 Data processing method and device and electronic equipment
CN113344916A (en) * 2021-07-21 2021-09-03 上海媒智科技有限公司 Method, system, terminal, medium and application for acquiring machine learning model capability
CN113761925B (en) * 2021-07-23 2022-10-28 中国科学院自动化研究所 Named entity identification method, device and equipment based on noise perception mechanism
CN113761925A (en) * 2021-07-23 2021-12-07 中国科学院自动化研究所 Named entity identification method, device and equipment based on noise perception mechanism
CN113836118B (en) * 2021-11-24 2022-03-08 亿海蓝(北京)数据技术股份公司 Ship static data supplementing method and device, electronic equipment and readable storage medium
CN113836118A (en) * 2021-11-24 2021-12-24 亿海蓝(北京)数据技术股份公司 Ship static data supplementing method and device, electronic equipment and readable storage medium
CN114064973B (en) * 2022-01-11 2022-05-03 人民网科技(北京)有限公司 Video news classification model establishing method, classification method, device and equipment
CN114064973A (en) * 2022-01-11 2022-02-18 人民网科技(北京)有限公司 Video news classification model establishing method, classification method, device and equipment

Similar Documents

Publication Publication Date Title
CN110196908A (en) Data classification method, device, computer installation and storage medium
CN113822494B (en) Risk prediction method, device, equipment and storage medium
CN110019770A (en) The method and apparatus of train classification models
CN111324696B (en) Entity extraction method, entity extraction model training method, device and equipment
WO2021073390A1 (en) Data screening method and apparatus, device and computer-readable storage medium
CN109599187A (en) A kind of online interrogation point examines method, server, terminal, equipment and medium
CN109902672A (en) Image labeling method and device, storage medium, computer equipment
CN110442859A (en) Method, device and equipment for generating labeled corpus and storage medium
CN110458600A (en) Portrait model training method, device, computer equipment and storage medium
CN107807958A (en) A kind of article list personalized recommendation method, electronic equipment and storage medium
WO2023029507A1 (en) Data analysis-based service distribution method and apparatus, device, and storage medium
CN110807086A (en) Text data labeling method and device, storage medium and electronic equipment
CA3149895A1 (en) Machine learning system for summarizing tax documents with non-structured portions
CN110008365A (en) A kind of image processing method, device, equipment and readable storage medium
JP7347179B2 (en) Methods, devices and computer programs for extracting web page content
CN112287656A (en) Text comparison method, device, equipment and storage medium
CN104077408B (en) Extensive across media data distributed semi content of supervision method for identifying and classifying and device
CN108735198A (en) Speech synthesis method and device based on medical condition data, and electronic equipment
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN112101488B (en) Training method and device for machine learning model and storage medium
CN113837307A (en) Data similarity calculation method and device, readable medium and electronic equipment
CN108629381A (en) Crowd's screening technique based on big data and terminal device
CN109657710B (en) Data screening method and device, server and storage medium
CN116978087A (en) Model updating method, device, equipment, storage medium and program product
WO2021135330A1 (en) Image sample selection method and related apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination