CN111178533B - Method and device for realizing automatic semi-supervised machine learning - Google Patents


Publication number: CN111178533B
Authority: CN (China)
Prior art keywords: data set, machine learning, coefficient, semi-supervised machine
Legal status: Active (an assumption, not a legal conclusion)
Application number: CN201811341910.0A
Other languages: Chinese (zh)
Other versions: CN111178533A (en)
Inventor
王海
李宇峰
涂威威
魏通
Current Assignee: 4Paradigm Beijing Technology Co Ltd
Original Assignee: 4Paradigm Beijing Technology Co Ltd
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN201811341910.0A
Publication of CN111178533A
Application granted
Publication of CN111178533B
Legal status: Active


Classifications

  • Information Retrieval; Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a device for implementing automatic semi-supervised machine learning, relates to the technical field of machine learning, and mainly aims to solve the problem of labor consumption in the existing semi-supervised machine learning process. The main technical scheme of the invention is as follows: acquiring a target data set; selecting an experience data set similar to the target data set, and determining the semi-supervised machine learning algorithm used on that experience data set as the semi-supervised machine learning algorithm for the target data set; performing model training and prediction on the target data set with the semi-supervised machine learning algorithm under each of a plurality of corresponding groups of hyperparameters to obtain a model and a prediction result for each group of hyperparameters, and selecting, according to those prediction results, the group of hyperparameters best suited to the target data set; and determining the model corresponding to the selected group of hyperparameters as the semi-supervised machine learning model of the target data set. The invention is used for semi-supervised machine learning on a target data set.

Description

Method and device for realizing automatic semi-supervised machine learning
Technical Field
The invention relates to the technical field of machine learning, and in particular to a method and a device for implementing automatic semi-supervised machine learning.
Background
With the continuous progress of technology, artificial intelligence has also developed. Machine learning is a natural product of artificial intelligence research reaching a certain stage; it aims to improve the performance of a system by using computation to exploit experience. In computer systems, "experience" usually exists in the form of "data", from which "models" can be generated by machine learning algorithms: by providing experience data to a machine learning algorithm, a model can be generated based on that data, and the model provides a corresponding judgment, i.e., a prediction, when faced with a new situation. Whether a machine learning model is being trained or a trained model is being used for prediction, the data must first be converted into machine learning samples that comprise various features.
Currently, three types of machine learning are commonly distinguished: supervised, unsupervised, and semi-supervised machine learning. Supervised machine learning requires model training with labeled data, which can be understood as data whose inputs and outputs are both known. Unsupervised machine learning trains a model by finding hidden relationships among unlabeled sample data. Semi-supervised machine learning trains a model with the aid of a large amount of unlabeled data and a small amount of labeled data. In many cases labeled data are difficult to obtain, so the amount of labeled data is far smaller than the amount of unlabeled data. For example, in web page classification, users actually label very few pages, so the internet as a whole contains a large number of unlabeled pages and only a small number of labeled ones. Similarly, in medical image classification, a large number of medical images can be collected from hospitals, but labeling them costs doctors substantial effort, so the task has many unlabeled images and few labeled ones. In such cases, training a model for subsequent classification and recognition by semi-supervised machine learning becomes the main approach in the field.
However, in tasks such as web page classification or medical image classification, model training based on semi-supervised machine learning often consumes substantial computing resources and also requires manual intervention, making automation difficult. The manual-intervention process depends heavily on the modeling operator's business experience and familiarity with the algorithms, and the resulting model performance is not necessarily good.
Disclosure of Invention
In view of the above problems, the present invention provides a method and apparatus for implementing automatic semi-supervised machine learning, whose main purpose is to automate semi-supervised machine learning and thereby solve the problem of manpower consumption in the existing semi-supervised machine learning process.
In order to achieve the above purpose, the present invention mainly provides the following technical solutions:
In one aspect, the present invention provides a method for implementing automatic semi-supervised machine learning, specifically comprising:
acquiring a target data set, wherein part of the sample data in the target data set is labeled;
selecting an experience data set similar to the target data set, and determining the semi-supervised machine learning algorithm used on that experience data set as the semi-supervised machine learning algorithm for the target data set;
performing model training and prediction on the target data set with the semi-supervised machine learning algorithm under each of a plurality of corresponding groups of hyperparameters to obtain a model and a prediction result for each group of hyperparameters, and selecting, according to the prediction result for each group, the group of hyperparameters best suited to the target data set;
determining the model corresponding to the selected group of hyperparameters as the semi-supervised machine learning model of the target data set.
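Taken together, these steps can be sketched as a small driver loop. Everything below — the function names, the squared-Euclidean similarity measure, and the `train_and_score` callback — is illustrative, not part of the patent text:

```python
# Hypothetical sketch of the claimed method; all names are illustrative.

def select_algorithm(target_features, experience_sets):
    """Pick the algorithm of the experience data set most similar
    to the target (smallest squared distance between feature vectors)."""
    best = min(experience_sets,
               key=lambda e: sum((a - b) ** 2
                                 for a, b in zip(e["features"], target_features)))
    return best["algorithm"]


def automl_semi_supervised(target_features, experience_sets,
                           hyperparam_grid, train_and_score):
    """Train once per hyperparameter group with the selected algorithm,
    then keep the group whose prediction result scores best."""
    algo = select_algorithm(target_features, experience_sets)
    results = [(hp, train_and_score(algo, hp)) for hp in hyperparam_grid]
    best_hp, _ = max(results, key=lambda r: r[1])
    return algo, best_hp
```

The `train_and_score` callback stands in for one round of model training and prediction; in a real system it would return the evaluation of the prediction result for one hyperparameter group.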
Optionally, the selecting of an experience data set similar to the target data set comprises:
acquiring a plurality of experience data sets;
extracting corresponding data set features from the target data set and from each of the plurality of experience data sets;
determining, from the plurality of experience data sets and based on the data set features, an experience data set similar to the target data set.
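A minimal sketch of this similarity test, assuming meta-feature vectors are compared by Euclidean distance after z-score normalization (the patent does not fix a distance measure, so both choices are assumptions):

```python
import numpy as np

def most_similar(target_meta, experience_metas):
    """Return the index of the experience data set whose meta-feature
    vector is closest (Euclidean, after z-score scaling) to the target."""
    all_metas = np.vstack([target_meta] + list(experience_metas))
    mu = all_metas.mean(axis=0)
    sigma = all_metas.std(axis=0) + 1e-12   # avoid division by zero
    z = (all_metas - mu) / sigma
    dists = np.linalg.norm(z[1:] - z[0], axis=1)
    return int(np.argmin(dists))
```

The normalization keeps meta-features with large numeric ranges (e.g., sample counts) from dominating those with small ranges (e.g., prior probabilities).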
Optionally, the data set features include conventional meta-features and meta-features based on unsupervised clustering;
the extracting of corresponding data set features from the target data set and the plurality of experience data sets comprises:
extracting conventional meta-features from the target data set, extracting corresponding meta-features based on unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combining the extracted conventional meta-features with the meta-features based on unsupervised clustering to obtain the data set features of the target data set;
for each experience data set of the plurality of experience data sets, extracting conventional meta-features from the experience data set, extracting corresponding meta-features based on unsupervised clustering from the experience data set according to the preset unsupervised clustering algorithm, and combining the extracted conventional meta-features with the meta-features based on unsupervised clustering to obtain the data set features of that experience data set.
Optionally, the conventional meta-features include any one or more of the following:
the number of samples, the log of the number of samples, the log of the feature dimension, the log of the data set dimensionality, the log of the inverse data set dimensionality, the minimum class prior probability, the maximum class prior probability, the average class prior probability, the standard deviation of the class prior probabilities, the minimum kurtosis coefficient, the maximum kurtosis coefficient, the average kurtosis coefficient, the standard deviation of the kurtosis coefficients, the minimum skewness coefficient, the maximum skewness coefficient, the average skewness coefficient, the standard deviation of the skewness coefficients, the number of PCA 95% principal components, the first principal component skewness coefficient, and the first principal component kurtosis coefficient;
wherein the data set dimensionality is the ratio of the feature dimension to the number of samples;
the inverse data set dimensionality is the reciprocal of the data set dimensionality;
the minimum class prior probability is the minimum of the values obtained by dividing the number of samples in each class by the total number of samples;
the maximum class prior probability is the maximum of the values obtained by dividing the number of samples in each class by the total number of samples;
the average class prior probability is the mean of the values obtained by dividing the number of samples in each class by the total number of samples;
the standard deviation of the class prior probabilities is computed over the values obtained by dividing the number of samples in each class by the total number of samples;
the kurtosis coefficient measures the shape of the tails of a data distribution relative to those of a normal distribution; the minimum kurtosis coefficient is the minimum of the kurtosis coefficients of all continuous features, the maximum kurtosis coefficient is the maximum of those coefficients, the average kurtosis coefficient is their mean, and the standard deviation of the kurtosis coefficients is their standard deviation;
the skewness coefficient measures the symmetry of a data distribution about its mean; the minimum skewness coefficient is the minimum of the skewness coefficients of all continuous features, the maximum skewness coefficient is the maximum of those coefficients, the average skewness coefficient is their mean, and the standard deviation of the skewness coefficients is their standard deviation;
the PCA meta-features characterize statistics of the principal components of a data set; the PCA 95% principal components are obtained by performing principal component analysis on the samples and retaining the d' principal components, taken in descending order of variance, that preserve 95% of the variance, where d is the feature dimension;
the first principal component skewness coefficient is the skewness coefficient of the largest principal component, and the first principal component kurtosis coefficient is the kurtosis coefficient of the largest principal component.
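Several of these conventional meta-features can be computed with numpy alone. The moment-based estimators used for skewness and kurtosis below are assumptions — the text does not fix exact formulas:

```python
import numpy as np

def class_prior_stats(labels):
    """Min/max/mean/std of the class prior probabilities
    (per-class sample count divided by total sample count)."""
    _, counts = np.unique(labels, return_counts=True)
    priors = counts / counts.sum()
    return priors.min(), priors.max(), priors.mean(), priors.std()

def skewness(x):
    """Third standardized moment (symmetry about the mean)."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return ((x - m) ** 3).mean() / s ** 3

def kurtosis(x):
    """Fourth standardized moment (tail shape; 3 for a normal)."""
    x = np.asarray(x, dtype=float)
    m, s = x.mean(), x.std()
    return ((x - m) ** 4).mean() / s ** 4

def pca_95_components(X):
    """Number d' of principal components, in descending order of
    variance, needed to preserve 95% of the total variance."""
    Xc = X - X.mean(axis=0)
    var = np.linalg.svd(Xc, compute_uv=False) ** 2
    ratio = np.cumsum(var) / var.sum()
    return int(np.searchsorted(ratio, 0.95) + 1)
```

Per-feature skewness and kurtosis would be aggregated (min/max/mean/std over all continuous features) to give the listed meta-features.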
Optionally, the meta-features based on unsupervised clustering include one or more of the following:
intra-class compactness;
inter-class separation;
the Davies-Bouldin index;
the Dunn index.
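The two named indices can be implemented directly from their textbook definitions. This numpy sketch is illustrative and assumes Euclidean distances:

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index: mean, over clusters, of the worst ratio of
    summed intra-cluster scatter to inter-centroid distance (lower is better)."""
    ks = np.unique(labels)
    cents = np.array([X[labels == k].mean(axis=0) for k in ks])
    scat = np.array([np.linalg.norm(X[labels == k] - c, axis=1).mean()
                     for k, c in zip(ks, cents)])
    db = 0.0
    for i in range(len(ks)):
        db += max((scat[i] + scat[j]) / np.linalg.norm(cents[i] - cents[j])
                  for j in range(len(ks)) if j != i)
    return db / len(ks)

def dunn(X, labels):
    """Dunn index: minimum inter-cluster distance divided by the
    largest cluster diameter (higher is better)."""
    ks = np.unique(labels)
    clusters = [X[labels == k] for k in ks]
    diam = max(np.linalg.norm(c[:, None] - c[None, :], axis=-1).max()
               for c in clusters)
    sep = min(np.linalg.norm(a[:, None] - b[None, :], axis=-1).min()
              for i, a in enumerate(clusters)
              for b in clusters[i + 1:])
    return sep / diam
```

Both indices summarize how compact and well separated the clusters produced by the preset unsupervised clustering algorithm are, which is what makes them usable as meta-features of a data set.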
Optionally, the preset unsupervised clustering algorithm comprises one or more unsupervised clustering algorithms;
when a plurality of unsupervised clustering algorithms are included, the corresponding meta-features based on unsupervised clustering are extracted from the data set according to each of the unsupervised clustering algorithms.
Optionally, the selecting, according to the prediction result for each group of hyperparameters, of the group of hyperparameters suited to the target data set from the plurality of groups comprises:
determining the classification margin of each prediction result according to the maximum margin criterion;
and selecting the group of hyperparameters corresponding to the prediction result with the largest classification margin.
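The max-margin selection can be sketched as follows. How the per-sample margin (top decision value minus runner-up) is aggregated over a prediction result is not fixed by the text, so the mean used here is an assumption:

```python
import numpy as np

def classification_margin(scores):
    """Mean per-sample margin for one model's prediction result.
    scores: (n_samples, n_classes) array of decision values."""
    part = np.sort(scores, axis=1)
    return float((part[:, -1] - part[:, -2]).mean())

def pick_hyperparams(candidates):
    """candidates: list of (hyperparams, scores) pairs, one per
    hyperparameter group; return the group with the largest margin."""
    return max(candidates, key=lambda c: classification_margin(c[1]))[0]
```

Intuitively, a model that pushes unlabeled samples far from the decision boundary (large margin) is preferred, which is why no labels are needed for this selection step.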
Optionally, the method further comprises:
determining a supervised machine learning model trained on the labeled data in the target data set as a reference supervised machine learning model for the target data set;
cross-validating, based on the labeled data in the target data set, both the semi-supervised machine learning model and the reference supervised machine learning model of the target data set, to obtain an evaluation value for each of the two models;
and determining, according to the evaluation values, one of the semi-supervised machine learning model and the reference supervised machine learning model as the final model suited to the target data set.
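This fallback comparison can be sketched as k-fold cross-validation on the labeled subset. The accuracy metric, the fold count, and the `fit_predict` callback signature are illustrative assumptions:

```python
import numpy as np

def cv_accuracy(fit_predict, X, y, k=5):
    """Mean k-fold accuracy of a model given as a callback
    fit_predict(X_train, y_train, X_valid) -> predicted labels."""
    idx = np.arange(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        pred = fit_predict(X[train], y[train], X[fold])
        accs.append(np.mean(pred == y[fold]))
    return float(np.mean(accs))

def choose_final_model(semi_fp, sup_fp, X_labeled, y_labeled):
    """Keep whichever of the two models cross-validates better
    on the labeled data (ties go to the semi-supervised model)."""
    if cv_accuracy(semi_fp, X_labeled, y_labeled) >= \
       cv_accuracy(sup_fp, X_labeled, y_labeled):
        return "semi-supervised"
    return "supervised"
```

The comparison guards against the known failure mode of semi-supervised learning: when the unlabeled data hurt rather than help, the supervised reference model wins and is kept instead.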
In another aspect, the present invention provides an automatic semi-supervised machine learning apparatus, specifically including:
an acquisition unit configured to acquire a target data set, part of the sample data in the target data set being labeled;
a first determining unit configured to select an experience data set similar to the target data set, and to determine the semi-supervised machine learning algorithm used on that experience data set as the semi-supervised machine learning algorithm for the target data set;
a selection unit configured to perform model training and prediction on the target data set with the semi-supervised machine learning algorithm under each of a plurality of corresponding groups of hyperparameters to obtain a model and a prediction result for each group of hyperparameters, and to select, according to the prediction result for each group, the group of hyperparameters best suited to the target data set;
a second determining unit configured to determine the model corresponding to the selected group of hyperparameters as the semi-supervised machine learning model of the target data set.
Optionally, the first determining unit includes:
an acquisition module for acquiring a plurality of experience data sets;
an extraction module configured to extract corresponding data set features from the target data set and from each of the plurality of experience data sets;
a determining module configured to determine, from the plurality of experience data sets and based on the data set features, an experience data set similar to the target data set.
Optionally, the data set features include conventional meta-features and meta-features based on unsupervised clustering;
the extraction module comprises:
a first extraction sub-module configured to extract conventional meta-features from the target data set, extract corresponding meta-features based on unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combine the extracted conventional meta-features with the meta-features based on unsupervised clustering to obtain the data set features of the target data set;
a second extraction sub-module configured to, for each experience data set of the plurality of experience data sets, extract conventional meta-features from the experience data set, extract corresponding meta-features based on unsupervised clustering from the experience data set according to the preset unsupervised clustering algorithm, and combine the extracted conventional meta-features with the meta-features based on unsupervised clustering to obtain the data set features of that experience data set.
Optionally, the conventional meta-features include any one or more of the following:
the number of samples, the log of the number of samples, the log of the feature dimension, the log of the data set dimensionality, the log of the inverse data set dimensionality, the minimum class prior probability, the maximum class prior probability, the average class prior probability, the standard deviation of the class prior probabilities, the minimum kurtosis coefficient, the maximum kurtosis coefficient, the average kurtosis coefficient, the standard deviation of the kurtosis coefficients, the minimum skewness coefficient, the maximum skewness coefficient, the average skewness coefficient, the standard deviation of the skewness coefficients, the number of PCA 95% principal components, the first principal component skewness coefficient, and the first principal component kurtosis coefficient;
wherein the data set dimensionality is the ratio of the feature dimension to the number of samples;
the inverse data set dimensionality is the reciprocal of the data set dimensionality;
the minimum class prior probability is the minimum of the values obtained by dividing the number of samples in each class by the total number of samples;
the maximum class prior probability is the maximum of the values obtained by dividing the number of samples in each class by the total number of samples;
the average class prior probability is the mean of the values obtained by dividing the number of samples in each class by the total number of samples;
the standard deviation of the class prior probabilities is computed over the values obtained by dividing the number of samples in each class by the total number of samples;
the kurtosis coefficient measures the shape of the tails of a data distribution relative to those of a normal distribution; the minimum kurtosis coefficient is the minimum of the kurtosis coefficients of all continuous features, the maximum kurtosis coefficient is the maximum of those coefficients, the average kurtosis coefficient is their mean, and the standard deviation of the kurtosis coefficients is their standard deviation;
the skewness coefficient measures the symmetry of a data distribution about its mean; the minimum skewness coefficient is the minimum of the skewness coefficients of all continuous features, the maximum skewness coefficient is the maximum of those coefficients, the average skewness coefficient is their mean, and the standard deviation of the skewness coefficients is their standard deviation;
the PCA meta-features characterize statistics of the principal components of a data set; the PCA 95% principal components are obtained by performing principal component analysis on the samples and retaining the d' principal components, taken in descending order of variance, that preserve 95% of the variance, where d is the feature dimension;
the first principal component skewness coefficient is the skewness coefficient of the largest principal component, and the first principal component kurtosis coefficient is the kurtosis coefficient of the largest principal component.
Optionally, the meta-features based on unsupervised clustering include one or more of the following:
intra-class compactness;
inter-class separation;
the Davies-Bouldin index;
the Dunn index.
Optionally, the preset unsupervised clustering algorithm comprises one or more unsupervised clustering algorithms;
and the extraction module is further configured to, when a plurality of unsupervised clustering algorithms are included, extract the corresponding meta-features based on unsupervised clustering from the data set according to each of the unsupervised clustering algorithms.
Optionally, the selection unit includes:
a determining module configured to determine the classification margin of each prediction result according to the maximum margin criterion;
and a selection module configured to select the group of hyperparameters corresponding to the prediction result with the largest classification margin.
Optionally, the apparatus further includes:
a third determining unit configured to determine a supervised machine learning model trained on the labeled data in the target data set as a reference supervised machine learning model for the target data set;
a verification unit configured to cross-validate, based on the labeled data in the target data set, both the semi-supervised machine learning model and the reference supervised machine learning model of the target data set, to obtain an evaluation value for each of the two models;
and a fourth determining unit configured to determine, according to the evaluation values, one of the semi-supervised machine learning model and the reference supervised machine learning model as the final model suited to the target data set.
In another aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by one or more computing devices, implements the automatic semi-supervised machine learning method of the first aspect.
In another aspect, the present invention provides a system comprising one or more computing devices and one or more storage devices, the one or more storage devices having a computer program recorded thereon which, when executed by the one or more computing devices, causes the one or more computing devices to implement the automatic semi-supervised machine learning method of the first aspect.
By means of the above technical scheme, the method and apparatus for automatic semi-supervised machine learning can acquire a target data set; select an experience data set similar to the target data set and determine the semi-supervised machine learning algorithm used on that experience data set as the semi-supervised machine learning algorithm for the target data set; perform model training and prediction on the target data set under each of a plurality of corresponding groups of hyperparameters to obtain a model and a prediction result for each group; select, according to those prediction results, the group of hyperparameters best suited to the target data set; and finally determine the model corresponding to the selected group as the semi-supervised machine learning model of the target data set, thereby achieving automatic semi-supervised machine learning. In contrast to existing processes, which require manual intervention during semi-supervised machine learning, the invention can determine the required semi-supervised machine learning algorithm from an experience data set similar to the target data set, determine a group of hyperparameters suited to the target data set from the plurality of groups corresponding to that algorithm, and determine the semi-supervised machine learning model of the target data set from the model corresponding to that group, thereby performing semi-supervised learning automatically and avoiding the manpower otherwise consumed by manual intervention in modeling and model selection.
The foregoing description is only an overview of the technical scheme of the present invention. To make the technical means of the present invention clearer, so that it can be implemented in accordance with this specification, and to make the above and other objects, features, and advantages of the present invention more readily apparent, preferred embodiments of the present invention are set forth below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 illustrates a flow chart of a method for implementing automatic semi-supervised machine learning according to an embodiment of the present invention;
FIG. 2 illustrates a flow chart of another method for implementing automatic semi-supervised machine learning according to an embodiment of the present invention;
FIG. 3 shows a diagram of an exemplary process for determining optimal hyperparameters based on the maximum margin criterion according to an embodiment of the present invention;
FIG. 4 shows a block diagram of one implementation of an automatic semi-supervised machine learning apparatus according to an embodiment of the present invention;
FIG. 5 shows a block diagram of another implementation of an automatic semi-supervised machine learning apparatus according to an embodiment of the present invention.
Detailed Description
With the advent of massive data, artificial intelligence technology has developed rapidly. To mine value from massive data, the relevant personnel must not only be well versed in artificial intelligence technology (especially machine learning technology) but also be very familiar with the specific scenarios in which machine learning technology is applied (e.g., image processing, speech processing, automatic control, financial services, internet advertising). If the relevant personnel have insufficient knowledge of the business, or insufficient modeling experience, poor modeling results are likely. At present this can be mitigated in two ways: first, by lowering the threshold of machine learning so that machine learning algorithms are easy to get started with; and second, by improving model precision so that algorithms are broadly applicable and produce better results. These two aspects are not contradictory; for example, the improved algorithm effectiveness of the second aspect can support the first. In addition, when a neural network model is to be used to make a corresponding target prediction, the relevant personnel must be familiar not only with various complicated technical details of neural networks but also with the business logic behind the data related to the prediction target. For example, to use a machine learning model to identify criminal suspects, the relevant personnel must also understand which characteristics a suspect is likely to have; to use a machine learning model to detect fraudulent transactions in the financial industry, they must also understand the industry's transaction habits and a corresponding series of expert rules.
This variety of requirements makes the practical application of machine learning technology considerably difficult.
For this reason, technical personnel hope to solve the above problems by technical means: reducing the threshold for model training and application while effectively improving model performance. Many technical problems arise in this process. For example, to obtain a practical and effective model, one must contend not only with non-ideal training data (e.g., a lack of training data, sparse training data, or a distribution difference between training data and prediction data) but also with the computational efficiency of processing massive data. That is, it is practically impossible to rely on infinitely complex ideal models and perfect training data sets to carry out the machine learning process. As a data processing system or method for prediction purposes, any scheme for training a model, or for predicting with a trained model, is necessarily subject to objectively existing data limitations and computational resource limitations, and solves the above technical problems through specific data processing mechanisms in a computer. These mechanisms rely on the processing power, processing manner, and processing data of a computer, and are not purely mathematical or statistical calculations.
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The embodiment of the invention provides an automatic semi-supervised machine learning method, which can perform semi-supervised machine learning on a target data set automatically, without manual intervention, thereby solving the problem of manpower consumption caused by manual intervention in the existing semi-supervised machine learning process, as well as the heavy dependence on the business experience and expertise of the modeler. The specific steps of the method are shown in FIG. 1 and include:
101. A target data set is acquired.
Wherein part of the sample data in the target data set is labeled. In the embodiment of the present invention, the target data set may be any data set, such as a web page data set or a medical image data set. Labeled data may be understood as data whose classification result is known, and unlabeled data as data whose classification result is unknown.
In the embodiment of the present invention, the process of acquiring the target data set may be performed according to any existing acquisition mode, for example, an interface dedicated to inputting the target data set may be provided, and the target data set may be acquired through the interface.
102. One experience data set similar to the target data set is selected, and a semi-supervised machine learning algorithm used on the experience data set is determined as the semi-supervised machine learning algorithm of the target data set.
Here, an empirical data set refers to a data set on which semi-supervised learning has been performed and on which a better performing semi-supervised algorithm is known.
In practical applications, different data sets have different data distributions, yet many data sets are similar to one another in how their data is distributed. In this embodiment, based on the principle that similar data sets have the same preferences for learning algorithms, a semi-supervised algorithm that has been used on an empirical data set similar to the target data set is selected for the target data set.
Therefore, based on the similarity of the data sets and the characteristics of algorithm selection, in the process of performing semi-supervised machine learning according to the embodiment of the present invention, after the target data set is acquired in the foregoing step 101, an experience data set similar to the target data set may first be determined in this step. In an embodiment of the present invention, there are a plurality of empirical data sets, and the best-performing semi-supervised learning algorithm corresponding to each empirical data set is known. The semi-supervised machine learning algorithm corresponding to the empirical data set most similar to the target data set is then taken as the semi-supervised machine learning algorithm for the target data set.
103. And respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding multiple groups of super parameters to obtain a model and a prediction result corresponding to each group of super parameters, and selecting a group of super parameters suitable for the target data set from the multiple groups of super parameters according to the prediction result corresponding to each group of super parameters.
After the semi-supervised machine learning algorithm is selected, a set of hyper-parameters must still be chosen: each semi-supervised machine learning algorithm has multiple corresponding sets of hyper-parameters, and not every set yields a model applicable to the target dataset. Therefore, in this step, one set of hyper-parameters suitable for the target dataset needs to be selected from the multiple sets. In this selection process, model training and prediction are performed on the target dataset according to the determined semi-supervised learning algorithm and each of the multiple sets of hyper-parameters, yielding prediction result data corresponding to each set. For example, if the semi-supervised algorithm is A and the sets of hyper-parameters are B1, B2, ..., Bn, then training and prediction are performed on the target dataset with A+B1, with A+B2, ..., and with A+Bn. Since there are multiple sets of hyper-parameters, in this embodiment the set that best matches the target dataset according to the prediction results is taken as the set of hyper-parameters corresponding to the target dataset.
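The train-each-candidate loop described above can be sketched as follows. This is only an illustrative outline: `train_and_predict` and `score_prediction` are hypothetical callables standing in for the algorithm-specific training logic and for whatever scoring rule is used to rank prediction results; they are not part of the patent's disclosed interface.

```python
# Hypothetical sketch of step 103: run algorithm A with each candidate
# hyper-parameter set B1..Bn and keep the best-scoring (params, model) pair.
def select_hyper_parameters(algorithm, hyper_parameter_sets, dataset,
                            train_and_predict, score_prediction):
    """Return (best_params, best_model) over all candidate sets."""
    best = None
    for params in hyper_parameter_sets:
        # train a model and predict on the target dataset with this set
        model, prediction = train_and_predict(algorithm, params, dataset)
        score = score_prediction(prediction)  # e.g. a classification margin
        if best is None or score > best[0]:
            best = (score, params, model)
    _, best_params, best_model = best
    return best_params, best_model
```

The loop keeps every trained model, so the one matching the winning hyper-parameter set is available immediately for step 104 without retraining.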
104. A model corresponding to the selected set of hyper-parameters appropriate for the target dataset is determined as a semi-supervised machine learning model for the target dataset.
After a set of hyper-parameters suitable for the target dataset is determined based on step 103, note that the semi-supervised machine learning model corresponding to each set of hyper-parameters was also trained in step 103. Consequently, once the most suitable set of hyper-parameters is determined, the model corresponding to that set is in effect determined to be the model suitable for the target dataset. Thus, according to the method described in this step, the model corresponding to the selected set of hyper-parameters obtained in step 103 may be determined as the semi-supervised machine learning model corresponding to the target dataset.
In an embodiment of the present invention, the target data may be other types of image data, voice data, data for describing engineering control objects, data for describing users (or their behaviors), data for describing objects and/or events in various fields of administration, business, medical, supervision, finance, etc., in addition to web page data or medical image data.
Further, as a further refinement and extension of the foregoing embodiment, in an embodiment of the present invention, another method for implementing automatic semi-supervised machine learning is provided, specifically as shown in fig. 2, where the steps include:
201. A target dataset is acquired.
The target data set may be a data set of web page data or a data set of medical image data, which is not limited herein, and may be determined according to actual situations. Wherein a portion of the sample data in the target dataset has markers and the remainder is unlabeled data. In addition, in the embodiment of the present invention, the manner of obtaining the target data set, the marked data, the unmarked data, and the target data set is the same as that described in step 101 in the previous embodiment, and will not be described herein.
202. One experience data set similar to the target data set is selected, and a semi-supervised machine learning algorithm used on the experience data set is determined as the semi-supervised machine learning algorithm of the target data set.
In this step, when selecting one of the empirical data sets similar to the target data set, it is specifically performed as follows: first, a plurality of empirical data sets are acquired. Then, corresponding dataset features are extracted from the target dataset and the plurality of empirical datasets, respectively. The data set features are used for characterizing data distribution, data set structure and the like in the data set. Finally, an empirical data set similar to the target data set is determined from the plurality of empirical data sets based on the dataset characteristics. In this way, the experience data set similar to the target data set can be determined from the experience data sets according to the data set characteristics, the accuracy of the selected experience data set is ensured, the foundation is laid for the selection of the follow-up semi-supervised machine learning algorithm, and the accuracy of the whole automatic semi-supervised machine learning is ensured.
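The similarity lookup just described can be sketched minimally as follows, assuming each dataset has already been reduced to a numeric meta-feature vector (the extraction itself is described below). The standardization step and Euclidean distance are illustrative choices, not details fixed by the patent.

```python
import numpy as np

def most_similar_dataset(target_features, empirical_features):
    """empirical_features: dict mapping dataset name -> 1-D meta-feature vector.
    Returns the name of the empirical dataset closest to the target."""
    names = list(empirical_features)
    mat = np.array([empirical_features[n] for n in names], dtype=float)
    # standardize each meta-feature so no single scale dominates the distance
    mu, sigma = mat.mean(axis=0), mat.std(axis=0) + 1e-12
    target = (np.asarray(target_features, dtype=float) - mu) / sigma
    dists = np.linalg.norm((mat - mu) / sigma - target, axis=1)
    return names[int(np.argmin(dists))]
```

The selected name can then be used to look up the semi-supervised algorithm known to perform well on that empirical dataset.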
Further, the dataset features may include traditional meta-features as well as meta-features based on unsupervised clustering. Here, the specific way of extracting the corresponding dataset features from the target dataset and the plurality of experience datasets respectively may include: extracting traditional meta-features from the target data set, extracting corresponding meta-features based on the unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combining the extracted traditional meta-features and the meta-features based on the unsupervised clustering to obtain the data set features of the target data set. For each of the plurality of empirical data sets, extracting conventional meta-features from the empirical data set, extracting corresponding non-supervised cluster-based meta-features from the empirical data set according to the preset non-supervised clustering algorithm, and combining the extracted conventional meta-features and the non-supervised cluster-based meta-features to obtain dataset features for the empirical data set. Therefore, the feature of the data set is determined by combining the feature extracted by the unsupervised clustering algorithm and based on unsupervised clustering with the traditional feature, so that the accuracy of the feature of the data set of the target data set and the feature of the experience data set can be ensured, and further, the accuracy of the experience data set similar to the target data set is ensured.
It should be noted that, in the embodiment of the present invention, the selected preset unsupervised clustering algorithm may be one or more unsupervised clustering algorithms, for example, an unsupervised clustering algorithm such as spectral clustering, k-means clustering, hierarchical clustering, and the like. And when a plurality of unsupervised clustering algorithms are selected, in the step, corresponding unsupervised clustering-based meta-features can be extracted from the same dataset according to each unsupervised clustering algorithm, so that the extraction of the unsupervised clustering-based meta-features of the same dataset can be performed based on the plurality of unsupervised clustering algorithms, the distribution of the dataset can be more accurately represented by the unsupervised clustering-based meta-features, and the accuracy of the determined experience dataset similar to the target dataset can be further improved.
Furthermore, in an embodiment of the present invention, the conventional meta-features may include any one or more of the following: the number of samples, the logarithm of the number of samples, the logarithm of the feature dimension, the logarithm of the dataset dimension, the logarithm of the inverse dataset dimension, the minimum class prior probability, the maximum class prior probability, the average class prior probability, the class prior probability standard deviation, the minimum kurtosis coefficient, the maximum kurtosis coefficient, the average kurtosis coefficient, the kurtosis coefficient standard deviation, the minimum skewness coefficient, the maximum skewness coefficient, the average skewness coefficient, the skewness coefficient standard deviation, the PCA 95% principal component, the first principal component skewness coefficient, and the first principal component kurtosis coefficient.
In the above conventional meta-features, the dataset dimension is the ratio of the feature dimension to the number of samples; the inverse dataset dimension is the reciprocal of the dataset dimension; the minimum class prior probability is the minimum of the values obtained by dividing the number of samples of each class by the total number of samples; the maximum class prior probability is the maximum of those values; the average class prior probability is the average of those values; and the class prior probability standard deviation is the standard deviation of those values.
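The class-prior meta-features above reduce to a few statistics over the per-class sample fractions, as this small sketch shows (the dictionary key names are illustrative, not taken from the patent):

```python
from collections import Counter
import numpy as np

def class_prior_meta_features(labels):
    """Compute min/max/mean/std of per-class prior probabilities."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    priors = counts / counts.sum()  # number of samples per class / total
    return {"min_class_prior": priors.min(),
            "max_class_prior": priors.max(),
            "mean_class_prior": priors.mean(),
            "std_class_prior": priors.std()}
```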
In addition, in the embodiment of the present invention, the kurtosis coefficient is used to measure the shape of the tails of the data distribution of the data set relative to a normal distribution, where the kurtosis coefficient β may be defined as:

$$\beta = E\left[\left(\frac{X-\mu_X}{\sigma_X}\right)^{4}\right]$$

where μ_X is the mean of the continuous variable X and σ_X is its standard deviation.
And the minimum kurtosis coefficient is the minimum value of all continuous characteristic kurtosis coefficients; the maximum kurtosis coefficient is the maximum value of all continuous characteristic kurtosis coefficients; the average kurtosis coefficient is the average value of all continuous characteristic kurtosis coefficients; the standard deviation of the kurtosis coefficient is the standard deviation of the kurtosis coefficient of all continuous characteristics.
The skewness coefficient is used to measure the symmetry of the data distribution of the data set about its mean, where the skewness coefficient γ may be defined as:

$$\gamma = E\left[\left(\frac{X-\mu_X}{\sigma_X}\right)^{3}\right]$$

where μ_X is the mean of the continuous variable X and σ_X is its standard deviation.
The minimum skewness coefficient is the minimum of the skewness coefficients of all continuous features; the maximum skewness coefficient is their maximum; the average skewness coefficient is their average; and the skewness coefficient standard deviation is their standard deviation.
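The kurtosis and skewness meta-features follow directly from the two definitions above, applied per continuous feature and then aggregated. A numpy-only sketch (assuming no constant-valued features, so the standardization never divides by zero; key names are illustrative):

```python
import numpy as np

def moment_meta_features(X):
    """X: (n_samples, n_features) array of continuous features."""
    X = np.asarray(X, dtype=float)
    z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature
    k = (z ** 4).mean(axis=0)  # kurtosis beta = E[((X - mu)/sigma)^4]
    s = (z ** 3).mean(axis=0)  # skewness gamma = E[((X - mu)/sigma)^3]
    return {"min_kurtosis": k.min(), "max_kurtosis": k.max(),
            "mean_kurtosis": k.mean(), "std_kurtosis": k.std(),
            "min_skewness": s.min(), "max_skewness": s.max(),
            "mean_skewness": s.mean(), "std_skewness": s.std()}
```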
In addition, PCA meta-features can be understood as statistics characterizing the principal components of a dataset. The PCA 95% principal component in the embodiment of the invention is computed by performing principal component analysis on the samples, retaining principal components in descending order of variance until the smallest number d' of them preserves 95% of the variance of the original data, and taking the ratio d'/d, where d is the feature dimension. The first principal component skewness coefficient is the skewness coefficient of the largest principal component, and the first principal component kurtosis coefficient is the kurtosis coefficient of the largest principal component.
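The three PCA meta-features can be sketched with a from-scratch PCA via eigen-decomposition of the covariance matrix; this is an illustrative reading of the description above, not the patent's implementation, and it assumes more than one sample and a non-degenerate first component.

```python
import numpy as np

def pca_meta_features(X, variance_ratio=0.95):
    """Return PCA 95% component ratio and first-PC skewness/kurtosis."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigvals)[::-1]           # sort by variance, descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # smallest d' whose leading components keep >= 95% of total variance
    cum = np.cumsum(eigvals) / eigvals.sum()
    d_prime = int(np.searchsorted(cum, variance_ratio) + 1)
    pc1 = Xc @ eigvecs[:, 0]                    # projection on largest component
    z = (pc1 - pc1.mean()) / pc1.std()
    return {"pca_95": d_prime / X.shape[1],     # ratio d'/d
            "pc1_skewness": (z ** 3).mean(),
            "pc1_kurtosis": (z ** 4).mean()}
```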
Still further, in an embodiment of the present invention, the meta-features based on unsupervised clustering may include one or more of the following: intra-class compactness, inter-class separation, the Davies-Bouldin index, and the Dunn index.
The intra-class compactness (CP) is obtained by calculating, for each class, the average distance from each point in the class to the cluster center of that class, and then averaging over all classes:

$$\overline{CP_i} = \frac{1}{|\Omega_i|}\sum_{x \in \Omega_i}\|x - w_i\|,\qquad \overline{CP} = \frac{1}{k}\sum_{i=1}^{k}\overline{CP_i}$$

In the formulas above, Ω_i is the set of samples belonging to the i-th class, w_i is the cluster center of the i-th class, and CP̄_i is the average distance from all samples in class i to its cluster center. CP̄ is the average of CP̄_i over all classes, where k is the number of classes in the dataset. A lower CP̄ means the samples lie closer to their cluster centers.
The inter-class separation (inter-cluster separation, SP) is obtained by calculating the average distance between every pair of cluster centers:

$$\overline{SP} = \frac{2}{k(k-1)}\sum_{i=1}^{k}\sum_{j=i+1}^{k}\|w_i - w_j\|$$

In the formula above, w_i denotes the cluster center of the i-th class, ‖w_i − w_j‖ is the distance between the cluster centers of the i-th and j-th classes, and SP̄ is the average distance between all cluster centers in the dataset. A higher SP̄ means the clusters are farther apart.
The Davies-Bouldin index (DBI) is obtained by taking, for each class, the maximum over all other classes of the sum of the two classes' average intra-class distances divided by the distance between their cluster centers, and then averaging:

$$DB = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\frac{\overline{C_i} + \overline{C_j}}{\|w_i - w_j\|}$$

In the formula above, k is the number of classes in the dataset; C̄_i is the average distance from the data in class i to its cluster center, representing the degree of dispersion of that class; and ‖w_i − w_j‖ is the distance between the centers of the i-th and j-th clusters. The smaller the DB value, the tighter the interior of each cluster and the farther apart the different clusters are.
The Dunn index (Dunn Validity Index, DVI) is obtained by dividing the shortest distance between samples of any two different clusters by the largest distance between samples within any single cluster:

$$DVI = \frac{\min_{i \neq j}\ \min_{x \in \Omega_i,\, y \in \Omega_j}\|x - y\|}{\max_{m}\ \max_{x, y \in \Omega_m}\|x - y\|}$$

In the formula above, the numerator is the minimum inter-class distance, i.e. the shortest distance between any two clusters, and the denominator is the maximum intra-class distance, i.e. the largest distance within any class. According to the formula, the larger the DVI value, the tighter the interior of each class in the clustering result and the farther the different classes are separated, that is, the larger the inter-class distance and the smaller the intra-class distance.
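The four clustering meta-features just defined can be computed together from an already-clustered dataset, as in this numpy-only sketch. The cluster labels and centers are assumed to come from whichever preset unsupervised clustering algorithm is used (e.g. k-means); the function itself is an illustration of the formulas above, not the patent's code.

```python
import numpy as np

def clustering_meta_features(X, labels, centers):
    """Compactness, separation, Davies-Bouldin and Dunn index for a clustering.
    labels: cluster id per sample; centers: (k, d) array of cluster centers."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    centers = np.asarray(centers, dtype=float)
    k = centers.shape[0]
    clusters = [X[labels == i] for i in range(k)]
    # mean distance of each cluster's points to its own center
    cp = np.array([np.linalg.norm(c - centers[i], axis=1).mean()
                   for i, c in enumerate(clusters)])
    dist = np.array([[np.linalg.norm(centers[i] - centers[j])
                      for j in range(k)] for i in range(k)])
    sp = dist[np.triu_indices(k, 1)].mean()   # mean pairwise center distance
    dbi = np.mean([max((cp[i] + cp[j]) / dist[i, j]
                       for j in range(k) if j != i) for i in range(k)])
    # Dunn: shortest between-cluster sample distance / largest cluster diameter
    inter = min(np.linalg.norm(a - b)
                for i in range(k) for j in range(i + 1, k)
                for a in clusters[i] for b in clusters[j])
    diam = max(np.linalg.norm(a - b)
               for c in clusters for a in c for b in c)
    return {"compactness": cp.mean(), "separation": sp,
            "dbi": dbi, "dunn": inter / diam}
```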
203. And respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding multiple groups of super parameters to obtain a model and a prediction result corresponding to each group of super parameters, and selecting a group of super parameters suitable for the target data set from the multiple groups of super parameters according to the prediction result corresponding to each group of super parameters.
After the foregoing step 202 determines the semi-supervised machine learning algorithm corresponding to the target data set, each semi-supervised machine learning algorithm still has multiple corresponding sets of hyper-parameters, and not all of them are suitable for the target data set; if a wrong set of hyper-parameters is selected, the accuracy of the subsequently obtained semi-supervised machine learning model is likely to suffer. Thus, a set of hyper-parameters that fits the target data set needs to be determined from among the sets of hyper-parameters in this step.
In this step, a set of hyper-parameters suitable for the target data set may be selected according to the prediction results as follows: the classification margin of each prediction result is determined according to the maximum margin criterion, and the set of hyper-parameters corresponding to the prediction result with the largest classification margin is selected. The maximum margin criterion (LM) adopted in the embodiment of the present invention compares the results predicted for each data point in the dataset by the different models, under the assumption that the larger the classification margin, the better the corresponding model performs. In this embodiment, determining the hyper-parameters suitable for the target data set based on the maximum margin criterion omits the complicated processes of validation-set division and model evaluation, improving the efficiency of hyper-parameter selection. For example, as shown in fig. 3, prediction results can be obtained after model training and prediction with two sets of hyper-parameters. From the figure it can be seen that the margin between the positive and negative classes in the prediction result of the model obtained with hyper-parameter set 1 is larger than that of the model obtained with hyper-parameter set c; therefore, when the preset algorithm includes the two sets of hyper-parameters 1 and c, it can be determined that hyper-parameter set 1 has the better prediction effect, and hyper-parameter set 1 is the set of hyper-parameters suitable for the data set in this example.
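One hedged reading of this criterion is to score each candidate's prediction by the mean distance of its decision values from the decision boundary, and pick the candidate with the largest such margin. The scoring function below is an illustrative interpretation, not the patent's exact margin definition:

```python
import numpy as np

def classification_margin(decision_values):
    """Mean |score| relative to a decision boundary at 0: larger means the
    model separates the positive and negative classes more confidently."""
    return np.abs(np.asarray(decision_values, dtype=float)).mean()

def pick_by_margin(predictions_per_params):
    """predictions_per_params: dict hyper-param id -> decision values.
    Returns the id whose predictions have the largest classification margin."""
    return max(predictions_per_params,
               key=lambda p: classification_margin(predictions_per_params[p]))
```

Note that no labeled validation split is needed: the margin is computed on the predictions themselves, which is why this selection skips validation-set division.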
204. A model corresponding to the selected set of hyper-parameters appropriate for the target dataset is determined as a semi-supervised machine learning model for the target dataset.
Since the corresponding models were already trained in the process of determining a set of hyper-parameters suitable for the target dataset according to the method of step 203, once that set of hyper-parameters is determined, the model corresponding to it is in effect the semi-supervised machine learning model best suited to the target dataset. Accordingly, in this step, the model corresponding to the set of hyper-parameters determined in step 203 may be determined as the semi-supervised machine learning model of the target dataset.
205. A supervised machine learning model trained on the tagged data in the target dataset is determined as a reference supervised machine learning model for the target dataset.
A common problem in semi-supervised learning is performance degradation, i.e. the predictive performance of the model obtained after using unlabeled data is worse than that obtained with labeled data only. Therefore, in the embodiment of the present invention, since the target data set contains marked data, even though such data is limited, in order to further ensure the accuracy of the machine learning model corresponding to the target data set, a corresponding reference supervised machine learning model may also be trained using the marked data in the target data set according to the method described in this step. The training process may be based on the prior art and is not described in detail here.
206. And respectively performing cross verification on the semi-supervised machine learning model and the reference supervised machine learning model of the target data set based on the marked data in the target data set to respectively obtain evaluation values corresponding to the semi-supervised machine learning model and the reference supervised machine learning model.
After training the corresponding reference supervised machine learning model in step 205, in order to determine which of the semi-supervised machine learning model and the reference supervised machine learning model obtained in step 204 and step 205 is better suited for the target data set, the semi-supervised machine learning model and the reference supervised machine learning model of the target data set are respectively cross-validated according to the marked data in the target data set according to the method described in the step, and evaluation values corresponding to the two models are respectively obtained.
Specifically, in the embodiment of the invention, the two models can be verified by a K-fold cross verification method, and corresponding evaluation values are obtained respectively. For example, assuming 100 pieces of tagged data, when K is 2, the tagged data may be divided into two A, B sets of 50 pieces of data each. On the one hand, a reference supervised machine learning model is trained on the group A data, and the group B data is used for verification to obtain an evaluation value X1 of the reference supervised machine learning model. On the other hand, after the marks of the group B are removed, a selected semi-supervised learning algorithm and a group of super parameters are used for training a model on marked group A data and unmarked group B data, after model training is completed, a prediction result on unmarked group B data can be obtained, and then the prediction result is compared with the real marks of the group B data to obtain an evaluation value Y1. Then, after the A, B two sets of data are exchanged, the above-described process is repeated and the evaluation value X2 and the evaluation value Y2 are obtained, respectively. Wherein the sum of X1 and X2 is the evaluation value of the reference supervised machine learning model, and the sum of Y1 and Y2 is the evaluation value of the semi-supervised machine learning model.
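The K=2 procedure in the example above can be sketched as follows. The callables `train_supervised`, `train_semi_supervised`, and `accuracy` are hypothetical placeholders for the surrounding system's trainers and evaluator; the function only illustrates the fold-swapping and score-summing logic.

```python
# Illustrative sketch of steps 205-207 with K=2 cross-validation.
def compare_models(fold_a, fold_b, train_supervised, train_semi_supervised,
                   accuracy):
    """Each fold is (features, labels). Returns 'semi' or 'supervised'."""
    sup_score = semi_score = 0.0
    for train, test in ((fold_a, fold_b), (fold_b, fold_a)):
        # supervised baseline: fit on labeled train fold, score on test fold
        sup_model = train_supervised(*train)
        sup_score += accuracy(sup_model, *test)
        # semi-supervised: the test fold joins training with its labels removed
        semi_model = train_semi_supervised(train, unlabeled=test[0])
        semi_score += accuracy(semi_model, *test)
    # X1+X2 vs Y1+Y2 in the example above
    return "semi" if semi_score > sup_score else "supervised"
```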
207. And determining one of the semi-supervised machine learning model and the reference supervised machine learning model as a final model suitable for the target data set according to the evaluation value.
After the evaluation values of the reference supervised machine learning model and the semi-supervised machine learning model are obtained in step 206, the two evaluation values may be compared, and the model with the larger evaluation value is determined to be the final model suitable for the target data set.
For example, following the example of the foregoing step 206, when the sum of Y1 and Y2 is greater than the sum of X1 and X2, it may be determined that the prediction accuracy of the semi-supervised machine learning model is better than that of the reference supervised machine learning model, and the semi-supervised machine learning model is selected as the final model; otherwise, the reference supervised machine learning model is determined to be the final model.
In this way, by determining the supervised machine learning model trained on the marked data in the target data set as the reference supervised machine learning model, cross-validating both the semi-supervised machine learning model and the reference supervised machine learning model of the target data set on the marked data, obtaining an evaluation value for each, and determining one of the two models as the final model according to the evaluation values, it is ensured that when the accuracy of the determined semi-supervised machine learning model is low, the reference supervised machine learning model can be selected as the final model instead. This guarantees the actual accuracy of the machine learning result and overcomes the performance degradation problem that may occur in the semi-supervised machine learning process.
Further, as an implementation of the above-mentioned method for implementing automatic semi-supervised machine learning, the embodiment of the invention provides an apparatus for implementing automatic semi-supervised machine learning, which is mainly used for implementing a function of implementing automatic semi-supervised machine learning on a target data set, and solves a problem that human resources are consumed in a semi-supervised machine learning process due to manual intervention. For convenience of reading, the details of the foregoing method embodiment are not described one by one in the embodiment of the present apparatus, but it should be clear that the apparatus in this embodiment can correspondingly implement all the details of the foregoing method embodiment. The device is shown in fig. 4, and specifically comprises:
an acquisition unit 31 operable to acquire a target data set, a part of sample data in the target data set having a marker;
a first determining unit 32 operable to select one experience data set similar to the target data set, determine a semi-supervised machine learning algorithm used on the experience data set as the semi-supervised machine learning algorithm of the target data set acquired by the acquiring unit 31;
the selecting unit 33 may be configured to perform model training and prediction on the target data set according to the semi-supervised machine learning algorithm determined by the first determining unit 32 and the corresponding multiple sets of super parameters, to obtain a model and a prediction result corresponding to each set of super parameters, and select, from the multiple sets of super parameters, a set of super parameters suitable for the target data set according to the prediction result corresponding to each set of super parameters;
A second determining unit 34 for determining a model corresponding to the set of hyper-parameters suitable for the target data set selected by the selecting unit 33 as a semi-supervised machine learning model of the target data set.
Further, as shown in fig. 5, the first determining unit 32 includes:
an acquisition module 321, which may be used to acquire a plurality of experience data sets;
an extracting module 322, configured to extract corresponding dataset features from the target dataset and the plurality of experience datasets acquired by the acquiring module 321, respectively;
a determining module 323 is operable to determine an empirical data set similar to the target data set from the plurality of empirical data sets based on the dataset characteristics extracted by the extracting module 322.
Further, as shown in FIG. 5, the dataset features include traditional meta-features as well as meta-features based on unsupervised clustering;
the extraction module 322 includes:
a first extraction submodule 3221, configured to extract conventional meta-features from the target dataset, extract corresponding meta-features based on unsupervised clustering from the target dataset according to a preset unsupervised clustering algorithm, and combine the extracted conventional meta-features and the meta-features based on unsupervised clustering to obtain the dataset features of the target dataset;
a second extraction sub-module 3222, which may be configured to, for each of the plurality of experience datasets, extract conventional meta-features from the experience dataset, extract corresponding meta-features based on unsupervised clustering from the experience dataset according to the preset unsupervised clustering algorithm, and combine the extracted conventional meta-features and the meta-features based on unsupervised clustering to obtain the dataset features of the experience dataset.
Further, as shown in fig. 5, the conventional meta-features include any one or more of the following features:
the number of samples, the logarithm of the number of samples, the logarithm of the feature dimension, the logarithm of the dataset dimension, the logarithm of the inverse dataset dimension, the minimum class prior probability, the maximum class prior probability, the average class prior probability, the class prior probability standard deviation, the minimum kurtosis coefficient, the maximum kurtosis coefficient, the average kurtosis coefficient, the kurtosis coefficient standard deviation, the minimum skewness coefficient, the maximum skewness coefficient, the average skewness coefficient, the skewness coefficient standard deviation, the PCA 95% principal component, the first principal component skewness coefficient, and the first principal component kurtosis coefficient;
wherein the dataset dimension is a ratio of the feature dimension to the number of samples;
The inverse dataset dimension is the inverse of the dataset dimension;
the minimum class prior probability is the minimum value in the numerical value obtained by dividing the number of samples of each class by the total number of the samples;
the maximum class prior probability is the maximum value of the numerical values obtained by dividing the number of samples of each class by the total number of the samples;
the average class prior probability is obtained by dividing the number of samples of each class by the total number of samples;
the class prior probability standard deviation is calculated by a plurality of numerical values obtained by dividing the number of samples of each class by the total number of the samples;
the kurtosis coefficient is used to measure the shape of the tails of the data distribution of the data set relative to a normal distribution; the minimum kurtosis coefficient is the minimum of the kurtosis coefficients of all continuous features, the maximum kurtosis coefficient is their maximum, the average kurtosis coefficient is their average, and the kurtosis coefficient standard deviation is their standard deviation;
the bias coefficient is used for measuring the symmetry of the data distribution of the data set about the mean value of the data distribution, the minimum bias coefficient is the minimum value of all the continuous characteristic bias coefficients, the maximum bias coefficient is the maximum value of all the continuous characteristic bias coefficients, the average bias coefficient is the mean value of all the continuous characteristic bias coefficients, and the standard deviation of the bias coefficient is the standard deviation of all the continuous characteristic bias coefficients;
The PCA meta-features characterize statistics of the principal components of the dataset. The PCA 95% principal components value is obtained by performing principal component analysis on the samples and retaining, in decreasing order of variance, the d' principal components that together explain 95% of the variance, where d is the feature dimension;
the first principal component skewness coefficient is the skewness coefficient of the largest principal component of the PCA meta-features, and the first principal component kurtosis coefficient is the kurtosis coefficient of the largest principal component of the PCA meta-features.
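The traditional meta-features listed above can be sketched in code. The following is a minimal illustrative implementation assuming NumPy, SciPy and scikit-learn; the function name, dictionary keys, and the interpretation of the PCA 95% value as the ratio d'/d are assumptions, not the patent's actual implementation:

```python
import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.decomposition import PCA

def traditional_meta_features(X, y):
    """Compute traditional meta-features of a dataset.

    X: (n_samples, n_features) array of continuous features.
    y: integer class labels for the labeled samples.
    """
    n, d = X.shape
    priors = np.bincount(y) / n          # class prior probabilities
    kurt = kurtosis(X, axis=0)           # per-feature kurtosis coefficients
    skw = skew(X, axis=0)                # per-feature skewness coefficients
    pca = PCA(n_components=0.95).fit(X)  # keep components explaining 95% of variance
    first_pc = pca.transform(X)[:, 0]    # scores on the largest principal component
    return {
        "n_samples": n, "log_n_samples": float(np.log(n)),
        "dim": d, "log_dim": float(np.log(d)),
        "dataset_dim": d / n,            # ratio of feature dimension to sample number
        "inv_dataset_dim": n / d,        # its inverse
        "min_prior": priors.min(), "max_prior": priors.max(),
        "mean_prior": priors.mean(), "std_prior": priors.std(),
        "min_kurtosis": kurt.min(), "max_kurtosis": kurt.max(),
        "mean_kurtosis": kurt.mean(), "std_kurtosis": kurt.std(),
        "min_skew": skw.min(), "max_skew": skw.max(),
        "mean_skew": skw.mean(), "std_skew": skw.std(),
        "pca_95": pca.n_components_ / d,  # assumed: fraction d'/d of retained components
        "first_pc_skew": float(skew(first_pc)),
        "first_pc_kurtosis": float(kurtosis(first_pc)),
    }
```

Such meta-feature vectors can then be compared across datasets, for example by Euclidean distance, when searching for a similar experience data set.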
Further, as shown in fig. 5, the meta-features based on unsupervised clustering include one or more of the following:
intra-class compactness;
inter-class separation;
Davies-Bouldin index;
Dunn index.
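These clustering-quality meta-features could be computed, for example, by running k-means on the dataset and scoring the resulting partition. The sketch below assumes scikit-learn; the number of clusters k, the function name, and the exact definitions of compactness and separation are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, pairwise_distances

def clustering_meta_features(X, k=3, seed=0):
    """Cluster X with k-means, then score the partition."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    centroids = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    # intra-class compactness: mean distance of samples to their own centroid
    compactness = float(np.mean(
        [np.linalg.norm(X[labels == c] - centroids[c], axis=1).mean() for c in range(k)]))
    # inter-class separation: mean pairwise distance between centroids
    cd = pairwise_distances(centroids)
    separation = float(cd[np.triu_indices(k, 1)].mean())
    # Dunn index: minimum inter-cluster distance over maximum cluster diameter
    D = pairwise_distances(X)
    min_between = min(D[np.ix_(labels == a, labels == b)].min()
                      for a in range(k) for b in range(k) if a < b)
    max_diam = max(D[np.ix_(labels == c, labels == c)].max() for c in range(k))
    return {
        "compactness": compactness,
        "separation": separation,
        "davies_bouldin": float(davies_bouldin_score(X, labels)),
        "dunn": float(min_between / max_diam),
    }
```

A lower Davies-Bouldin index and a higher Dunn index both indicate a partition with compact, well-separated clusters.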
Further, as shown in fig. 5, the preset unsupervised clustering algorithm includes one or more unsupervised clustering algorithms;
the extracting module 322 may be further configured to extract, when multiple unsupervised clustering algorithms are included, corresponding meta-features based on unsupervised clustering from the dataset according to each unsupervised clustering algorithm, respectively.
Further, as shown in fig. 5, the selecting unit 33 includes:
the determining module 331 may be configured to determine a classification interval in each prediction result according to a maximum interval criterion;
The selecting module 332 may be configured to select, according to the plurality of classification intervals determined by the determining module 331, a set of super parameters corresponding to the prediction result with the largest classification interval.
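The maximum-interval selection performed by the determining and selecting modules can be illustrated as follows. This sketch assumes scikit-learn's LabelSpreading as the semi-supervised learner and measures the classification interval of each prediction result as the average gap between the two largest class probabilities on the unlabeled samples; both choices are illustrative assumptions, not the patent's prescribed algorithm:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def select_by_max_margin(X, y_partial, param_grid):
    """y_partial uses -1 for unlabeled samples (scikit-learn convention).

    For each hyperparameter set, train a model and score its prediction
    result by the mean classification interval on the unlabeled points;
    return the hyperparameter set and model with the largest interval.
    """
    best_params, best_margin, best_model = None, -np.inf, None
    unlabeled = y_partial == -1
    for params in param_grid:
        model = LabelSpreading(**params).fit(X, y_partial)
        proba = model.label_distributions_[unlabeled]
        top2 = np.sort(proba, axis=1)[:, -2:]          # two largest class probabilities
        margin = float((top2[:, 1] - top2[:, 0]).mean())
        if margin > best_margin:
            best_params, best_margin, best_model = params, margin, model
    return best_params, best_model
```

The hyperparameter set whose predictions are most confidently separated is kept, mirroring the maximum-interval criterion.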
Further, as shown in fig. 5, the apparatus further includes:
a third determining unit 35 operable to determine a supervised machine learning model trained on the marked data in the target dataset as a reference supervised machine learning model of the target dataset;
a verification unit 36, configured to cross-verify the semi-supervised machine learning model of the target data set determined by the second determination unit 34 and the reference supervised machine learning model determined by the third determination unit 35 based on the marked data in the target data set, to obtain evaluation values corresponding to the semi-supervised machine learning model and the reference supervised machine learning model, respectively;
a fourth determining unit 37 configured to determine, based on the evaluation value obtained by the verifying unit 36, one of the semi-supervised machine learning model and the reference supervised machine learning model as a final model suitable for the target data set.
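The cross-validation comparison performed by units 35-37 can be sketched as below: the marked data is split into folds; the reference supervised model trains on the labeled training fold only, while the semi-supervised model additionally sees all unlabeled samples, and whichever achieves the higher evaluation value on the held-out labeled folds is kept as the final model. Logistic regression and scikit-learn's SelfTrainingClassifier are stand-ins here, not the patent's prescribed models:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.semi_supervised import SelfTrainingClassifier

def choose_final_model(X, y_partial, cv=3, seed=0):
    """y_partial uses -1 for unlabeled samples."""
    labeled = np.flatnonzero(y_partial != -1)
    unlabeled = np.flatnonzero(y_partial == -1)
    skf = StratifiedKFold(n_splits=cv, shuffle=True, random_state=seed)
    ref_scores, semi_scores = [], []
    for tr, te in skf.split(X[labeled], y_partial[labeled]):
        tr_idx, te_idx = labeled[tr], labeled[te]
        # reference supervised model: labeled training fold only
        ref = LogisticRegression(max_iter=1000).fit(X[tr_idx], y_partial[tr_idx])
        ref_scores.append(float((ref.predict(X[te_idx]) == y_partial[te_idx]).mean()))
        # semi-supervised model: labeled training fold plus all unlabeled samples
        idx = np.concatenate([tr_idx, unlabeled])
        semi = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
        semi.fit(X[idx], y_partial[idx])
        semi_scores.append(float((semi.predict(X[te_idx]) == y_partial[te_idx]).mean()))
    # keep whichever model has the higher evaluation value, refit on all data
    if np.mean(semi_scores) >= np.mean(ref_scores):
        return SelfTrainingClassifier(LogisticRegression(max_iter=1000)).fit(X, y_partial)
    return LogisticRegression(max_iter=1000).fit(X[labeled], y_partial[labeled])
```

This guards against the known failure mode of semi-supervised learning: if the unlabeled data hurts rather than helps, the purely supervised reference model is returned instead.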
Further, an embodiment of the present invention further provides a computer readable storage medium, where the computer readable storage medium stores a computer program, where the computer program when executed by one or more computing devices implements the above-mentioned method for implementing automatic semi-supervised machine learning.
In addition, the embodiment of the invention further provides a system comprising one or more computing devices and one or more storage devices, wherein the one or more storage devices are recorded with a computer program, and the computer program is used for enabling the one or more computing devices to realize the automatic semi-supervised machine learning method when being executed by the one or more computing devices.
In summary, the method and device for implementing automatic semi-supervised machine learning provided by the embodiments of the present invention can acquire a target data set, select one experience data set similar to the target data set, and determine the semi-supervised machine learning algorithm used on that experience data set as the semi-supervised machine learning algorithm of the target data set. Model training and prediction are then performed on the target data set according to the semi-supervised machine learning algorithm and its corresponding multiple sets of hyper parameters, yielding a model and a prediction result for each set of hyper parameters; a set of hyper parameters suitable for the target data set is selected from the multiple sets according to these prediction results. Finally, the model corresponding to the selected set of hyper parameters is determined as the semi-supervised machine learning model of the target data set, thereby implementing automatic semi-supervised machine learning.
Compared with the existing semi-supervised machine learning process, which requires manual intervention, the method determines the required semi-supervised machine learning algorithm from the experience data set corresponding to the target data set, selects a set of hyper parameters suitable for the target data set from the multiple sets of hyper parameters corresponding to that algorithm, and determines the semi-supervised machine learning model of the target data set from the model corresponding to the selected set of hyper parameters. Semi-supervised learning is thus performed automatically, avoiding the labor cost caused by manual intervention in the modeling and selection processes.
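The algorithm-selection step of the workflow summarized above amounts to a nearest-neighbour lookup in meta-feature space; a toy sketch (the experience-record format, the algorithm names, and the use of Euclidean distance are assumptions):

```python
import numpy as np

def pick_algorithm(target_mf, experience):
    """experience: list of (meta_feature_vector, algorithm_name, param_grid)
    records for previously solved datasets. Pick the algorithm used on the
    most similar experience dataset (Euclidean distance in meta-feature
    space) and return it with its candidate hyperparameter grid, ready for
    the margin-based hyperparameter selection step.
    """
    dists = [np.linalg.norm(np.asarray(target_mf) - np.asarray(mf))
             for mf, _, _ in experience]
    _, algo, grid = experience[int(np.argmin(dists))]
    return algo, grid
```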
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
It will be appreciated that the relevant features of the methods and apparatus described above may be referenced to one another. In addition, "first", "second", and the like in the above embodiments are used only to distinguish the embodiments and do not indicate the relative merits of the embodiments.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. The required structure for a construction of such a system is apparent from the description above. In addition, the present invention is not directed to any particular programming language. It will be appreciated that the teachings of the present invention described herein may be implemented in a variety of programming languages, and the above description of specific languages is provided for disclosure of enablement and best mode of the present invention.
Furthermore, the memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read Only Memory (ROM) or flash RAM. Memory is an example of a computer-readable medium.
Computer readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer readable media do not include transitory computer readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (18)

1. A method for implementing automatic semi-supervised machine learning, applied to a business scenario for classification processing based on machine learning technology, wherein the method comprises the following steps:
obtaining a target data set, wherein part of sample data in the target data set is provided with marks, and the target data set is at least a webpage data set or a medical image data set;
Selecting one experience data set similar to the target data set, and determining a semi-supervised machine learning algorithm used on the experience data set as the semi-supervised machine learning algorithm of the target data set;
if the target data set is a web page data set, the semi-supervised machine learning algorithm used on the experience data set is the algorithm that achieved the best semi-supervised learning performance on an empirical data set similar to the web page data set; if the target data set is a medical image data set, the semi-supervised machine learning algorithm used on the empirical data set is the algorithm that achieved the best semi-supervised learning performance on an empirical data set similar to the medical image data set;
respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding multiple groups of super parameters to obtain a model and a prediction result corresponding to each group of super parameters, and selecting a group of super parameters suitable for the target data set from the multiple groups of super parameters according to the prediction result corresponding to each group of super parameters;
a model corresponding to the selected set of hyper-parameters appropriate for the target dataset is determined as a semi-supervised machine learning model for the target dataset, the semi-supervised machine learning model being adapted to at least process web page classification traffic scenarios or medical image classification traffic scenarios.
2. The method of claim 1, wherein the selecting one of the empirical data sets that is similar to the target data set comprises:
acquiring a plurality of experience data sets;
extracting corresponding dataset features from the target dataset and the plurality of empirical datasets, respectively;
an empirical data set similar to the target data set is determined from the plurality of empirical data sets based on the dataset characteristics.
3. The method of claim 2, wherein the dataset features include traditional meta-features and non-supervised clustering-based meta-features;
the extracting the corresponding dataset features from the target dataset and the plurality of empirical datasets, respectively, includes:
extracting traditional meta-features from the target data set, extracting corresponding meta-features based on the unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combining the extracted traditional features with the meta-features based on the unsupervised clustering to obtain data set features of the target data set;
for each empirical data set of the plurality of empirical data sets, extracting conventional meta-features from the empirical data set, and extracting corresponding meta-features based on unsupervised clustering from the empirical data set according to the preset unsupervised clustering algorithm, and combining the extracted conventional features and the meta-features based on unsupervised clustering to obtain the dataset features of the empirical data set.
4. A method as claimed in claim 3, wherein the legacy meta-features include any one or more of the following features:
sample number, sample number logarithm, feature dimension logarithm, data set dimension logarithm, inverse data set dimension logarithm, minimum class prior probability, maximum class prior probability, average class prior probability, class prior probability standard deviation, minimum kurtosis coefficient, maximum kurtosis coefficient, average kurtosis coefficient, kurtosis coefficient standard deviation, minimum skewness coefficient, maximum skewness coefficient, average skewness coefficient, skewness coefficient standard deviation, first principal component skewness coefficient and first principal component kurtosis coefficient;
wherein the dataset dimension is a ratio of the feature dimension to the number of samples;
the inverse dataset dimension is the inverse of the dataset dimension;
the minimum class prior probability is the minimum value in the numerical value obtained by dividing the number of samples of each class by the total number of the samples;
the maximum class prior probability is the maximum value of the numerical values obtained by dividing the number of samples of each class by the total number of the samples;
the average class prior probability is the mean of the values obtained by dividing the number of samples of each class by the total number of samples;
the class prior probability standard deviation is the standard deviation of the values obtained by dividing the number of samples of each class by the total number of samples;
the kurtosis coefficient measures the shape of the tails of the dataset's data distribution relative to a normal distribution; the minimum kurtosis coefficient is the minimum of the kurtosis coefficients of all continuous features, the maximum kurtosis coefficient is the maximum of the kurtosis coefficients of all continuous features, the average kurtosis coefficient is the mean of the kurtosis coefficients of all continuous features, and the kurtosis coefficient standard deviation is the standard deviation of the kurtosis coefficients of all continuous features;
the skewness coefficient measures the symmetry of the dataset's data distribution about its mean; the minimum skewness coefficient is the minimum of the skewness coefficients of all continuous features, the maximum skewness coefficient is the maximum of the skewness coefficients of all continuous features, the average skewness coefficient is the mean of the skewness coefficients of all continuous features, and the skewness coefficient standard deviation is the standard deviation of the skewness coefficients of all continuous features;
the PCA meta-feature is used for characterizing statistics of principal components in the dataset;
the first principal component skewness coefficient is the skewness coefficient of the largest principal component of the PCA meta-features, and the first principal component kurtosis coefficient is the kurtosis coefficient of the largest principal component of the PCA meta-features.
5. The method of claim 3, wherein the unsupervised cluster-based meta-features include one or more of:
intra-class compactness;
inter-class separation;
Davies-Bouldin index;
Dunn index.
6. A method according to claim 3, wherein the preset unsupervised clustering algorithm comprises one or more unsupervised clustering algorithms;
when a plurality of unsupervised clustering algorithms are included, corresponding unsupervised clustering-based meta-features are extracted from the dataset according to each unsupervised clustering algorithm.
7. The method of claim 1, wherein the selecting a set of superparameters from the plurality of sets of superparameters that fit the target dataset based on the prediction results corresponding to each set of superparameters comprises:
determining classification intervals in each prediction result according to the maximum interval criterion;
and selecting a group of super parameters corresponding to the prediction result with the largest classification interval.
8. The method of any of claims 1-7, wherein the method further comprises:
determining a supervised machine learning model trained from the tagged data in the target dataset as a reference supervised machine learning model for the target dataset;
based on the marked data in the target data set, respectively performing cross verification on a semi-supervised machine learning model and a reference supervised machine learning model of the target data set to respectively obtain evaluation values corresponding to the semi-supervised machine learning model and the reference supervised machine learning model;
And determining one of the semi-supervised machine learning model and the reference supervised machine learning model as a final model suitable for the target data set according to the evaluation value.
9. An automatic semi-supervised machine learning apparatus for use in a business scenario for classification based on machine learning techniques, wherein the apparatus comprises:
the acquisition unit is used for acquiring a target data set, part of sample data in the target data set is provided with marks, and the target data set is at least a webpage data set or a medical image data set;
a first determining unit configured to select one experience data set similar to a target data set, and determine a semi-supervised machine learning algorithm used on the experience data set as the semi-supervised machine learning algorithm of the target data set;
if the target data set is a web page data set, the semi-supervised machine learning algorithm used on the experience data set is the algorithm that achieved the best semi-supervised learning performance on an empirical data set similar to the web page data set; if the target data set is a medical image data set, the semi-supervised machine learning algorithm used on the empirical data set is the algorithm that achieved the best semi-supervised learning performance on an empirical data set similar to the medical image data set;
The selection unit is used for respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding multiple groups of super parameters to obtain a model and a prediction result corresponding to each group of super parameters, and selecting a group of super parameters suitable for the target data set from the multiple groups of super parameters according to the prediction result corresponding to each group of super parameters;
a second determining unit, configured to determine a model corresponding to the selected set of hyper-parameters suitable for the target data set as a semi-supervised machine learning model of the target data set, where the semi-supervised machine learning model is at least suitable for processing a web classification traffic scenario or a medical image classification traffic scenario.
10. The apparatus of claim 9, wherein the first determining unit comprises:
an acquisition module for acquiring a plurality of experience data sets;
the extraction module is used for respectively extracting corresponding data set characteristics from the target data set and the plurality of experience data sets;
a determining module for determining an empirical data set from the plurality of empirical data sets that is similar to the target data set based on the dataset characteristics.
11. The apparatus of claim 10, wherein the dataset features include legacy meta-features and non-supervised clustering-based meta-features;
the extraction module comprises:
the first extraction sub-module is used for extracting traditional meta-features from the target data set, extracting corresponding meta-features based on the unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combining the extracted traditional features with the meta-features based on the unsupervised clustering to obtain data set features of the target data set;
and the second extraction sub-module is used for extracting traditional meta-characteristics from each experience data set in the plurality of experience data sets, extracting corresponding meta-characteristics based on the unsupervised clustering from the experience data sets according to the preset unsupervised clustering algorithm, and combining the extracted traditional characteristics and the meta-characteristics based on the unsupervised clustering to obtain the data set characteristics of the experience data set.
12. The apparatus of claim 11, wherein the legacy meta-features comprise any one or more of the following features:
sample number, sample number logarithm, feature dimension logarithm, data set dimension logarithm, inverse data set dimension logarithm, minimum class prior probability, maximum class prior probability, average class prior probability, class prior probability standard deviation, minimum kurtosis coefficient, maximum kurtosis coefficient, average kurtosis coefficient, kurtosis coefficient standard deviation, minimum skewness coefficient, maximum skewness coefficient, average skewness coefficient, skewness coefficient standard deviation, first principal component skewness coefficient and first principal component kurtosis coefficient;
Wherein the dataset dimension is a ratio of the feature dimension to the number of samples;
the inverse dataset dimension is the inverse of the dataset dimension;
the minimum class prior probability is the minimum value in the numerical value obtained by dividing the number of samples of each class by the total number of the samples;
the maximum class prior probability is the maximum value of the numerical values obtained by dividing the number of samples of each class by the total number of the samples;
the average class prior probability is the mean of the values obtained by dividing the number of samples of each class by the total number of samples;
the class prior probability standard deviation is the standard deviation of the values obtained by dividing the number of samples of each class by the total number of samples;
the kurtosis coefficient measures the shape of the tails of the dataset's data distribution relative to a normal distribution; the minimum kurtosis coefficient is the minimum of the kurtosis coefficients of all continuous features, the maximum kurtosis coefficient is the maximum of the kurtosis coefficients of all continuous features, the average kurtosis coefficient is the mean of the kurtosis coefficients of all continuous features, and the kurtosis coefficient standard deviation is the standard deviation of the kurtosis coefficients of all continuous features;
the skewness coefficient measures the symmetry of the dataset's data distribution about its mean; the minimum skewness coefficient is the minimum of the skewness coefficients of all continuous features, the maximum skewness coefficient is the maximum of the skewness coefficients of all continuous features, the average skewness coefficient is the mean of the skewness coefficients of all continuous features, and the skewness coefficient standard deviation is the standard deviation of the skewness coefficients of all continuous features;
The PCA meta-feature is used for characterizing statistics of principal components in the dataset;
the first principal component skewness coefficient is the skewness coefficient of the largest principal component of the PCA meta-features, and the first principal component kurtosis coefficient is the kurtosis coefficient of the largest principal component of the PCA meta-features.
13. The apparatus of claim 11, wherein the unsupervised cluster-based meta-features comprise one or more of:
intra-class compactness;
inter-class separation;
Davies-Bouldin index;
Dunn index.
14. The apparatus of claim 11, wherein the preset unsupervised clustering algorithm comprises one or more unsupervised clustering algorithms;
and the extraction module is also used for respectively extracting the corresponding meta-characteristics based on the unsupervised clustering from the data set according to each unsupervised clustering algorithm when the plurality of unsupervised clustering algorithms are included.
15. The apparatus of claim 9, wherein the selection unit comprises:
the determining module is used for determining the classification interval in each prediction result according to the maximum interval criterion;
and the selection module is used for selecting a group of super parameters corresponding to the prediction result with the largest classification interval.
16. The apparatus of any of claims 9-15, wherein the apparatus further comprises:
A third determining unit configured to determine a supervised machine learning model trained with the marked data in the target data set as a reference supervised machine learning model of the target data set;
the verification unit is used for respectively carrying out cross verification on the semi-supervised machine learning model and the reference supervised machine learning model of the target data set based on the marked data in the target data set to respectively obtain evaluation values corresponding to the semi-supervised machine learning model and the reference supervised machine learning model;
and a fourth determining unit configured to determine, according to the evaluation value, one of the semi-supervised machine learning model and the reference supervised machine learning model as a final model suitable for the target data set.
17. A computer readable storage medium, wherein the computer readable storage medium has a computer program stored thereon, wherein the computer program when executed by one or more computing devices implements the method of any of claims 1-8.
18. A system comprising one or more computing devices and one or more storage devices, the one or more storage devices having a computer program recorded thereon, which when executed by the one or more computing devices, causes the one or more computing devices to implement the method of any of claims 1-8.
CN201811341910.0A 2018-11-12 2018-11-12 Method and device for realizing automatic semi-supervised machine learning Active CN111178533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811341910.0A CN111178533B (en) 2018-11-12 2018-11-12 Method and device for realizing automatic semi-supervised machine learning

Publications (2)

Publication Number Publication Date
CN111178533A CN111178533A (en) 2020-05-19
CN111178533B true CN111178533B (en) 2024-04-16

Family

ID=70655267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811341910.0A Active CN111178533B (en) 2018-11-12 2018-11-12 Method and device for realizing automatic semi-supervised machine learning

Country Status (1)

Country Link
CN (1) CN111178533B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112804304B (en) * 2020-12-31 2022-04-19 平安科技(深圳)有限公司 Task node distribution method and device based on multi-point output model and related equipment
CN113051452B (en) * 2021-04-12 2022-04-26 清华大学 Operation and maintenance data feature selection method and device
TWI789960B (en) * 2021-10-22 2023-01-11 東吳大學 A three stage recursive method using behavior finance rogo advisor model
CN114462621A (en) * 2022-01-06 2022-05-10 深圳安巽科技有限公司 Machine supervision learning method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014123341A (en) * 2012-11-26 2014-07-03 Ricoh Co Ltd Information processor, information processing method, program and recording medium
WO2014186488A2 (en) * 2013-05-15 2014-11-20 Microsoft Corporation Tuning hyper-parameters of a computer-executable learning algorithm
WO2014205231A1 (en) * 2013-06-19 2014-12-24 The Regents Of The University Of Michigan Deep learning framework for generic object detection
CN106803099A (en) * 2016-12-29 2017-06-06 东软集团股份有限公司 A kind of data processing method and device
CN106991296A (en) * 2017-04-01 2017-07-28 大连理工大学 Ensemble classifier method based on the greedy feature selecting of randomization
GB201805302D0 (en) * 2018-03-29 2018-05-16 Benevolentai Tech Limited Ensemble Model Creation And Selection
CN108062587A (en) * 2017-12-15 2018-05-22 清华大学 The hyper parameter automatic optimization method and system of a kind of unsupervised machine learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Multi-kernel semi-supervised support vector machine learning method with L_p-norm constraint; Hu Qinghui, Ding Lixin, He Jinrong; Journal of Software; 2013-11-15 (11); full text *
An experimental comparison of semi-supervised dimensionality reduction methods; Chen Shiguo, Zhang Daoqiang; Journal of Software; 2010-11-05 (01); full text *
Semi-supervised transductive SVM incremental learning algorithm based on locality-sensitive hashing; Yao Minghai, Lin Xuanmin, Wang Xianbao; Journal of Zhejiang University of Technology; 2018-04-09 (02); full text *
Semi-supervised active learning video annotation based on adaptive SVM; Zhang Jianming, Sun Chunmei, Yan Ting; Computer Engineering; 2013-08-15 (08); full text *
A review of research on semi-supervised learning; Han Song, Han Qiuhong; Computer Engineering and Applications; (06); full text *

Also Published As

Publication number Publication date
CN111178533A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN111178533B (en) Method and device for realizing automatic semi-supervised machine learning
Li et al. Localizing and quantifying damage in social media images
JP6844301B2 (en) Methods and data processors to generate time series data sets for predictive analytics
WO2019196545A1 (en) Data processing method, apparatus and device for insurance fraud identification, and server
CN113610239B (en) Feature processing method and feature processing system for machine learning
Wang et al. Graph convolutional nets for tool presence detection in surgical videos
CN103440512A (en) Identifying method of brain cognitive states based on tensor locality preserving projection
CN111160959B (en) User click conversion prediction method and device
Li et al. Localizing and quantifying infrastructure damage using class activation mapping approaches
Kong et al. Pattern mining saliency
Zhu et al. Age estimation algorithm of facial images based on multi-label sorting
CN113986674A (en) Method and device for detecting abnormity of time sequence data and electronic equipment
Carballal et al. Transfer learning features for predicting aesthetics through a novel hybrid machine learning method
CN113989574B (en) Image interpretation method, image interpretation device, electronic device, and storage medium
Zhao et al. Safe semi-supervised classification algorithm combined with active learning sampling strategy
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
Suresh et al. A fuzzy based hybrid hierarchical clustering model for twitter sentiment analysis
CN114492657A (en) Plant disease classification method and device, electronic equipment and storage medium
CN113656707A (en) Financing product recommendation method, system, storage medium and equipment
CN105824871A (en) Picture detecting method and equipment
JP5946949B1 (en) DATA ANALYSIS SYSTEM, ITS CONTROL METHOD, PROGRAM, AND RECORDING MEDIUM
CN111091198A (en) Data processing method and device
Burgard et al. Mixed-Integer Linear Optimization for Semi-Supervised Optimal Classification Trees
Xie The analysis on the application of machine learning algorithms in risk rating of P2P online loan platforms
Yan et al. A CNN-based fingerprint image quality assessment method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant