CN111178533A - Method and device for realizing automatic semi-supervised machine learning - Google Patents

Method and device for realizing automatic semi-supervised machine learning

Info

Publication number
CN111178533A
CN111178533A (application CN201811341910.0A)
Authority
CN
China
Prior art keywords: data set, machine learning, hyper-parameter, target data, semi-supervised
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811341910.0A
Other languages
Chinese (zh)
Other versions
CN111178533B (en)
Inventor
王海 (Wang Hai)
李宇峰 (Li Yufeng)
涂威威 (Tu Weiwei)
魏通 (Wei Tong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN201811341910.0A
Publication of CN111178533A
Application granted
Publication of CN111178533B
Active legal status
Anticipated expiration legal status

Abstract

The invention discloses a method and a device for realizing automatic semi-supervised machine learning, relates to the technical field of machine learning, and mainly aims to solve the problem that existing semi-supervised machine learning consumes manpower. The main technical scheme of the invention is as follows: acquire a target data set; select an empirical data set similar to the target data set, and determine the semi-supervised machine learning algorithm used on that empirical data set as the semi-supervised machine learning algorithm of the target data set; perform model training and prediction on the target data set with the semi-supervised machine learning algorithm under each of multiple corresponding groups of hyper-parameters to obtain a model and a prediction result for each group, and select from the multiple groups the one group of hyper-parameters suitable for the target data set according to the prediction results; and determine the model corresponding to the selected group of hyper-parameters as the semi-supervised machine learning model of the target data set. The invention is used for semi-supervised machine learning of target data sets.

Description

Method and device for realizing automatic semi-supervised machine learning
Technical Field
The invention relates to the technical field of machine learning, in particular to a method and a device for realizing automatic semi-supervised machine learning.
Background
With the continuous progress of technology, artificial intelligence has gradually developed. Machine learning is a natural product of artificial intelligence research reaching a certain stage, and aims to improve the performance of a system itself by means of computation and experience. In a computer system, "experience" usually exists in the form of "data", from which a "model" can be generated by a machine learning algorithm; that is, by providing empirical data to a machine learning algorithm, a model can be generated based on that data, and the model provides a corresponding judgment, i.e. a prediction, when faced with a new situation. Whether a machine learning model is being trained, or a trained machine learning model is being used for prediction, the data needs to be converted into machine learning samples comprising various features.
Currently, there are three types of machine learning: supervised machine learning, unsupervised machine learning, and semi-supervised machine learning. Supervised machine learning trains models on labeled data, where labeled data can be understood as data whose inputs and outputs are both known. Unsupervised machine learning models hidden relationships among unlabeled sample data. Semi-supervised machine learning trains models on a large amount of unlabeled data assisted by a small amount of labeled data. In many cases, labeled data is difficult to obtain, so the quantity of labeled data is far smaller than that of unlabeled data. For example, in web page classification, the number of web pages actually labeled by users is very small, so the internet as a whole contains a large number of unlabeled web pages and only a small number of labeled ones. Likewise, in medical image classification, a large number of medical images can be collected from hospitals, but having physicians label these images requires a large amount of manpower, so the task offers a large number of unlabeled images and only a small number of labeled ones. In such cases, given the characteristics of semi-supervised machine learning, training models by semi-supervised machine learning for subsequent classification and recognition has become the main means in the field.
However, for tasks such as web page classification or medical image classification, model training based on semi-supervised machine learning often consumes a large amount of computing resources, requires manual intervention, and is difficult to automate. The manual intervention depends heavily on the modeler's business experience and familiarity with the algorithms, and the performance of the resulting model is not necessarily superior.
Disclosure of Invention
In view of the above problems, the present invention provides a method and an apparatus for implementing automatic semi-supervised machine learning, the main aim being to implement automatic semi-supervised machine learning and thereby solve the problem of labor consumption in the existing semi-supervised machine learning process.
In order to achieve the purpose, the invention mainly provides the following technical scheme:
in one aspect, the present invention provides a method for implementing automatic semi-supervised machine learning, specifically comprising:
acquiring a target data set, wherein part of sample data in the target data set is provided with a mark;
selecting an empirical data set similar to a target data set, and determining a semi-supervised machine learning algorithm used on the empirical data set as the semi-supervised machine learning algorithm of the target data set;
respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding multiple groups of hyper-parameters to obtain a model and a prediction result corresponding to each group of hyper-parameters, and selecting one group of hyper-parameters suitable for the target data set from the multiple groups of hyper-parameters according to the prediction result corresponding to each group of hyper-parameters;
determining a model corresponding to the selected set of hyper-parameters that fits the target dataset as a semi-supervised machine learning model of the target dataset.
Optionally, the selecting an empirical data set similar to the target data set includes:
acquiring a plurality of experience data sets;
extracting corresponding data set characteristics from the target data set and the plurality of experience data sets respectively;
determining one empirical data set from the plurality of empirical data sets that is similar to the target data set based on the data set characteristics.
Optionally, the data set features include traditional meta-features and meta-features based on unsupervised clustering;
the extracting corresponding data set features from the target data set and the plurality of empirical data sets, respectively, comprises:
extracting traditional meta-features from the target data set, extracting corresponding meta-features based on unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combining the extracted traditional meta-features and the meta-features based on unsupervised clustering to obtain data set features of the target data set;
for each experience data set in the plurality of experience data sets, extracting traditional meta-features from the experience data set, extracting corresponding meta-features based on unsupervised clustering from the experience data set according to the preset unsupervised clustering algorithm, and combining the extracted traditional meta-features and the meta-features based on unsupervised clustering to obtain the data set features of the experience data set.
Optionally, the legacy meta-feature comprises any one or more of the following features:
sample number, sample number logarithm, feature dimension logarithm, dataset dimension logarithm, inverse dataset dimension logarithm, minimum class prior probability, maximum class prior probability, average class prior probability, class prior probability standard deviation, minimum kurtosis coefficient, maximum kurtosis coefficient, average kurtosis coefficient, kurtosis coefficient standard deviation, minimum skewness coefficient, maximum skewness coefficient, average skewness coefficient, skewness coefficient standard deviation, PCA 95% principal component, first principal component skewness coefficient, and first principal component kurtosis coefficient;
wherein the dataset dimension is a ratio of the feature dimension to the number of samples;
the inverse dataset dimension is the inverse of the dataset dimension;
the minimum class prior probability is the minimum value of numerical values obtained by dividing the number of each class sample by the total number of the samples;
the maximum class prior probability is the maximum value of numerical values obtained by dividing the number of each class sample by the total number of the samples;
the average class prior probability is an average value in numerical values obtained by dividing the number of each class sample by the total number of the samples;
the class prior probability standard deviation is a standard deviation calculated by a plurality of numerical values obtained by dividing the number of each class sample by the total number of the samples;
the kurtosis coefficient is used for measuring the form of the data distribution of the data set relative to the two end parts of normal distribution, the minimum kurtosis coefficient is the minimum value of all continuous characteristic kurtosis coefficients, the maximum kurtosis coefficient is the maximum value of all continuous characteristic kurtosis coefficients, the average kurtosis coefficient is the average value of all continuous characteristic kurtosis coefficients, and the standard deviation of the kurtosis coefficient is the standard deviation of all continuous characteristic kurtosis coefficients;
the skewness coefficient is used for measuring the symmetry of the data distribution of the data set about its mean, the minimum skewness coefficient is the minimum value of the skewness coefficients of all continuous features, the maximum skewness coefficient is the maximum value of the skewness coefficients of all continuous features, the average skewness coefficient is the average value of the skewness coefficients of all continuous features, and the skewness coefficient standard deviation is the standard deviation of the skewness coefficients of all continuous features;
the PCA meta-features are statistics characterizing the principal components of a data set; the PCA 95% principal component is obtained by performing principal component analysis on the samples and retaining, in descending order of variance, the d' principal components that preserve 95% of the variance of the original data, where d is the feature dimension and the meta-feature is the ratio d'/d;
the first principal component skewness coefficient is the skewness coefficient of the largest principal component in the PCA meta-features, and the first principal component kurtosis coefficient is the kurtosis coefficient of the largest principal component in the PCA meta-features.
Optionally, the meta-features based on unsupervised clustering include one or more of the following:
intra-class compactness;
degree of inter-class separation;
Davies-Bouldin index;
Dunn index.
Optionally, the preset unsupervised clustering algorithm includes one or more unsupervised clustering algorithms;
when various unsupervised clustering algorithms are included, corresponding meta-features based on unsupervised clustering are respectively extracted from the data set according to each unsupervised clustering algorithm.
Optionally, the selecting, according to the prediction result corresponding to each group of hyper-parameters, a group of hyper-parameters suitable for the target data set from the multiple groups of hyper-parameters includes:
determining classification intervals in each prediction result according to a maximum interval criterion;
and selecting a group of hyper-parameters corresponding to the prediction result with the largest classification interval.
Optionally, the method further includes:
determining a supervised machine learning model trained on labeled data in the target dataset as a baseline supervised machine learning model for the target dataset;
respectively performing cross validation on a semi-supervised machine learning model and a reference supervised machine learning model of the target data set based on marked data in the target data set to respectively obtain evaluation values corresponding to the semi-supervised machine learning model and the reference supervised machine learning model;
and determining one of the semi-supervised machine learning model and the benchmark supervised machine learning model as a final model suitable for the target data set according to the evaluation value.
On the other hand, the invention provides a device for realizing automatic semi-supervised machine learning, which specifically comprises:
the device comprises an acquisition unit, a storage unit and a processing unit, wherein the acquisition unit is used for acquiring a target data set, and part of sample data in the target data set is provided with marks;
a first determining unit, configured to select an empirical data set similar to a target data set, and determine a semi-supervised machine learning algorithm used on the empirical data set as a semi-supervised machine learning algorithm of the target data set;
the selection unit is used for respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding groups of hyper-parameters to obtain a model and a prediction result corresponding to each group of hyper-parameters, and selecting a group of hyper-parameters suitable for the target data set from the groups of hyper-parameters according to the prediction result corresponding to each group of hyper-parameters;
a second determination unit for determining a model corresponding to the selected set of hyper-parameters suitable for the target dataset as a semi-supervised machine learning model of the target dataset.
Optionally, the first determining unit includes:
the acquisition module is used for acquiring a plurality of experience data sets;
an extraction module, configured to extract corresponding data set features from the target data set and the plurality of empirical data sets, respectively;
a determination module for determining from the plurality of empirical data sets a similar empirical data set to the target data set based on the data set characteristics.
Optionally, the data set features include traditional meta-features and meta-features based on unsupervised clustering;
the extraction module comprises:
the first extraction sub-module is used for extracting traditional meta-features from the target data set, extracting corresponding meta-features based on unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combining the extracted traditional meta-features and the meta-features based on unsupervised clustering to obtain data set features of the target data set;
and the second extraction sub-module is used for extracting traditional meta-features from each experience data set in the plurality of experience data sets, extracting corresponding meta-features based on unsupervised clustering from the experience data sets according to the preset unsupervised clustering algorithm, and combining the extracted traditional meta-features and the meta-features based on unsupervised clustering to obtain the data set features of the experience data sets.
Optionally, the legacy meta-feature comprises any one or more of the following features:
sample number, sample number logarithm, feature dimension logarithm, dataset dimension logarithm, inverse dataset dimension logarithm, minimum class prior probability, maximum class prior probability, average class prior probability, class prior probability standard deviation, minimum kurtosis coefficient, maximum kurtosis coefficient, average kurtosis coefficient, kurtosis coefficient standard deviation, minimum skewness coefficient, maximum skewness coefficient, average skewness coefficient, skewness coefficient standard deviation, PCA 95% principal component, first principal component skewness coefficient, and first principal component kurtosis coefficient;
wherein the dataset dimension is a ratio of the feature dimension to the number of samples;
the inverse dataset dimension is the inverse of the dataset dimension;
the minimum class prior probability is the minimum value of numerical values obtained by dividing the number of each class sample by the total number of the samples;
the maximum class prior probability is the maximum value of numerical values obtained by dividing the number of each class sample by the total number of the samples;
the average class prior probability is an average value in numerical values obtained by dividing the number of each class sample by the total number of the samples;
the class prior probability standard deviation is a standard deviation calculated by a plurality of numerical values obtained by dividing the number of each class sample by the total number of the samples;
the kurtosis coefficient is used for measuring the form of the data distribution of the data set relative to the two end parts of normal distribution, the minimum kurtosis coefficient is the minimum value of all continuous characteristic kurtosis coefficients, the maximum kurtosis coefficient is the maximum value of all continuous characteristic kurtosis coefficients, the average kurtosis coefficient is the average value of all continuous characteristic kurtosis coefficients, and the standard deviation of the kurtosis coefficient is the standard deviation of all continuous characteristic kurtosis coefficients;
the skewness coefficient is used for measuring the symmetry of the data distribution of the data set about its mean, the minimum skewness coefficient is the minimum value of the skewness coefficients of all continuous features, the maximum skewness coefficient is the maximum value of the skewness coefficients of all continuous features, the average skewness coefficient is the average value of the skewness coefficients of all continuous features, and the skewness coefficient standard deviation is the standard deviation of the skewness coefficients of all continuous features;
the PCA meta-features are statistics characterizing the principal components of a data set; the PCA 95% principal component is obtained by performing principal component analysis on the samples and retaining, in descending order of variance, the d' principal components that preserve 95% of the variance of the original data, where d is the feature dimension and the meta-feature is the ratio d'/d;
the first principal component skewness coefficient is the skewness coefficient of the largest principal component in the PCA meta-features, and the first principal component kurtosis coefficient is the kurtosis coefficient of the largest principal component in the PCA meta-features.
Optionally, the meta-features based on unsupervised clustering include one or more of the following:
intra-class compactness;
degree of inter-class separation;
Davies-Bouldin index;
Dunn index.
Optionally, the preset unsupervised clustering algorithm includes one or more unsupervised clustering algorithms;
the extraction module is further used for extracting corresponding meta-features based on unsupervised clustering from the data set according to each unsupervised clustering algorithm when various unsupervised clustering algorithms are included.
Optionally, the selecting unit includes:
the determining module is used for determining the classification interval in each prediction result according to the maximum interval criterion;
and the selection module is used for selecting a group of hyper-parameters corresponding to the prediction result with the maximum classification interval.
Optionally, the apparatus further comprises:
a third determining unit, configured to determine a supervised machine learning model trained on labeled data in the target data set as a reference supervised machine learning model of the target data set;
the verification unit is used for respectively performing cross verification on the semi-supervised machine learning model and the reference supervised machine learning model of the target data set based on the marked data in the target data set to respectively obtain evaluation values corresponding to the semi-supervised machine learning model and the reference supervised machine learning model;
a fourth determination unit configured to determine one of the semi-supervised machine learning model and the reference supervised machine learning model as a final model suitable for the target data set, according to the evaluation value.
In another aspect, the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium has a computer program stored thereon, and wherein the computer program, when executed by one or more computing devices, implements the method for implementing automatic semi-supervised machine learning according to the first aspect.
In another aspect, the present invention provides a system comprising one or more computing devices and one or more storage devices having a computer program recorded thereon, which when executed by the one or more computing devices, causes the one or more computing devices to implement the method of implementing automatic semi-supervised machine learning as described in the first aspect above.
By means of the above technical scheme, the method and device for realizing automatic semi-supervised machine learning can acquire a target data set; select an empirical data set similar to the target data set and determine the semi-supervised machine learning algorithm used on that empirical data set as the semi-supervised machine learning algorithm of the target data set; perform model training and prediction on the target data set with the semi-supervised machine learning algorithm under each of the corresponding multiple groups of hyper-parameters to obtain a model and a prediction result for each group; select, according to the prediction results, the group of hyper-parameters suitable for the target data set; and finally determine the model corresponding to the selected group as the semi-supervised machine learning model of the target data set, thereby realizing automatic semi-supervised machine learning. Compared with the existing semi-supervised machine learning process that requires manual intervention, the invention determines the needed semi-supervised machine learning algorithm from an empirical data set similar to the target data set, determines a suitable group of hyper-parameters among the multiple groups corresponding to that algorithm, and takes the model corresponding to that group as the semi-supervised machine learning model of the target data set. Semi-supervised learning is thus performed automatically, avoiding the labor consumed by manual intervention in modeling and model selection.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 shows a flowchart of a method for implementing automatic semi-supervised machine learning according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another method for implementing automatic semi-supervised machine learning according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an example of a process for determining an optimal set of hyper-parameters based on the maximum interval criterion according to an embodiment of the present invention;
FIG. 4 is a block diagram illustrating an implementation of an automatic semi-supervised machine learning apparatus according to an embodiment of the present invention;
fig. 5 shows a block diagram of another apparatus for implementing automatic semi-supervised machine learning according to an embodiment of the present invention.
Detailed Description
With the advent of massive data, artificial intelligence technology has developed rapidly. To extract value from massive data, the relevant personnel must not only be proficient in artificial intelligence techniques (particularly machine learning techniques) but also be very familiar with the specific scenarios in which those techniques are applied (e.g., image processing, voice processing, automatic control, financial services, internet advertising, etc.). If the relevant personnel have insufficient knowledge of the business, or insufficient modeling experience, poor modeling results are likely. This can currently be alleviated in two ways: first, lowering the threshold of machine learning so that machine learning algorithms are easy to use; second, improving model precision so that algorithms are highly universal and produce better results. These two aspects are not opposed, since the improved algorithms of the second aspect can assist the first. Furthermore, when a neural network model is to be used for a corresponding target prediction, the relevant personnel must not only be familiar with various complex technical details of neural networks but also understand the business logic behind the data related to the prediction target. For example, if a machine learning model is used to identify criminal suspects, the relevant personnel must also understand which characteristics a suspect may possess; if a machine learning model is used to detect fraudulent transactions in the financial industry, the relevant personnel must also know the transaction habits of the industry and a series of corresponding expert rules. All of the above creates great difficulty for the application of machine learning technology.
Practitioners therefore desire technical means that solve the above problems, effectively improving the effect of the neural network model while lowering the threshold of model training and application. Many technical problems are involved in this process: to obtain a practical and effective model, one must address not only the non-ideal nature of the training data itself (for example, insufficient, missing, or sparse training data, or distribution differences between training data and prediction data) but also the computational efficiency of massive data. That is, in reality it is impossible to carry out the machine learning process with a perfect training data set and an infinitely complex ideal model. As a data processing system or method for prediction purposes, any scheme for training a model, or for prediction using a model, is necessarily subject to objectively existing data limitations and computational resource limitations, and solves the above technical problems by using specific data processing mechanisms in a computer. These data processing mechanisms rely on the processing power, processing mode and processing data of the computer, and are not purely mathematical or statistical calculations.
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
The embodiment of the invention provides a method for realizing automatic semi-supervised machine learning, which can realize automatic semi-supervised machine learning on a target data set without manual intervention, thereby solving the problem of manpower consumption caused by the existing manually-intervened semi-supervised machine learning process and the problem of serious dependence on business experience and professional degree of a modeler. The method comprises the following specific steps as shown in figure 1:
101. a target data set is acquired.
Wherein part of the sample data in the target data set has a marker. In the embodiment of the present invention, the target data set may be any data set, such as a web page data set or a medical image data set. In the embodiment of the present invention, the marked data may be understood as data of a known classification result, and the unmarked data may be understood as data of an unknown classification result.
In the embodiment of the present invention, the process of acquiring the target data set may be performed according to any existing acquisition manner, for example, an interface dedicated to input of the target data set may be provided, and the target data set may be acquired through the interface.
102. Selecting an empirical data set similar to a target data set, and determining a semi-supervised machine learning algorithm used on the empirical data set as the semi-supervised machine learning algorithm of the target data set.
Here, the empirical data set refers to a data set on which semi-supervised learning has been performed and on which a semi-supervised algorithm is known to perform well.
In practical applications, different data sets have different data distributions, yet many data sets have similar distributions. Based on the principle that similar data sets have the same preference for learning algorithms, this embodiment selects for the target data set the semi-supervised algorithm used on an empirical data set similar to it.
Therefore, based on data set similarity and its bearing on algorithm selection, in the semi-supervised machine learning process of the embodiment of the present invention, after the target data set is obtained in the foregoing step 101, an empirical data set close to the target data set may be determined in this step. In embodiments of the present invention, multiple empirical data sets are available, and the best-performing semi-supervised learning algorithm on each of them is known. The semi-supervised machine learning algorithm corresponding to the empirical data set most similar to the target data set is then initialized as the semi-supervised machine learning algorithm of the target data set.
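As an illustration only (the patent does not prescribe an implementation), the selection of step 102 can be sketched as a nearest-neighbor lookup over meta-feature vectors; the `experience` structure and the Euclidean distance are assumptions of this sketch:

```python
# Minimal sketch of step 102: choose the empirical data set whose meta-feature
# vector is closest to the target's, and reuse the semi-supervised algorithm
# known to perform well on it. The (features, algorithm) pairs are assumed to
# have been collected beforehand; the patent does not fix this interface.
import numpy as np

def select_algorithm(target_features, experience):
    """experience: list of (meta_feature_vector, algorithm_name) pairs."""
    distances = [np.linalg.norm(target_features - feats)
                 for feats, _ in experience]
    most_similar = int(np.argmin(distances))   # most similar empirical data set
    return experience[most_similar][1]         # its well-performing algorithm
```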
103. And respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding multiple groups of hyper-parameters to obtain a model and a prediction result corresponding to each group of hyper-parameters, and selecting one group of hyper-parameters suitable for the target data set from the multiple groups of hyper-parameters according to the prediction result corresponding to each group of hyper-parameters.
After the semi-supervised machine learning algorithm is selected, because each semi-supervised machine learning algorithm has multiple corresponding sets of hyper-parameters, and the models corresponding to these sets are not all suitable for the target data set, in this step the set of hyper-parameters suitable for the target data set needs to be selected from the multiple sets. In this selection process, model training and prediction can be performed on the target data set with the determined semi-supervised learning algorithm under each set of hyper-parameters, obtaining prediction result data for each set. For example, if the semi-supervised algorithm is A and the sets of hyper-parameters are B1, B2, …, Bn, then training and prediction are performed on the target data set according to A + B1, according to A + B2, …, and according to A + Bn. Since there are multiple sets of hyper-parameters, in the present embodiment the set that best fits the target data set is taken, according to the prediction results, as the set of hyper-parameters corresponding to the target data set.
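Under the same caveat, the A + B1 … A + Bn loop of step 103 can be sketched as follows; `make_model` and the `fit`/`predict` interface are placeholders of this illustration, not an API defined by the patent:

```python
# Illustrative sketch of step 103: train one model per candidate group of
# hyper-parameters for the chosen algorithm A, predict on the target data set,
# and keep each (hyper-parameters, model, predictions) triple for later scoring.
def try_hyper_parameters(algorithm, hyper_param_groups, labeled, unlabeled):
    results = []
    for params in hyper_param_groups:          # B1, B2, ..., Bn
        model = make_model(algorithm, params)  # assumed factory helper
        model.fit(labeled, unlabeled)          # semi-supervised training
        predictions = model.predict(unlabeled)
        results.append((params, model, predictions))
    return results
```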
104. Determining a model corresponding to the selected set of hyper-parameters that fits the target dataset as a semi-supervised machine learning model of the target dataset.
After determining a set of hyper-parameters suitable for the target data set based on step 103, since the semi-supervised machine learning model corresponding to each set of hyper-parameters is also trained in step 103, when the set of hyper-parameters most suitable for the target data set is determined, it is actually determined that the model corresponding to the set of hyper-parameters is the model suitable for the target data set. Therefore, according to the method described in this step, the model corresponding to the set of hyper-parameters obtained in step 103 can be determined as the semi-supervised machine learning model corresponding to the target data set.
In an embodiment of the present invention, the target data may be other types of image data, voice data, data for describing an engineering control object, data for describing a user (or behavior thereof), data for describing an object and/or an event in various fields of administration, business, medical, supervision, finance, and the like, in addition to web page data or medical image data.
Further, as a further refinement and extension of the foregoing embodiment, in an embodiment of the present invention, another method for implementing automatic semi-supervised machine learning is further provided, specifically, as shown in fig. 2, where the steps include:
201. a target data set is acquired.
The target data set may be a data set of web page data or of medical image data; this is not limited here and may be determined according to actual conditions. Part of the sample data in the target data set is labeled, and the rest is unlabeled. In addition, in the embodiment of the present invention, the manner of acquiring the target data set and the meanings of labeled and unlabeled data are consistent with the description of step 101 in the foregoing embodiment and are not repeated here.
202. Selecting an empirical data set similar to a target data set, and determining a semi-supervised machine learning algorithm used on the empirical data set as the semi-supervised machine learning algorithm of the target data set.
In this step, when selecting an empirical data set similar to the target data set, the following steps may be specifically performed: first, a plurality of empirical data sets are acquired. Then, corresponding data set features are extracted from the target data set and the plurality of empirical data sets, respectively. The data set features are used for characterizing data distribution, data set structure and the like in the data set. Finally, one empirical data set similar to the target data set is determined from the plurality of empirical data sets based on the data set characteristics. Therefore, the experience data set similar to the target data set can be determined from the plurality of experience data sets according to the characteristics of the data sets, the accuracy of the selected experience data set is ensured, a foundation is laid for the selection of a subsequent semi-supervised machine learning algorithm, and the accuracy of the integral automatic semi-supervised machine learning is ensured.
Further, the dataset features may include traditional meta-features as well as unsupervised clustering based meta-features. Here, a specific way of extracting the corresponding data set features from the target data set and the plurality of empirical data sets may include: extracting traditional meta-features from the target data set, extracting corresponding meta-features based on unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combining the extracted traditional meta-features and the meta-features based on unsupervised clustering to obtain the data set features of the target data set. For each experience data set in the plurality of experience data sets, extracting traditional meta-features from the experience data set, extracting corresponding meta-features based on unsupervised clustering from the experience data set according to the preset unsupervised clustering algorithm, and combining the extracted traditional meta-features and the meta-features based on unsupervised clustering to obtain the data set features of the experience data set. Therefore, the data set characteristics are determined by combining the element characteristics extracted by the unsupervised clustering algorithm and based on unsupervised clustering with the traditional element characteristics, the accuracy of the data set characteristics of the target data set and the empirical data set can be ensured, and then the accuracy of determining the empirical data set similar to the target data set is guaranteed.
It should be noted that the preset unsupervised clustering algorithm selected in the embodiment of the present invention may be one or more unsupervised clustering algorithms, for example, unsupervised clustering algorithms such as spectral clustering, k-means clustering, hierarchical clustering, and the like. And when multiple unsupervised clustering algorithms are selected, in the step, corresponding meta-features based on unsupervised clustering can be respectively extracted from the same data set according to each unsupervised clustering algorithm, so that the meta-features based on unsupervised clustering of the same data set can be extracted based on the multiple unsupervised clustering algorithms, the description of the distribution of the data set by the meta-features based on unsupervised clustering can be more accurate, and the accuracy of the determined empirical data set similar to the target data set can be further improved.
Furthermore, in the embodiment of the present invention, the legacy meta-feature may include any one or more of the following: sample number, sample number logarithm, feature dimension logarithm, dataset dimension logarithm, inverse dataset dimension logarithm, minimum class prior probability, maximum class prior probability, average class prior probability, class prior probability standard deviation, minimum kurtosis coefficient, maximum kurtosis coefficient, average kurtosis coefficient, kurtosis coefficient standard deviation, minimum skewness coefficient, maximum skewness coefficient, average skewness coefficient, skewness coefficient standard deviation, PCA 95% principal component, first principal component skewness coefficient, and first principal component kurtosis coefficient.
Wherein, in the above conventional meta-feature, the dataset dimension is a ratio of the feature dimension to the number of samples; the inverse dataset dimension is the inverse of the dataset dimension; the minimum class prior probability is the minimum value of numerical values obtained by dividing the number of each class sample by the total number of the samples; the maximum class prior probability is the maximum value of numerical values obtained by dividing the number of samples of each class by the total number of the samples; the average class prior probability is the average value of the numerical values obtained by dividing the number of each class sample by the total number of the samples; the class prior probability standard deviation is a standard deviation calculated by a plurality of numerical values obtained by dividing the number of samples of each class by the total number of the samples.
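As a hedged illustration of how a few of these traditional meta-features could be computed (the patent names the features but gives no code; the function and key names below are assumptions), consider:

```python
# Sketch of several traditional meta-features for a feature matrix X
# (n samples x d features) and integer class labels y of the labeled part.
import numpy as np

def traditional_meta_features(X, y):
    n, d = X.shape
    counts = np.bincount(y)
    priors = counts[counts > 0] / counts.sum()   # per-class prior probabilities
    return {
        "log_sample_number": np.log(n),
        "log_feature_dimension": np.log(d),
        "dataset_dimension": d / n,              # feature dimension / sample count
        "inverse_dataset_dimension": n / d,
        "min_class_prior": priors.min(),
        "max_class_prior": priors.max(),
        "mean_class_prior": priors.mean(),
        "std_class_prior": priors.std(),
    }
```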
In addition, in the embodiment of the present invention, the kurtosis coefficient is used to measure the shape of the data distribution of the data set relative to the two tails of a normal distribution, where the kurtosis coefficient $\beta$ may be defined as:

$$\beta = \frac{E\left[(X - \mu_X)^4\right]}{\sigma_X^4}$$

where $\mu_X$ is the mean of the continuous variable $X$ and $\sigma_X$ is its standard deviation.
The minimum kurtosis coefficient is the minimum value of all continuous characteristic kurtosis coefficients; the maximum kurtosis coefficient is the maximum value of all continuous characteristic kurtosis coefficients; the average kurtosis coefficient is the average value of all continuous characteristic kurtosis coefficients; the standard deviation of the kurtosis coefficient is the standard deviation of all continuous characteristic kurtosis coefficients.
The skewness coefficient is used to measure the symmetry of the data distribution of a data set with respect to its mean, where the skewness coefficient $\gamma$ may be defined as:

$$\gamma = \frac{E\left[(X - \mu_X)^3\right]}{\sigma_X^3}$$

where $\mu_X$ is the mean of the continuous variable $X$ and $\sigma_X$ is its standard deviation.
The minimum skewness coefficient is the minimum value of all continuous characteristic skewness coefficients; the maximum skewness coefficient is the maximum value of all continuous characteristic skewness coefficients; the average skewness coefficient is the average value of all continuous characteristic skewness coefficients; the standard deviation of the skewness coefficient is the standard deviation of all continuous characteristic skewness coefficients.
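Applying the two definitions above column-wise and aggregating gives the eight kurtosis/skewness meta-features; the sketch below uses SciPy, an implementation choice of this illustration rather than of the patent (`fisher=False` yields the Pearson kurtosis $E[(X-\mu_X)^4]/\sigma_X^4$):

```python
# Sketch: kurtosis/skewness meta-features over the continuous feature columns.
import numpy as np
from scipy import stats

def moment_meta_features(X_continuous):
    kurt = stats.kurtosis(X_continuous, axis=0, fisher=False)  # beta per column
    skew = stats.skew(X_continuous, axis=0)                    # gamma per column
    return {
        "min_kurtosis": kurt.min(), "max_kurtosis": kurt.max(),
        "mean_kurtosis": kurt.mean(), "std_kurtosis": kurt.std(),
        "min_skewness": skew.min(), "max_skewness": skew.max(),
        "mean_skewness": skew.mean(), "std_skewness": skew.std(),
    }
```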
In addition, the PCA meta-features can be understood as statistics characterizing the principal components of the data set. In the embodiment of the present invention, the PCA 95% principal component means that, after principal component analysis is performed on the samples, the d' principal components with the largest variance are retained such that 95% of the variance of the original data is preserved, where d is the feature dimension and the retained ratio is d'/d;
the first principal component skewness coefficient is a skewness coefficient of the largest principal component of the PCA element characteristics, and the first principal component kurtosis coefficient is a kurtosis coefficient of the largest principal component of the PCA element characteristics.
Still further, in an embodiment of the present invention, the meta-feature based on unsupervised clustering may include one or more of the following: intra-class compactness, inter-class separation, davison burgunds index, and dunne index.
The intra-class compactness is obtained by calculating, within each class, the average distance from each point to the class's cluster center, and then averaging these values over all classes:

$$\overline{C_i} = \frac{1}{|C_i|} \sum_{x \in C_i} \lVert x - w_i \rVert, \qquad \overline{C} = \frac{1}{k} \sum_{i=1}^{k} \overline{C_i}$$

In the above formulas, $\overline{C_i}$ is the compactness of the i-th class: $x$ ranges over the samples belonging to the i-th class, $w_i$ is the cluster center of the i-th class, and $\overline{C_i}$ is the average of the distances from all samples in the i-th class to that center. $\overline{C}$ is the average of $\overline{C_i}$ over all classes, where $k$ is the number of classes in the data set. A lower $\overline{C}$ means that samples of the same class lie closer together.
The inter-class separation is obtained by calculating the average distance between every two cluster centers:

$$\overline{S} = \frac{2}{k(k-1)} \sum_{i=1}^{k} \sum_{j=i+1}^{k} \lVert w_i - w_j \rVert$$

In the above formula, $w_i$ denotes the cluster center of the i-th class and $\lVert w_i - w_j \rVert$ the distance between the cluster centers of the i-th and j-th classes, so $\overline{S}$ is the average of the distances between all cluster centers in the data set. A higher $\overline{S}$ means the clusters are farther apart.
The Davies-Bouldin Index (DBI for short) is obtained by taking, for any two classes, the sum of their average intra-class distances divided by the distance between their cluster centers, taking the maximum of this quantity for each class, and averaging over all classes:

$$DBI = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{\overline{C_i} + \overline{C_j}}{\lVert w_i - w_j \rVert}$$

In the above formula, $k$ is the number of classes in the data set; $\overline{C_i}$ is the average distance from the data in the i-th class to its cluster center, representing the degree of dispersion of that class; and $\lVert w_i - w_j \rVert$ is the distance between the centers of the i-th and j-th clusters. A smaller DBI indicates that each cluster is internally tight while different clusters are far apart.
The Dunn Index (DVI for short) is obtained by dividing the shortest distance between samples of any two different clusters by the largest intra-cluster distance:

$$DVI = \frac{\min\limits_{i \neq j} \ \min\limits_{x \in C_i,\, y \in C_j} \lVert x - y \rVert}{\max\limits_{m} \ \max\limits_{x, y \in C_m} \lVert x - y \rVert}$$

In the formula, the numerator is the minimum over all inter-class distances, i.e. the shortest distance between clusters, and the denominator is the maximum intra-class distance, i.e. the largest distance within any single cluster. A larger DVI therefore indicates that samples of the same class are closer together and different classes are farther apart: the inter-class distance is large and the intra-class distance is small.
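For illustration, the four clustering meta-features above can be computed from the output of any of the preset clustering algorithms (per-sample labels plus cluster centers); this pure-NumPy sketch assumes every cluster holds at least two samples and is not an implementation mandated by the patent:

```python
# Sketch: intra-class compactness, inter-class separation, Davies-Bouldin
# index and Dunn index for a clustering (labels, centers) of data X.
import numpy as np
from itertools import combinations

def clustering_meta_features(X, labels, centers):
    k = len(centers)
    # C_i: mean distance of the i-th cluster's samples to its center
    per_class = [np.linalg.norm(X[labels == i] - centers[i], axis=1).mean()
                 for i in range(k)]
    compactness = float(np.mean(per_class))           # lower = tighter clusters
    center_dists = [np.linalg.norm(centers[i] - centers[j])
                    for i, j in combinations(range(k), 2)]
    separation = float(np.mean(center_dists))         # higher = farther apart
    dbi = float(np.mean([max((per_class[i] + per_class[j]) /
                             np.linalg.norm(centers[i] - centers[j])
                             for j in range(k) if j != i)
                         for i in range(k)]))
    min_between = min(np.linalg.norm(a - b)
                      for i, j in combinations(range(k), 2)
                      for a in X[labels == i] for b in X[labels == j])
    max_within = max(np.linalg.norm(a - b)
                     for i in range(k)
                     for a, b in combinations(X[labels == i], 2))
    dunn = min_between / max_within                   # higher = better separated
    return compactness, separation, dbi, dunn
```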
203. And respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding multiple groups of hyper-parameters to obtain a model and a prediction result corresponding to each group of hyper-parameters, and selecting one group of hyper-parameters suitable for the target data set from the multiple groups of hyper-parameters according to the prediction result corresponding to each group of hyper-parameters.
After the semi-supervised machine learning algorithm corresponding to the target data set is determined in the foregoing step 202, multiple sets of hyper-parameters exist for the algorithm, and not all of them are suitable for the target data set; selecting a wrong set of hyper-parameters would compromise the accuracy of the subsequently obtained semi-supervised machine learning model. Therefore, in this step, a set of hyper-parameters suitable for the target data set needs to be determined from the multiple sets of hyper-parameters.
In this step, selecting a set of hyper-parameters suitable for the target data set according to the prediction results may proceed as follows: determine the classification interval of each prediction result according to the maximum interval criterion, and then select the set of hyper-parameters corresponding to the prediction result with the largest classification interval. The maximum interval criterion (i.e. the large margin criterion, LM) used in the embodiment of the present invention scores the prediction results produced by different models on the data set, on the view that the larger the classification interval, the better the performance of the corresponding model. Determining the hyper-parameters suitable for the target data set based on the maximum interval criterion dispenses with the complex processes of validation-set division and model evaluation, improving the efficiency of hyper-parameter selection. For example, as shown in fig. 3, after models are trained with two sets of hyper-parameters respectively, a prediction result is obtained for each; the figure shows that the positive-negative class interval of the model prediction result obtained with hyper-parameter set 1 is larger than that obtained with hyper-parameter set 2, so when the preset algorithm includes these two sets of hyper-parameters, set 1 may be determined to have the better prediction effect and hence to be the set of hyper-parameters suitable for the data set in this example.
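One hedged reading of this criterion for binary scores in [-1, 1] is sketched below; the concrete margin statistic (mean distance of scores from the decision boundary) is an assumption of the sketch, as the patent only states that the prediction result with the largest classification interval wins:

```python
# Sketch: pick the hyper-parameter group whose predictions sit farthest from
# the decision boundary, i.e. whose classification interval is largest.
import numpy as np

def pick_by_margin(results):
    """results: (params, model, scores) triples from the training step."""
    margins = [np.abs(scores).mean() for _, _, scores in results]
    best = int(np.argmax(margins))       # largest classification interval
    return results[best]                 # (params, model, predictions)
```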
204. Determining a model corresponding to the selected set of hyper-parameters that fits the target dataset as a semi-supervised machine learning model of the target dataset.
According to the method of step 203, in the process of determining a set of hyper-parameters suitable for the target data set, since the corresponding model is trained, after the set of hyper-parameters suitable for the target data set is determined in step 203, the model corresponding to the set of hyper-parameters is actually the most suitable semi-supervised machine learning model for the target data set, and herein, the model corresponding to the set of hyper-parameters determined in step 203 may be determined as the semi-supervised machine learning model for the target data set according to the method of this step.
205. Determining a supervised machine learning model trained on labeled data in the target dataset as a baseline supervised machine learning model for the target dataset.
A common problem in semi-supervised learning is performance degradation, i.e. the model obtained after using unlabeled data predicts worse than a model trained on labeled data only. In the embodiment of the present invention, the target data set contains labeled data and, although such data is scarce, in order to further ensure the accuracy of the machine learning model corresponding to the target data set, the labeled data in the target data set may be used to train a corresponding benchmark supervised machine learning model according to the method described in this step. The training process here may follow the prior art and is not described in detail.
206. And respectively performing cross validation on the semi-supervised machine learning model and the reference supervised machine learning model of the target data set based on the marked data in the target data set to respectively obtain evaluation values corresponding to the semi-supervised machine learning model and the reference supervised machine learning model.
After the reference supervised machine learning model is trained in step 205, in order to determine which of the semi-supervised machine learning model and the reference supervised machine learning model obtained in step 204 and step 205 respectively is more suitable for the target data set, the method described in this step needs to perform cross validation on the semi-supervised machine learning model and the reference supervised machine learning model of the target data set according to the marked data in the target data set, and obtain the evaluation values corresponding to these two models respectively.
Specifically, in the embodiment of the present invention, the two models may be verified by K-fold cross validation, each obtaining a corresponding evaluation value. For example, assuming that there are 100 pieces of labeled data and K = 2, the labeled data may be divided into two groups A and B of 50 pieces each. On the one hand, the benchmark supervised machine learning model is trained on the group A data and verified with the group B data to obtain the evaluation value X1. On the other hand, after the labels of group B are removed, the selected semi-supervised learning algorithm and a set of hyper-parameters are used to train a model on the labeled group A data and the unlabeled group B data; when training finishes, the prediction result on the unlabeled group B data is obtained and compared with the real labels of the group B data to obtain the evaluation value Y1. Then, after interchanging the two groups A and B, the above process is repeated to obtain evaluation values X2 and Y2 respectively. The sum of X1 and X2 is the evaluation value of the benchmark supervised machine learning model, and the sum of Y1 and Y2 is the evaluation value of the semi-supervised machine learning model.
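The 2-fold procedure just described can be sketched as follows; the model constructors and the `unlabeled=` keyword of the semi-supervised `fit` are placeholders of this illustration, and accuracy stands in for whatever evaluation metric is used:

```python
# Sketch of the K=2 comparison: groups A and B of labeled data alternate as
# training and validation folds; the semi-supervised learner additionally sees
# the held-out fold with its labels removed.
from sklearn.metrics import accuracy_score

def two_fold_compare(A, B, make_supervised, make_semi_supervised):
    """A, B: (features, labels) halves of the labeled data."""
    score_x = score_y = 0.0                                   # X1+X2, Y1+Y2
    for (Xtr, ytr), (Xte, yte) in ((A, B), (B, A)):
        supervised = make_supervised().fit(Xtr, ytr)
        score_x += accuracy_score(yte, supervised.predict(Xte))
        semi = make_semi_supervised().fit(Xtr, ytr, unlabeled=Xte)
        score_y += accuracy_score(yte, semi.predict(Xte))
    return "semi-supervised" if score_y > score_x else "benchmark supervised"
```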
207. And determining one of the semi-supervised machine learning model and the benchmark supervised machine learning model as a final model suitable for the target data set according to the evaluation value.
After the evaluation value of the benchmark supervised machine learning model and the evaluation value of the semi-supervised machine learning model are respectively obtained in the foregoing step 206, the two evaluation values may be compared, and the model with the larger value determined as the final model suitable for the target data set.
For example, following the example in step 206, when the sum of Y1 and Y2 is greater than the sum of X1 and X2, the prediction accuracy of the semi-supervised machine learning model is judged to be better than that of the reference supervised machine learning model, and the semi-supervised machine learning model is selected as the final model; otherwise, the reference supervised machine learning model is determined as the final model.
Thus, by determining the supervised machine learning model trained on the labeled data in the target data set as the reference supervised machine learning model, cross-validating both the semi-supervised machine learning model and the reference supervised machine learning model on the labeled data of the target data set to obtain their respective evaluation values, and selecting one of the two as the final model according to those values, the method guarantees that when the accuracy of the semi-supervised machine learning model turns out to be low, the reference supervised machine learning model can be selected as the final model instead. This preserves the actual accuracy of the machine learning result and solves the performance-degradation problem that may occur in semi-supervised machine learning.
Further, as an implementation of the foregoing method for implementing automatic semi-supervised machine learning, an embodiment of the present invention provides an apparatus for implementing automatic semi-supervised machine learning, which is mainly used to perform automatic semi-supervised machine learning on a target data set and thereby avoid the labor consumed by manual intervention in the semi-supervised machine learning process. For ease of reading, details already given in the foregoing method embodiments are not repeated in this apparatus embodiment, but it should be clear that the apparatus in this embodiment can correspondingly implement all the contents of the foregoing method embodiments. As shown in fig. 4, the apparatus specifically includes:
an obtaining unit 31, configured to acquire a target data set, where part of the sample data in the target data set is labeled;
a first determining unit 32, configured to select an empirical data set similar to the target data set, and determine a semi-supervised machine learning algorithm used on the empirical data set as the semi-supervised machine learning algorithm of the target data set acquired by the acquiring unit 31;
the selecting unit 33 may be configured to perform model training and prediction on the target data set according to the semi-supervised machine learning algorithm determined by the first determining unit 32 and the corresponding multiple sets of hyper-parameters, respectively, to obtain a model and a prediction result corresponding to each set of hyper-parameters, and select a set of hyper-parameters suitable for the target data set from the multiple sets of hyper-parameters according to the prediction result corresponding to each set of hyper-parameters;
a second determining unit 34, configured to determine the model corresponding to the set of hyper-parameters that the selecting unit 33 selected as suitable for the target data set to be the semi-supervised machine learning model of the target data set.
Further, as shown in fig. 5, the first determining unit 32 includes:
an obtaining module 321, which may be configured to obtain a plurality of empirical data sets;
an extracting module 322, configured to extract corresponding data set features from the target data set and the plurality of empirical data sets obtained by the obtaining module 321, respectively;
a determining module 323, configured to determine an empirical data set similar to the target data set from the plurality of empirical data sets according to the data set characteristics extracted by the extracting module 322.
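As an illustration only, the comparison performed by the determining module 323 could take the form below: a hedged sketch that picks the nearest empirical data set by Euclidean distance over normalized meta-feature vectors (the normalization scheme and the function name are assumptions of this example, not prescribed by the invention).

```python
import numpy as np

def most_similar(target_features, empirical_features):
    """target_features: 1-D meta-feature vector of the target data set;
    empirical_features: (n_datasets, n_features) matrix, one row per
    empirical data set. Returns the index of the most similar set."""
    stack = np.vstack([target_features, empirical_features])
    # Normalize each meta-feature so no single scale dominates the distance.
    mean, std = stack.mean(axis=0), stack.std(axis=0) + 1e-12
    norm = (stack - mean) / std
    dists = np.linalg.norm(norm[1:] - norm[0], axis=1)
    return int(np.argmin(dists))
```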
Further, as shown in fig. 5, the data set features include traditional meta-features and meta-features based on unsupervised clustering;
the extraction module 322 includes:
the first extracting sub-module 3221 may be configured to extract traditional meta-features from the target data set, extract corresponding meta-features based on unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combine the extracted traditional meta-features and the unsupervised-clustering-based meta-features to obtain the data set features of the target data set;
the second extracting sub-module 3222 may be configured to, for each empirical data set of the plurality of empirical data sets, extract traditional meta-features from the empirical data set, extract corresponding meta-features based on unsupervised clustering from the empirical data set according to the preset unsupervised clustering algorithm, and combine the extracted traditional meta-features and the unsupervised-clustering-based meta-features to obtain the data set features of the empirical data set.
Further, as shown in fig. 5, the conventional meta-feature includes any one or more of the following features:
sample number, sample number logarithm, feature dimension logarithm, dataset dimension logarithm, inverse dataset dimension logarithm, minimum class prior probability, maximum class prior probability, average class prior probability, class prior probability standard deviation, minimum kurtosis coefficient, maximum kurtosis coefficient, average kurtosis coefficient, kurtosis coefficient standard deviation, minimum skewness coefficient, maximum skewness coefficient, average skewness coefficient, skewness coefficient standard deviation, PCA 95% principal component, first principal component skewness coefficient, and first principal component kurtosis coefficient;
wherein the dataset dimension is a ratio of the feature dimension to the number of samples;
the inverse dataset dimension is the inverse of the dataset dimension;
the minimum class prior probability is the minimum of the values obtained by dividing the number of samples in each class by the total number of samples;
the maximum class prior probability is the maximum of the values obtained by dividing the number of samples in each class by the total number of samples;
the average class prior probability is the average of the values obtained by dividing the number of samples in each class by the total number of samples;
the class prior probability standard deviation is the standard deviation of the values obtained by dividing the number of samples in each class by the total number of samples;
the kurtosis coefficient measures the shape of the tails of a data distribution relative to those of a normal distribution; the minimum kurtosis coefficient is the minimum of the kurtosis coefficients of all continuous features, the maximum kurtosis coefficient is their maximum, the average kurtosis coefficient is their average, and the kurtosis coefficient standard deviation is their standard deviation;
the skewness coefficient measures the symmetry of a data distribution about its mean; the minimum skewness coefficient is the minimum of the skewness coefficients of all continuous features, the maximum skewness coefficient is their maximum, the average skewness coefficient is their average, and the skewness coefficient standard deviation is their standard deviation;
the PCA meta-features characterize statistics of the principal components of the data set; the PCA 95% principal component is obtained by performing principal component analysis on the samples and retaining, in descending order of variance, the smallest number d' of principal components that together preserve 95% of the variance of the original data; with d the feature dimension, the meta-feature is the ratio d'/d (an illustrative computation of several of these meta-features is given after this list);
the first principal component skewness coefficient is a skewness coefficient of the largest principal component of the PCA element characteristics, and the first principal component kurtosis coefficient is a kurtosis coefficient of the largest principal component of the PCA element characteristics.
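By way of illustration, several of the traditional meta-features listed above can be computed as in the following sketch (numpy, scipy, and scikit-learn are assumed; restricting to a subset of the listed features and the dictionary keys are choices of this example, and class labels are assumed to be the integers 0..C-1).

```python
import numpy as np
from scipy.stats import kurtosis, skew
from sklearn.decomposition import PCA

def traditional_meta_features(X, y):
    n, d = X.shape
    priors = np.bincount(y) / n          # class prior probabilities
    kurt = kurtosis(X, axis=0)           # per-feature kurtosis coefficients
    skw = skew(X, axis=0)                # per-feature skewness coefficients
    pca = PCA(n_components=0.95).fit(X)  # keep components covering 95% of the variance
    return {
        "n_samples": n, "log_n_samples": np.log(n),
        "feature_dim": d, "log_feature_dim": np.log(d),
        "dataset_dim": d / n,            # ratio of feature dimension to sample count
        "inv_dataset_dim": n / d,
        "min_class_prior": priors.min(), "max_class_prior": priors.max(),
        "mean_class_prior": priors.mean(), "std_class_prior": priors.std(),
        "min_kurtosis": kurt.min(), "max_kurtosis": kurt.max(),
        "mean_kurtosis": kurt.mean(), "std_kurtosis": kurt.std(),
        "min_skewness": skw.min(), "max_skewness": skw.max(),
        "mean_skewness": skw.mean(), "std_skewness": skw.std(),
        "pca_95_principal_component": pca.n_components_ / d,  # the ratio d'/d
    }
```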
Further, as shown in fig. 5, the meta-feature based on unsupervised clustering includes one or more of the following:
intra-class compactness;
degree of inter-class separation;
Davies-Bouldin index;
Dunn index.
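A minimal sketch of these unsupervised-clustering meta-features follows, assuming k-means as the preset clustering algorithm and simple distance-based definitions of compactness and separation; the patent text does not fix exact formulas, so these definitions are illustrative, and the Dunn index is omitted for brevity.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def clustering_meta_features(X, labels):
    k = int(labels.max()) + 1
    centers = np.array([X[labels == c].mean(axis=0) for c in range(k)])
    # Intra-class compactness: mean distance of points to their own cluster center.
    compactness = np.mean([np.linalg.norm(X[labels == c] - centers[c], axis=1).mean()
                           for c in range(k)])
    # Inter-class separation: mean pairwise distance between cluster centers.
    separation = np.mean([np.linalg.norm(centers[i] - centers[j])
                          for i in range(k) for j in range(i + 1, k)])
    return {"compactness": compactness,
            "separation": separation,
            "davies_bouldin": davies_bouldin_score(X, labels)}

# Example with k-means as the preset clustering algorithm:
#   from sklearn.cluster import KMeans
#   labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
#   feats = clustering_meta_features(X, labels)
```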
Further, as shown in fig. 5, the preset unsupervised clustering algorithm includes one or more unsupervised clustering algorithms;
the extracting module 322 may be further configured to, when multiple unsupervised clustering algorithms are included, respectively extract corresponding meta-features based on unsupervised clustering from the data set according to each unsupervised clustering algorithm.
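Continuing the sketch above (and reusing its clustering_meta_features helper), extraction under multiple configured clustering algorithms might look as follows, with each meta-feature key prefixed by the algorithm that produced it; the two algorithms shown are examples, not a list mandated by the invention.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans

def multi_algorithm_meta_features(X, k=5):
    algorithms = {"kmeans": KMeans(n_clusters=k, n_init=10, random_state=0),
                  "agglomerative": AgglomerativeClustering(n_clusters=k)}
    feats = {}
    for name, algorithm in algorithms.items():
        labels = algorithm.fit_predict(X)
        for key, value in clustering_meta_features(X, labels).items():
            feats[f"{name}_{key}"] = value  # e.g. "kmeans_davies_bouldin"
    return feats
```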
Further, as shown in fig. 5, the selecting unit 33 includes:
the determining module 331 may be configured to determine a classification interval in each prediction result according to a maximum interval criterion;
the selecting module 332 may be configured to select a set of hyper-parameters corresponding to the prediction result with the largest classification interval from the plurality of classification intervals determined by the determining module 331.
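One hedged reading of the maximum-interval criterion, assuming each candidate set of hyper-parameters yields class-probability predictions on the unlabeled data, is sketched below: the classification interval of a sample is taken as the gap between its two largest class probabilities, and the hyper-parameter set with the largest average gap is selected. The exact margin definition used by the invention may differ.

```python
import numpy as np

def select_by_max_interval(proba_by_config):
    """proba_by_config: dict mapping a hyper-parameter identifier to an
    (n_unlabeled, n_classes) array of predicted class probabilities."""
    def avg_interval(proba):
        top2 = np.sort(proba, axis=1)[:, -2:]           # two largest per sample
        return float(np.mean(top2[:, 1] - top2[:, 0]))  # mean classification interval
    return max(proba_by_config, key=lambda cfg: avg_interval(proba_by_config[cfg]))
```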
Further, as shown in fig. 5, the apparatus further includes:
a third determining unit 35, operable to determine a supervised machine learning model trained on labeled data in the target data set as a reference supervised machine learning model of the target data set;
a verification unit 36, configured to perform cross-validation on the semi-supervised machine learning model of the target data set determined by the second determination unit 34 and the reference supervised machine learning model determined by the third determination unit 35 based on the marked data in the target data set, so as to obtain evaluation values corresponding to the semi-supervised machine learning model and the reference supervised machine learning model, respectively;
a fourth determining unit 37, configured to determine, according to the evaluation value obtained by the verifying unit 36, one of the semi-supervised machine learning model and the reference supervised machine learning model as a final model suitable for the target data set.
Further, the embodiment of the present invention also provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program thereon, where the computer program is executed by one or more computing devices to implement the above method for implementing automatic semi-supervised machine learning.
In addition, the embodiment of the present invention also provides a system including one or more computing devices and one or more storage devices, on which a computer program is recorded, where the computer program, when executed by the one or more computing devices, causes the one or more computing devices to implement the above-mentioned method for implementing automatic semi-supervised machine learning.
In summary, the method and apparatus for implementing automatic semi-supervised machine learning provided in the embodiments of the present invention first acquire a target data set; then select an empirical data set similar to the target data set and determine the semi-supervised machine learning algorithm used on that empirical data set as the semi-supervised machine learning algorithm of the target data set; then perform model training and prediction on the target data set with the semi-supervised machine learning algorithm under each of the corresponding sets of hyper-parameters, obtain a model and a prediction result for each set of hyper-parameters, and select from the multiple sets, according to the prediction results, one set of hyper-parameters suitable for the target data set; and finally determine the model corresponding to the selected set of hyper-parameters as the semi-supervised machine learning model of the target data set, thereby realizing automatic semi-supervised machine learning. Whereas existing semi-supervised machine learning requires manual intervention, the present invention selects the required semi-supervised machine learning algorithm from the empirical data set corresponding to the target data set, determines a suitable set of hyper-parameters among the multiple candidate sets for that algorithm, and derives the semi-supervised machine learning model of the target data set from the model corresponding to that set, so that semi-supervised learning proceeds automatically and the labor consumed by manual intervention in modeling and selection is avoided.
In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It should be understood that the relevant features of the method and the apparatus described above may be cross-referenced. In addition, "first", "second", and the like in the above embodiments are used to distinguish between embodiments and do not represent the merits of any particular embodiment.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.
In addition, the memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A method of implementing automatic semi-supervised machine learning, wherein the method comprises:
acquiring a target data set, wherein part of the sample data in the target data set is labeled;
selecting an empirical data set similar to a target data set, and determining a semi-supervised machine learning algorithm used on the empirical data set as the semi-supervised machine learning algorithm of the target data set;
respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding multiple groups of hyper-parameters to obtain a model and a prediction result corresponding to each group of hyper-parameters, and selecting one group of hyper-parameters suitable for the target data set from the multiple groups of hyper-parameters according to the prediction result corresponding to each group of hyper-parameters;
determining a model corresponding to the selected set of hyper-parameters that fits the target dataset as a semi-supervised machine learning model of the target dataset.
2. The method of claim 1, wherein said selecting an empirical data set similar to the target data set comprises:
acquiring a plurality of empirical data sets;
extracting corresponding data set features from the target data set and the plurality of empirical data sets, respectively;
determining one empirical data set from the plurality of empirical data sets that is similar to the target data set based on the data set characteristics.
3. The method of claim 2, wherein the dataset features include traditional meta-features and unsupervised cluster-based meta-features;
the extracting corresponding data set features from the target data set and the plurality of empirical data sets, respectively, comprises:
extracting traditional meta-features from the target data set, extracting corresponding meta-features based on unsupervised clustering from the target data set according to a preset unsupervised clustering algorithm, and combining the extracted traditional meta-features and the unsupervised-clustering-based meta-features to obtain the data set features of the target data set;
for each empirical data set in the plurality of empirical data sets, extracting traditional meta-features from the empirical data set, extracting corresponding meta-features based on unsupervised clustering from the empirical data set according to the preset unsupervised clustering algorithm, and combining the extracted traditional meta-features and the unsupervised-clustering-based meta-features to obtain the data set features of the empirical data set.
4. The method of claim 3, wherein the traditional meta-features include any one or more of the following features:
sample number, sample number logarithm, feature dimension logarithm, dataset dimension logarithm, inverse dataset dimension logarithm, minimum class prior probability, maximum class prior probability, average class prior probability, class prior probability standard deviation, minimum kurtosis coefficient, maximum kurtosis coefficient, average kurtosis coefficient, kurtosis coefficient standard deviation, minimum skewness coefficient, maximum skewness coefficient, average skewness coefficient, skewness coefficient standard deviation, PCA 95% principal component, first principal component skewness coefficient, and first principal component kurtosis coefficient;
wherein the dataset dimension is a ratio of the feature dimension to the number of samples;
the inverse dataset dimension is the inverse of the dataset dimension;
the minimum class prior probability is the minimum of the values obtained by dividing the number of samples in each class by the total number of samples;
the maximum class prior probability is the maximum of the values obtained by dividing the number of samples in each class by the total number of samples;
the average class prior probability is the average of the values obtained by dividing the number of samples in each class by the total number of samples;
the class prior probability standard deviation is the standard deviation of the values obtained by dividing the number of samples in each class by the total number of samples;
the kurtosis coefficient measures the shape of the tails of a data distribution relative to those of a normal distribution; the minimum kurtosis coefficient is the minimum of the kurtosis coefficients of all continuous features, the maximum kurtosis coefficient is their maximum, the average kurtosis coefficient is their average, and the kurtosis coefficient standard deviation is their standard deviation;
the skewness coefficient measures the symmetry of a data distribution about its mean; the minimum skewness coefficient is the minimum of the skewness coefficients of all continuous features, the maximum skewness coefficient is their maximum, the average skewness coefficient is their average, and the skewness coefficient standard deviation is their standard deviation;
the PCA meta-features characterize statistics of the principal components of the data set; the PCA 95% principal component is obtained by performing principal component analysis on the samples and retaining, in descending order of variance, the smallest number d' of principal components that together preserve 95% of the variance of the original data; with d the feature dimension, the meta-feature is the ratio d'/d;
the first principal component skewness coefficient is a skewness coefficient of the largest principal component of the PCA element characteristics, and the first principal component kurtosis coefficient is a kurtosis coefficient of the largest principal component of the PCA element characteristics.
5. The method of claim 3, wherein the unsupervised clustering-based meta-features comprise one or more of:
intra-class compactness;
degree of inter-class separation;
Davies-Bouldin index;
Dunn index.
6. The method of claim 3, wherein the preset unsupervised clustering algorithm comprises one or more unsupervised clustering algorithms;
when multiple unsupervised clustering algorithms are included, corresponding meta-features based on unsupervised clustering are extracted from the data set according to each unsupervised clustering algorithm.
7. The method of claim 1, wherein the selecting a set of hyper-parameters from the plurality of sets of hyper-parameters that fits the target dataset according to the prediction result corresponding to each set of hyper-parameters comprises:
determining classification intervals in each prediction result according to a maximum interval criterion;
and selecting a group of hyper-parameters corresponding to the prediction result with the largest classification interval.
8. An apparatus for implementing automatic semi-supervised machine learning, wherein the apparatus comprises:
an acquisition unit, configured to acquire a target data set, wherein part of the sample data in the target data set is labeled;
a first determining unit, configured to select an empirical data set similar to a target data set, and determine a semi-supervised machine learning algorithm used on the empirical data set as a semi-supervised machine learning algorithm of the target data set;
the selection unit is used for respectively carrying out model training and prediction on the target data set according to the semi-supervised machine learning algorithm and the corresponding groups of hyper-parameters to obtain a model and a prediction result corresponding to each group of hyper-parameters, and selecting a group of hyper-parameters suitable for the target data set from the groups of hyper-parameters according to the prediction result corresponding to each group of hyper-parameters;
a second determination unit for determining a model corresponding to the selected set of hyper-parameters suitable for the target dataset as a semi-supervised machine learning model of the target dataset.
9. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by one or more computing devices, implements the method of any of claims 1-7.
10. A system comprising one or more computing devices and one or more storage devices having a computer program recorded thereon, which, when executed by the one or more computing devices, causes the one or more computing devices to carry out the method of any one of claims 1-7.
CN201811341910.0A 2018-11-12 2018-11-12 Method and device for realizing automatic semi-supervised machine learning Active CN111178533B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811341910.0A CN111178533B (en) 2018-11-12 2018-11-12 Method and device for realizing automatic semi-supervised machine learning

Publications (2)

Publication Number Publication Date
CN111178533A true CN111178533A (en) 2020-05-19
CN111178533B CN111178533B (en) 2024-04-16

Family

ID=70655267

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811341910.0A Active CN111178533B (en) 2018-11-12 2018-11-12 Method and device for realizing automatic semi-supervised machine learning

Country Status (1)

Country Link
CN (1) CN111178533B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014123341A (en) * 2012-11-26 2014-07-03 Ricoh Co Ltd Information processor, information processing method, program and recording medium
WO2014186488A2 (en) * 2013-05-15 2014-11-20 Microsoft Corporation Tuning hyper-parameters of a computer-executable learning algorithm
WO2014205231A1 (en) * 2013-06-19 2014-12-24 The Regents Of The University Of Michigan Deep learning framework for generic object detection
CN106803099A (en) * 2016-12-29 2017-06-06 东软集团股份有限公司 A kind of data processing method and device
CN106991296A (en) * 2017-04-01 2017-07-28 大连理工大学 Ensemble classifier method based on the greedy feature selecting of randomization
CN108062587A (en) * 2017-12-15 2018-05-22 清华大学 The hyper parameter automatic optimization method and system of a kind of unsupervised machine learning
GB201805302D0 (en) * 2018-03-29 2018-05-16 Benevolentai Tech Limited Ensemble Model Creation And Selection

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
姚明海; 林宣民; 王宪保: "Semi-supervised transductive SVM incremental learning algorithm based on locality-sensitive hashing", Journal of Zhejiang University of Technology, no. 02, 9 April 2018 (2018-04-09) *
张建明; 孙春梅; 闫婷: "Semi-supervised active learning video annotation based on adaptive SVM", Computer Engineering, no. 08, 15 August 2013 (2013-08-15) *
胡庆辉; 丁立新; 何进荣: "Multi-kernel semi-supervised support vector machine learning method with L_p-norm constraint", Journal of Software, no. 11, 15 November 2013 (2013-11-15) *
陈诗国; 张道强: "Experimental comparison of semi-supervised dimensionality reduction methods", Journal of Software, no. 01, 5 November 2010 (2010-11-05) *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112804304A (en) * 2020-12-31 2021-05-14 平安科技(深圳)有限公司 Task node distribution method and device based on multi-point output model and related equipment
CN113051452A (en) * 2021-04-12 2021-06-29 清华大学 Operation and maintenance data feature selection method and device
CN113051452B (en) * 2021-04-12 2022-04-26 清华大学 Operation and maintenance data feature selection method and device
TWI789960B (en) * 2021-10-22 2023-01-11 東吳大學 A three stage recursive method using behavior finance rogo advisor model
CN114462621A (en) * 2022-01-06 2022-05-10 深圳安巽科技有限公司 Machine supervision learning method and device

Also Published As

Publication number Publication date
CN111178533B (en) 2024-04-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant