WO2021042556A1 - Classification model training method, apparatus and device, and computer-readable storage medium - Google Patents

Classification model training method, apparatus and device, and computer-readable storage medium

Info

Publication number
WO2021042556A1
WO2021042556A1 (PCT/CN2019/118247)
Authority
WO
WIPO (PCT)
Prior art keywords
features
classification model
discrete
sample data
feature
Prior art date
Application number
PCT/CN2019/118247
Other languages
French (fr)
Chinese (zh)
Inventor
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021042556A1 publication Critical patent/WO2021042556A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a classification model training method, apparatus, device, and computer-readable storage medium.
  • The main purpose of this application is to provide a classification model training method, apparatus, device, and computer-readable storage medium, aiming to solve the technical problem that existing classification models overfit or have low accuracy.
  • In a first aspect, the classification model training method includes the following steps: obtaining sample data, where the sample data includes labeled sample data and unlabeled sample data; processing the sample data with a feature extraction algorithm to obtain the features corresponding to the sample data, where the features include discrete features and continuous features, continuous features are in numerical form, and discrete features are in non-numerical form; processing the discrete features with a feature conversion method to convert them into continuous features; inputting the continuous features, together with the continuous features converted from the discrete features, into an autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data; constructing an initial classification model based on the labeled sample data and the hidden features, and performing label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm; and optimizing the initial classification model according to the prediction result in combination with a preset maximization step algorithm.
  • A second aspect of the present application provides a classification model training apparatus, which includes: a data acquisition module for acquiring sample data, where the sample data includes labeled sample data and unlabeled sample data; a feature extraction module for processing the sample data with a feature extraction algorithm to obtain the features corresponding to the sample data, where the features include discrete features (non-numerical form) and continuous features (numerical form); a feature conversion module for processing the discrete features with a feature conversion method and converting them into continuous features; a feature dimensionality reduction module for inputting the continuous features, together with the continuous features converted from the discrete features, into an autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data; a label prediction module for constructing an initial classification model based on the labeled sample data and the hidden features, and performing label prediction on the unlabeled sample data based on the initial classification model and the preset expectation step algorithm; a model optimization module for optimizing the initial classification model according to the prediction result in combination with the preset maximization step algorithm; and a model saving module for confirming that training is complete when the preset expectation step algorithm starts to converge, and saving the trained initial classification model.
  • A third aspect of the present application provides a classification model training device, including a memory and at least one processor interconnected by a line, with instructions stored in the memory; the at least one processor invokes the instructions in the memory so that the classification model training device executes the method of the first aspect.
  • A fourth aspect of the present application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the method of the first aspect.
  • The classification model training method, apparatus, device, and computer-readable storage medium provided in this application first obtain labeled and unlabeled sample data and extract the corresponding discrete and continuous features with a feature extraction algorithm; the discrete features are converted into continuous features, and all continuous features are input into the autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data; an initial classification model is constructed from the labeled sample data and the hidden features, and labels are predicted for the unlabeled sample data through the initial classification model and the preset expectation step algorithm; according to the prediction result, the initial classification model is optimized in combination with the preset maximization step algorithm, and when the preset expectation step algorithm starts to converge, training of the initial classification model is confirmed to be complete and the trained model is saved.
  • The classification model training method proposed in this application achieves effective dimensionality reduction of features through an autoencoding algorithm and, combined with the expectation-maximization (EM) algorithm, uses unlabeled sample data to improve the generalization ability of the classification model.
  • FIG. 1 is a schematic structural diagram of the classification model training device in the hardware operating environment involved in the embodiments of this application;
  • FIG. 2 is a schematic flowchart of an embodiment of the classification model training method in this application;
  • FIG. 3 is a schematic diagram of the functional modules of an embodiment of the classification model training apparatus in this application;
  • FIG. 4 is a schematic diagram of the functional units of the feature conversion module in an embodiment of the classification model training apparatus in this application;
  • FIG. 5 is a schematic diagram of the functional units of the label prediction module in an embodiment of the classification model training apparatus in this application;
  • FIG. 6 is a schematic diagram of the functional units of the model optimization module in an embodiment of the classification model training apparatus in this application;
  • FIG. 7 is a schematic structural diagram of the autoencoding algorithm in an embodiment of the classification model training method in this application.
  • The embodiments of this application provide a classification model training method, apparatus, device, and computer-readable storage medium, which achieve effective dimensionality reduction of features through an autoencoding algorithm and, combined with the expectation-maximization algorithm, use unlabeled sample data to improve the generalization ability of the classification model.
  • FIG. 1 is a schematic structural diagram of the classification model training device in the hardware operating environment involved in the solution of an embodiment of this application.
  • The classification model training device may include: a processor 1001 (for example, a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002.
  • The communication bus 1002 is used to implement connection and communication between these components.
  • The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard); optionally, the user interface 1003 may also include standard wired and wireless interfaces.
  • The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory.
  • The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.
  • As a computer storage medium, the memory 1005 may include an operating system, a network communication module, a user interface module, and a classification model training program.
  • The network interface 1004 is mainly used to connect to a back-end server and perform data communication with it;
  • the user interface 1003 is mainly used to connect to a client (user side) and perform data communication with it;
  • and the processor 1001 can be used to call the classification model training program stored in the memory 1005 and perform the operations of the classification model training methods described below.
  • Referring to FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of the classification model training method according to the present application.
  • In this embodiment, the classification model training method includes:
  • Step S10: Obtain sample data, where the sample data includes labeled sample data and unlabeled sample data.
  • In this embodiment, the sample data used to train the classification model is obtained first; it includes a large amount of unlabeled sample data and a small amount of labeled sample data.
  • Taking a crowd classification model as an example, a label represents the type of crowd corresponding to the sample data; for instance, the label of a sample may be "high-consumption group". The sample data should also include information about the people to be classified, such as personal background and consumption behavior.
  • Specifically, personal background information may include age, gender, occupation, income, city of residence, and educational background, while consumption behavior may include specific features such as the user's monthly expenditure.
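  • As a minimal sketch of how such sample data might be organized (the column names and values below are illustrative assumptions, not taken from the patent), labeled and unlabeled records can share one table whose label field is empty for the unlabeled majority:

```python
import pandas as pd

# Hypothetical sample records; "label" is None for the unlabeled majority.
samples = pd.DataFrame([
    {"age": 34, "gender": "F", "occupation": "engineer", "income": 25000,
     "city": "Shenzhen", "education": "master", "monthly_spend": 9000,
     "label": "high-consumption"},
    {"age": 41, "gender": "M", "occupation": "teacher", "income": 9000,
     "city": "Chengdu", "education": "undergraduate", "monthly_spend": 2500,
     "label": None},  # unlabeled sample
])

labeled = samples[samples["label"].notna()]
unlabeled = samples[samples["label"].isna()]
```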
  • Step S20: Process the sample data with a feature extraction algorithm to obtain the features corresponding to the sample data, where the features include discrete features and continuous features; continuous features are in numerical form and discrete features are in non-numerical form.
  • Processing the sample data mainly means performing feature extraction on it with a feature extraction algorithm.
  • Feature extraction algorithms include, but are not limited to, principal component analysis, independent component analysis, and linear discriminant analysis.
  • This embodiment does not restrict the feature extraction algorithm applied to the sample data.
  • The extracted features include discrete features and continuous features, where continuous features are in numerical form and discrete features are in non-numerical form.
  • For example, income in the sample data is a continuous feature, while city of residence is a discrete feature.
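  • To illustrate the split (continuing the hypothetical `samples` table from the sketch above, with pandas already imported there), numeric columns can be treated as continuous features and non-numeric columns as discrete ones:

```python
features = samples.drop(columns=["label"])
continuous_feats = features.select_dtypes(include="number")   # e.g. age, income, monthly_spend
discrete_feats = features.select_dtypes(exclude="number")     # e.g. gender, city, education
```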
  • Step S30: Process the discrete features with a feature conversion method and convert them into continuous features.
  • To facilitate training of the classification model, the extracted discrete features need to be converted into continuous features. In this embodiment, the conversion covers the following three situations (a combined code sketch follows the list):
  • 1. The discrete feature has an order relationship. For example, the discrete feature "level" may include "first level", "second level", and "third level"; such discrete features can be quantified directly and converted into continuous features.
  • 2. The discrete feature has no order relationship, and its number of discrete values is less than or equal to a preset number. For example, for the discrete feature "educational background", the discrete values include junior college, undergraduate, master, and doctorate; since the number of values is limited, such discrete features can be processed with the one-hot encoding method and converted into continuous features.
  • 3. The discrete feature has no order relationship, and its number of discrete values is greater than the preset number. For example, the discrete feature "city of residence" has many discrete values, so such features can be given derivative processing, converting the discrete feature "city of residence" into a continuous feature of a higher-level province or city.
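  • The following sketch illustrates the three conversion cases with pandas; the ordinal map, the preset count threshold, and the city-to-tier table are assumptions made for illustration, not values from the patent:

```python
import pandas as pd

PRESET_COUNT = 10  # assumed threshold on the number of discrete values

# Case 1: ordered discrete feature -> direct quantification.
levels = pd.Series(["first level", "third level", "second level"])
level_num = levels.map({"first level": 1, "second level": 2, "third level": 3})

# Case 2: unordered, few distinct values -> one-hot encoding.
education = pd.Series(["junior college", "undergraduate", "master", "doctorate"])
if education.nunique() <= PRESET_COUNT:
    education_onehot = pd.get_dummies(education, prefix="edu", dtype=float)

# Case 3: unordered, many distinct values -> derive a coarser feature
# (hypothetical city -> tier table; the patent derives a province/city level).
city = pd.Series(["Shenzhen", "Chengdu", "Foshan"])
city_tier = city.map({"Shenzhen": 1, "Chengdu": 2, "Foshan": 2})
```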
  • Step S40: Input the continuous features, together with the continuous features converted from the discrete features, into the autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data.
  • After feature extraction yields continuous and discrete features and the discrete features have been converted into continuous form, all continuous features are input into the autoencoding algorithm so that they can be reduced in dimensionality to hidden features.
  • The autoencoding algorithm is an unsupervised learning method that learns hidden features with a neural network, and its structure is symmetrical.
  • As shown in FIG. 7, the input to the autoencoding algorithm is the continuous features after feature conversion; the algorithm contains one or more hidden layers, and the output of the middle hidden layer is extracted as the dimensionality-reduced hidden features.
  • The specific process is as follows: the trained autoencoding algorithm encodes the input continuous features into hidden features and then decodes the hidden features to obtain output features close to the input continuous features, thereby achieving dimensionality reduction of the input continuous features.
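  • A minimal sketch of such a symmetric autoencoder in PyTorch, assuming a single bottleneck layer and mean-squared reconstruction loss; the layer sizes and training loop are illustrative, not specified by the patent:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Symmetric autoencoder: the bottleneck output is the hidden feature."""
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def forward(self, x):
        hidden = self.encoder(x)              # dimensionality-reduced hidden features
        return self.decoder(hidden), hidden   # reconstruction close to the input

model = AutoEncoder(in_dim=20, hidden_dim=5)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.randn(256, 20)                      # stand-in for the continuous features
for _ in range(100):                          # reconstruction training
    recon, _ = model(x)
    loss = loss_fn(recon, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

with torch.no_grad():
    _, hidden_features = model(x)             # hidden features passed on to the classifier
```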
  • Step S50: Construct an initial classification model based on the labeled sample data and the hidden features, and perform label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm.
  • On the basis of the hidden features output after dimensionality reduction, a classification model is constructed to realize semi-supervised learning with the expectation-maximization algorithm.
  • The expectation-maximization algorithm first establishes an initial classification model on the basis of the labeled sample data; specifically, the classification model in this embodiment is a Gaussian mixture model.
  • The established initial Gaussian mixture model predicts the unlabeled data, and the model is then optimized in combination with the labeled sample data to obtain the final Gaussian mixture model usable for crowd classification.
  • Specifically, assume the sample data contains k groups of labeled samples and u groups of unlabeled samples, so the data can be written as D = {(X_1, Y_1), (X_2, Y_2), …, (X_k, Y_k), X_{k+1}, X_{k+2}, …, X_{k+u}}, where X_i denotes the sample data and Y_i denotes the label of the i-th group of labeled samples.
  • The labels of different samples may be the same or different; X_{k+1}, X_{k+2}, …, X_{k+u} are the unlabeled samples.
  • Further, assume the dependent variable in the sample data covers m classes, i.e., the labels of the sample data take m distinct values, so m ≤ k.
  • In this embodiment, P(x) can be used to represent the probability value of a sample X_j on the i-th label.
  • The probability distribution of the Gaussian mixture model is shown in the following formula:

    p(x) = \sum_{i=1}^{m} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)

  • where π_i is the mixing coefficient, x is the feature vector, μ_i is the mean vector of x, and Σ_i is the covariance matrix.
  • Label prediction is then performed on the unlabeled sample data according to the initial Gaussian mixture model and the preset expectation step algorithm, determining the corresponding labels.
  • Step S60: According to the prediction result, optimize the initial classification model in combination with the preset maximization step algorithm.
  • After the labels corresponding to the unlabeled sample data have been determined through the initial Gaussian mixture model and the preset expectation step algorithm, the parameters of the whole initial Gaussian mixture model are further optimized through the preset maximization step algorithm, preventing the initial Gaussian mixture model from overfitting or making inaccurate label predictions.
  • Step S70: When it is detected that the preset expectation step algorithm starts to converge, confirm that training of the initial classification model is complete, and save the trained initial classification model.
  • The above process of predicting labels for the unlabeled samples with the preset expectation step algorithm and optimizing the parameters of the whole initial Gaussian mixture model with the preset maximization step algorithm is repeated until the preset expectation step algorithm starts to converge, at which point training of the classification model can be considered complete.
  • Further, in this embodiment, once the classification model has been trained, online prediction of crowd type can be performed with it.
  • For new sample data whose crowd type is to be predicted, the new data must first be preprocessed to obtain its corresponding feature information, and that feature information is input into the autoencoding algorithm for dimensionality reduction;
  • finally, the dimensionality-reduced features are input into the Gaussian mixture model to obtain the crowd-type classification prediction.
  • In this embodiment, labeled and unlabeled sample data are first acquired, and the corresponding discrete and continuous features are obtained with a feature extraction algorithm; the discrete features are converted into continuous features, and all continuous features are input into the autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data; an initial classification model is constructed from the labeled sample data and the hidden features, and labels are predicted for the unlabeled samples through the initial classification model and the preset expectation step algorithm; according to the prediction result, the initial classification model is optimized in combination with the preset maximization step algorithm, and when the preset expectation step algorithm starts to converge, training is confirmed complete and the trained initial classification model is saved.
  • The classification model training method proposed in this application achieves effective dimensionality reduction of features through an autoencoding algorithm and, combined with the expectation-maximization algorithm, uses unlabeled sample data to improve the generalization ability of the classification model.
  • Further, step S50 includes:
  • Step S501: Determine the initial parameters π_i, μ_i, and Σ_i of the initial classification model based on the labeled sample data and the hidden features, and construct the initial classification model from these initial parameters. The initial values of π_i, μ_i, and Σ_i are calculated as follows:

    \pi_i = \frac{1}{k} \sum_{j=1}^{k} \gamma_{ij}

    \mu_i = \frac{\sum_{j=1}^{k} \gamma_{ij} X_j}{\sum_{j=1}^{k} \gamma_{ij}}

    \Sigma_i = \frac{\sum_{j=1}^{k} \gamma_{ij} (X_j - \mu_i)(X_j - \mu_i)^{T}}{\sum_{j=1}^{k} \gamma_{ij}}

  • where Σ_i is the covariance matrix, X_j is the sample data, and γ_ij is the posterior probability containing the hidden features;
  • Step S502: In the initial classification model, perform label prediction on the unlabeled sample data through the preset expectation step algorithm, whose formula is as follows:

    \gamma_{ij} = \frac{\pi_i \, \mathcal{N}(X_j \mid \mu_i, \Sigma_i)}{\sum_{s=1}^{m} \pi_s \, \mathcal{N}(X_j \mid \mu_s, \Sigma_s)}

  • where π_i is the mixing coefficient.
  • In this embodiment, after the continuous features have been reduced in dimensionality through the autoencoding algorithm to obtain the hidden features contained in the sample data, the initial parameters π_i, μ_i, and Σ_i of the Gaussian mixture model are determined based on the labeled sample data and the hidden features.
  • Specifically, the initial values of the three parameters are calculated with the formulas given in step S501 above, where Σ_i is the covariance matrix, X_j is the sample data, and γ_ij is the posterior probability containing the hidden features.
  • Determining the initial parameters of the Gaussian mixture model from the labeled sample data and the hidden features yields the initial classification model. Label prediction is then performed on the unlabeled sample data with the initial classification model; understandably, the labels predicted at this point are not necessarily correct, so the initial classification model still needs to be optimized through the maximization step algorithm. Specifically, the formulas of the maximization step algorithm, applied over all k + u samples, are as follows:

    \mu_i \leftarrow \frac{\sum_{j=1}^{k+u} \gamma_{ij} X_j}{\sum_{j=1}^{k+u} \gamma_{ij}}

    \Sigma_i \leftarrow \frac{\sum_{j=1}^{k+u} \gamma_{ij} (X_j - \mu_i)(X_j - \mu_i)^{T}}{\sum_{j=1}^{k+u} \gamma_{ij}}

    \pi_i \leftarrow \frac{1}{k+u} \sum_{j=1}^{k+u} \gamma_{ij}

  • According to the prediction result, the initial parameters of the initial classification model are updated with the maximization step algorithm to form a new Gaussian mixture model; label prediction is then performed on the unlabeled sample data with the new Gaussian mixture model, and this alternation continues until the preset expectation step algorithm starts to converge, at which point model training can be considered complete.
  • In this embodiment, the hidden features obtained through dimensionality reduction are input into the expectation-maximization algorithm, and the classification model undergoes semi-supervised learning on both labeled and unlabeled sample data, which prevents the classification model from overfitting or underfitting and improves its generalization performance.
  • Referring to FIG. 3, FIG. 3 is a schematic diagram of the functional modules of an embodiment of the classification model training apparatus according to the present application.
  • In this embodiment, the classification model training apparatus includes:
  • the data acquisition module 10 is configured to acquire sample data, where the sample data includes labeled sample data and unlabeled sample data;
  • the feature extraction module 20 is configured to process the sample data with a feature extraction algorithm to obtain the features corresponding to the sample data, where the features include discrete features and continuous features, continuous features being in numerical form and discrete features in non-numerical form;
  • the feature conversion module 30 is configured to process the discrete features based on a feature conversion method, and convert the discrete features into continuous features;
  • the feature dimensionality reduction module 40 is configured to input the continuous features, together with the continuous features converted from the discrete features, into the autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data;
  • the label prediction module 50 is configured to construct an initial classification model based on the labeled sample data and the hidden features, and to perform label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm;
  • the model optimization module 60 is configured to optimize the initial classification model according to the prediction result in combination with a preset maximization step algorithm;
  • the model saving module 70 is configured to confirm that training of the initial classification model is complete when it is detected that the preset expectation step algorithm starts to converge, and to save the trained initial classification model.
  • Further, the feature conversion module 30 includes:
  • the quantization processing unit 301, configured to quantify the discrete features and convert them into continuous features when the discrete features have an order relationship;
  • the encoding processing unit 302, configured to process the discrete features with the one-hot encoding method and convert them into continuous features when the discrete features have no order relationship and the number of their discrete values is less than or equal to a preset number;
  • the derivation processing unit 303, configured to apply derivative processing to the discrete features and convert them into continuous features when the discrete features have no order relationship and the number of their discrete values is greater than the preset number.
  • Further, the label prediction module 50 includes:
  • the model construction unit 501, configured to determine the initial parameters π_i, μ_i, and Σ_i of the initial classification model based on the labeled sample data and the hidden features, and to construct the initial classification model from these initial parameters, the initial values being calculated with the formulas given in step S501 above, where Σ_i is the covariance matrix, X_j is the sample data, and γ_ij is the posterior probability containing the hidden features;
  • the label prediction unit 502, configured to perform label prediction on the unlabeled sample data in the initial classification model through the preset expectation step algorithm given in step S502 above, where π_i is the mixing coefficient.
  • Further, the model optimization module 60 includes:
  • the model optimization unit 601, configured to update the initial parameters of the initial classification model based on the preset maximization step algorithm formulas given above.
  • Further, the feature dimensionality reduction module 40 is specifically configured to encode the input continuous features to obtain initial hidden features, and to decode the initial hidden features to obtain the hidden features.
  • This application also provides a classification model training device, including a memory and at least one processor interconnected by a line, with instructions stored in the memory; the at least one processor invokes the instructions in the memory to cause the classification model training device to execute the steps of the above classification model training method.
  • This application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • The computer-readable storage medium stores computer instructions which, when executed on a computer, cause the computer to perform the steps of the above classification model training method, in which the sample data includes labeled sample data and unlabeled sample data, and the features of the sample data include discrete features and continuous features, the continuous features being in numerical form and the discrete features in non-numerical form.

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application relates to the technical field of artificial intelligence. Disclosed are a classification model training method, apparatus and device, and a computer-readable storage medium. The classification model training method comprises: acquiring sample data; obtaining, on the basis of a feature extraction algorithm, features corresponding to the sample data, wherein the features of the sample data comprise discrete features and continuous features; converting the discrete features into continuous features; inputting the continuous features into an autoencoder to obtain implicit features; constructing an initial classification model on the basis of labeled sample data and the implicit features, and performing label prediction on unlabeled sample data on the basis of the initial classification model and a preset expectation step algorithm; optimizing the initial classification model according to a prediction result combined with a preset maximization step algorithm; and when it is detected that the preset expectation step algorithm starts to converge, confirming that the training of the initial classification model is completed, and saving the trained initial classification model. By means of the present application, the generalization capability of a classification model is improved.

Description

Classification model training method, apparatus, device, and computer-readable storage medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 3, 2019, with application number 201910826406.8 and invention title "Classification model training method, apparatus, device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to classification model training methods, apparatuses, devices, and computer-readable storage media.

Background

In many data classification applications, such as text classification, image classification, and the mining of special customer groups, a large number of samples are required to train a classification model. Labeled samples are usually difficult to obtain automatically and generally require manual annotation, so the number of labeled samples among the training samples is usually small, and most samples are unlabeled. The inventor realized that in the process of training a classification model, the presence of a large number of unlabeled samples may cause the model to overfit or its accuracy to be low.

Summary of the invention

The main purpose of this application is to provide a classification model training method, apparatus, device, and computer-readable storage medium, aiming to solve the technical problem that existing classification models overfit or have low accuracy.

To achieve the above objective, a first aspect of this application provides a classification model training method that includes the following steps: obtaining sample data, where the sample data includes labeled sample data and unlabeled sample data; processing the sample data with a feature extraction algorithm to obtain the features corresponding to the sample data, where the features include discrete features and continuous features, continuous features are in numerical form, and discrete features are in non-numerical form; processing the discrete features with a feature conversion method to convert them into continuous features; inputting the continuous features, together with the continuous features converted from the discrete features, into an autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data; constructing an initial classification model based on the labeled sample data and the hidden features, and performing label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm; optimizing the initial classification model according to the prediction result in combination with a preset maximization step algorithm; and, when it is detected that the preset expectation step algorithm starts to converge, confirming that training of the initial classification model is complete and saving the trained initial classification model.

A second aspect of this application provides a classification model training apparatus that includes: a data acquisition module for acquiring sample data, where the sample data includes labeled sample data and unlabeled sample data; a feature extraction module for processing the sample data with a feature extraction algorithm to obtain the features corresponding to the sample data, where the features include discrete features (non-numerical form) and continuous features (numerical form); a feature conversion module for processing the discrete features with a feature conversion method and converting them into continuous features; a feature dimensionality reduction module for inputting the continuous features, together with the continuous features converted from the discrete features, into an autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data; a label prediction module for constructing an initial classification model based on the labeled sample data and the hidden features, and performing label prediction on the unlabeled sample data based on the initial classification model and the preset expectation step algorithm; a model optimization module for optimizing the initial classification model according to the prediction result in combination with the preset maximization step algorithm; and a model saving module for confirming, when it is detected that the preset expectation step algorithm starts to converge, that training of the initial classification model is complete and saving the trained initial classification model.

A third aspect of this application provides a classification model training device, including a memory and at least one processor interconnected by a line, with instructions stored in the memory; the at least one processor invokes the instructions in the memory so that the classification model training device executes the method of the first aspect.

A fourth aspect of this application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the method of the first aspect.

The classification model training method, apparatus, device, and computer-readable storage medium provided in this application first obtain labeled and unlabeled sample data and extract the corresponding discrete and continuous features with a feature extraction algorithm; the discrete features are converted into continuous features, and all continuous features are input into the autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data; an initial classification model is constructed from the labeled sample data and the hidden features, and labels are predicted for the unlabeled sample data through the initial classification model and the preset expectation step algorithm; according to the prediction result, the initial classification model is optimized in combination with the preset maximization step algorithm, and when the preset expectation step algorithm starts to converge, training of the initial classification model is confirmed to be complete and the trained model is saved. The classification model training method proposed in this application achieves effective dimensionality reduction of features through an autoencoding algorithm and, combined with the expectation-maximization (EM) algorithm, uses unlabeled sample data to improve the generalization ability of the classification model.
Description of the drawings

FIG. 1 is a schematic structural diagram of the classification model training device in the hardware operating environment involved in the embodiments of this application;

FIG. 2 is a schematic flowchart of an embodiment of the classification model training method in this application;

FIG. 3 is a schematic diagram of the functional modules of an embodiment of the classification model training apparatus in this application;

FIG. 4 is a schematic diagram of the functional units of the feature conversion module in an embodiment of the classification model training apparatus in this application;

FIG. 5 is a schematic diagram of the functional units of the label prediction module in an embodiment of the classification model training apparatus in this application;

FIG. 6 is a schematic diagram of the functional units of the model optimization module in an embodiment of the classification model training apparatus in this application;

FIG. 7 is a schematic structural diagram of the autoencoding algorithm in an embodiment of the classification model training method in this application.
Detailed description

The embodiments of this application provide a classification model training method, apparatus, device, and computer-readable storage medium, which achieve effective dimensionality reduction of features through an autoencoding algorithm and, combined with the expectation-maximization algorithm, use unlabeled sample data to improve the generalization ability of the classification model.

To enable those skilled in the art to better understand the solution of this application, the embodiments of this application are described below with reference to the accompanying drawings.

As shown in FIG. 1, FIG. 1 is a schematic structural diagram of the classification model training device in the hardware operating environment involved in the solution of an embodiment of this application.

As shown in FIG. 1, the classification model training device may include a processor 1001 (for example, a CPU), a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard); optionally, it may also include standard wired and wireless interfaces. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as a disk memory; optionally, it may also be a storage device independent of the aforementioned processor 1001.

As shown in FIG. 1, as a computer storage medium, the memory 1005 may include an operating system, a network communication module, a user interface module, and a classification model training program.

In the classification model training device shown in FIG. 1, the network interface 1004 is mainly used to connect to a back-end server and perform data communication with it; the user interface 1003 is mainly used to connect to a client (user side) and perform data communication with it; and the processor 1001 can be used to call the classification model training program stored in the memory 1005 and perform the operations of the embodiments of the classification model training method described below.
Referring to FIG. 2, FIG. 2 is a schematic flowchart of an embodiment of the classification model training method of this application. In this embodiment, the classification model training method includes:

Step S10: Obtain sample data, where the sample data includes labeled sample data and unlabeled sample data.

In this embodiment, the sample data used to train the classification model is obtained first; it includes a large amount of unlabeled sample data and a small amount of labeled sample data. Taking a crowd classification model as an example, a label represents the type of crowd corresponding to the sample data; for instance, the label of a sample may be "high-consumption group". The sample data should also include several kinds of information about the people to be classified, such as personal background and consumption behavior. Specifically, personal background information may include age, gender, occupation, income, city of residence, and educational background, while consumption behavior may include specific features such as the user's monthly expenditure.
Step S20: Process the sample data with a feature extraction algorithm to obtain the features corresponding to the sample data, where the features include discrete features and continuous features; continuous features are in numerical form and discrete features are in non-numerical form.

Further, processing the sample data mainly means performing feature extraction on it with a feature extraction algorithm. Feature extraction algorithms include, but are not limited to, principal component analysis, independent component analysis, and linear discriminant analysis; this embodiment does not restrict the feature extraction algorithm applied to the sample data.

In this embodiment, the extracted features include discrete features and continuous features, where continuous features are in numerical form and discrete features are in non-numerical form. For example, income in the sample data is a continuous feature, while city of residence is a discrete feature.
Step S30: Process the discrete features with a feature conversion method and convert them into continuous features.

Further, to facilitate training of the classification model, the extracted discrete features need to be converted into continuous features. In this embodiment, the conversion covers the following three situations:

1. The discrete feature has an order relationship. For example, the discrete feature "level" may include "first level", "second level", and "third level"; such discrete features can be quantified directly and converted into continuous features.

2. The discrete feature has no order relationship, and its number of discrete values is less than or equal to a preset number. For example, for the discrete feature "educational background", the discrete values include junior college, undergraduate, master, and doctorate; since the number of values is limited, such discrete features can be processed with the one-hot encoding method and converted into continuous features.

3. The discrete feature has no order relationship, and its number of discrete values is greater than the preset number. For example, the discrete feature "city of residence" has many discrete values, so such features can be given derivative processing, converting the discrete feature "city of residence" into a continuous feature of a higher-level province or city.
Step S40: Input the continuous features, together with the continuous features converted from the discrete features, into the autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data.

After feature extraction of the sample data yields continuous and discrete features, and the discrete features have been converted into continuous form, all continuous features are input into the autoencoding algorithm so that they can be reduced in dimensionality to hidden features.

The autoencoding algorithm is an unsupervised learning method that learns hidden features with a neural network, and its structure is symmetrical. As shown in FIG. 7, the input to the autoencoding algorithm is the continuous features after feature conversion; the algorithm contains one or more hidden layers, and the output of the middle hidden layer is extracted as the dimensionality-reduced hidden features. The specific process is as follows: the trained autoencoding algorithm encodes the input continuous features into hidden features and then decodes the hidden features to obtain output features close to the input continuous features, thereby achieving dimensionality reduction of the input continuous features.
Step S50: Construct an initial classification model based on the labeled sample data and the hidden features, and perform label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm.

Further, on the basis of the hidden features output after dimensionality reduction, a classification model is constructed to realize semi-supervised learning with the expectation-maximization algorithm. Specifically, the expectation-maximization algorithm first establishes an initial classification model on the basis of the labeled sample data; the classification model in this embodiment is a Gaussian mixture model. The established initial Gaussian mixture model predicts the unlabeled data, and the model is optimized in combination with the labeled sample data to obtain the final Gaussian mixture model usable for crowd classification.

Specifically, in this embodiment, assume the sample data contains k groups of labeled samples and u groups of unlabeled samples, so the sample data can be written as D = {(X_1, Y_1), (X_2, Y_2), …, (X_k, Y_k), X_{k+1}, X_{k+2}, …, X_{k+u}}, where X_i denotes the sample data and Y_i denotes the label of the i-th group of labeled samples; the labels of different samples may be the same or different, and X_{k+1}, X_{k+2}, …, X_{k+u} are the unlabeled samples.

Further, assume the dependent variable in the sample data covers m classes, i.e., the labels of the sample data take m distinct values, so m ≤ k. In this embodiment, P(x) can be used to represent the probability value of a sample X_j on the i-th label, and the probability distribution of the Gaussian mixture model is given by the following formula:

    p(x) = \sum_{i=1}^{m} \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)

where π_i is the mixing coefficient, x is the feature vector, μ_i is the mean vector of x, and Σ_i is the covariance matrix.

For a labeled sample X_i, the probability value on its label Y_i is 1, and the probability value on the other class labels is 0.

In this embodiment, label prediction is performed on the unlabeled sample data according to the initial Gaussian mixture model and the preset expectation step algorithm, determining the corresponding labels.
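As a minimal sketch of this expectation step (an assumed implementation using scipy's multivariate normal density, not code from the patent), the responsibilities γ_ij of each class for each unlabeled sample can be computed, and the predicted label taken as the class with the largest responsibility:

```python
import numpy as np
from scipy.stats import multivariate_normal

def expectation_step(X_unlabeled, pi, mu, sigma):
    """E-step: responsibilities gamma[i, j] = P(label i | sample X_j)."""
    m = len(pi)
    n = X_unlabeled.shape[0]
    gamma = np.zeros((m, n))
    for i in range(m):
        gamma[i] = pi[i] * multivariate_normal.pdf(X_unlabeled, mean=mu[i], cov=sigma[i])
    gamma /= gamma.sum(axis=0, keepdims=True)   # normalize over the m classes
    return gamma

# Predicted label for each unlabeled sample: the class with maximal responsibility.
# gamma = expectation_step(X_unlabeled, pi, mu, sigma)
# predicted_labels = gamma.argmax(axis=0)
```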
Step S60: According to the prediction result, optimize the initial classification model in combination with the preset maximization step algorithm.

After the labels corresponding to the unlabeled sample data have been determined through the initial Gaussian mixture model and the preset expectation step algorithm, the parameters of the whole initial Gaussian mixture model are further optimized through the preset maximization step algorithm, preventing the initial Gaussian mixture model from overfitting or making inaccurate label predictions.

Step S70: When it is detected that the preset expectation step algorithm starts to converge, confirm that training of the initial classification model is complete, and save the trained initial classification model.

The above process of predicting labels for the unlabeled sample data with the preset expectation step algorithm and optimizing the parameters of the whole initial Gaussian mixture model with the preset maximization step algorithm is repeated until the preset expectation step algorithm starts to converge, at which point training of the classification model can be considered complete.
Further, in this embodiment, once the classification model has been trained, online prediction of crowd type can be performed with the trained model. For new sample data whose crowd type needs to be classified and predicted, the new sample data must first be preprocessed to obtain its corresponding feature information; the corresponding feature information is then input into the autoencoding algorithm for dimensionality reduction; finally, the dimensionality-reduced features are input into the Gaussian mixture model to obtain the classification prediction of the crowd type.
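A sketch of this online prediction pipeline, reusing the illustrative `AutoEncoder` and `expectation_step` from the sketches above; `preprocess` stands in for the feature extraction and conversion steps and is an assumption, not an API from the patent:

```python
import numpy as np
import torch

def predict_crowd_type(new_records, preprocess, autoencoder, pi, mu, sigma):
    """Online prediction: preprocess -> autoencoder bottleneck -> GMM class.

    `preprocess`, `autoencoder`, and the GMM parameters come from the
    training stage sketched earlier; all names here are illustrative.
    """
    features = preprocess(new_records)          # feature extraction + conversion
    with torch.no_grad():
        _, hidden = autoencoder(torch.as_tensor(features, dtype=torch.float32))
    gamma = expectation_step(hidden.numpy(), pi, mu, sigma)  # class responsibilities
    return gamma.argmax(axis=0)                 # most probable crowd type per record
```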
In this embodiment, labeled and unlabeled sample data are first acquired, and the corresponding discrete and continuous features are obtained with a feature extraction algorithm; the discrete features are converted into continuous features, and all continuous features are input into the autoencoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data; an initial classification model is constructed from the labeled sample data and the hidden features, and labels are predicted for the unlabeled sample data through the initial classification model and the preset expectation step algorithm; according to the prediction result, the initial classification model is optimized in combination with the preset maximization step algorithm, and when the preset expectation step algorithm starts to converge, training of the initial classification model is confirmed to be complete and the trained model is saved. The classification model training method proposed in this application achieves effective dimensionality reduction of features through an autoencoding algorithm and, combined with the expectation-maximization algorithm, uses unlabeled sample data to improve the generalization ability of the classification model.
Further, step S50 includes:

Step S501: Determine the initial parameters π_i, μ_i, and Σ_i of the initial classification model based on the labeled sample data and the hidden features, and construct the initial classification model from these initial parameters. The initial values of π_i, μ_i, and Σ_i are calculated with the following formulas:

    \pi_i = \frac{1}{k} \sum_{j=1}^{k} \gamma_{ij}

    \mu_i = \frac{\sum_{j=1}^{k} \gamma_{ij} X_j}{\sum_{j=1}^{k} \gamma_{ij}}

    \Sigma_i = \frac{\sum_{j=1}^{k} \gamma_{ij} (X_j - \mu_i)(X_j - \mu_i)^{T}}{\sum_{j=1}^{k} \gamma_{ij}}

where Σ_i is the covariance matrix, X_j is the sample data, and γ_ij is the posterior probability containing the hidden features;

Step S502: In the initial classification model, perform label prediction on the unlabeled sample data through the preset expectation step algorithm, whose formula is as follows:

    \gamma_{ij} = \frac{\pi_i \, \mathcal{N}(X_j \mid \mu_i, \Sigma_i)}{\sum_{s=1}^{m} \pi_s \, \mathcal{N}(X_j \mid \mu_s, \Sigma_s)}

where π_i is the mixing coefficient.
In this embodiment, after the continuous features have been reduced in dimensionality through the autoencoding algorithm to obtain the hidden features contained in the sample data, the initial parameters π_i, μ_i, and Σ_i of the Gaussian mixture model are determined based on the labeled sample data and the hidden features. Specifically, the initial values of the three parameters are calculated with the formulas given in step S501 above, where Σ_i is the covariance matrix, X_j is the sample data, and γ_ij is the posterior probability containing the hidden features; for a labeled sample, γ_ij is 1 on its own label and 0 on the others.
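Under that convention (γ_ij equal to 1 on a labeled sample's own label and 0 elsewhere), a sketch of the initial parameter estimates on the labeled hidden features might look as follows; the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def initial_parameters(H_labeled, y_labeled, m):
    """Initial GMM parameters from labeled hidden features.

    H_labeled: (k, d) hidden features of the k labeled samples.
    y_labeled: (k,) integer labels in {0, ..., m-1}; assumes every class
               appears at least once among the labeled samples.
    """
    k, d = H_labeled.shape
    gamma = np.zeros((m, k))
    gamma[y_labeled, np.arange(k)] = 1.0       # posterior is 1 on the true label
    counts = gamma.sum(axis=1)                  # labeled samples per class
    pi = counts / k
    mu = (gamma @ H_labeled) / counts[:, None]
    sigma = np.stack([
        (gamma[i] * (H_labeled - mu[i]).T) @ (H_labeled - mu[i]) / counts[i]
        for i in range(m)
    ])
    return pi, mu, sigma
```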
通过带标签样本数据和隐含特征确定高斯混合模型的初始参数,即可构建初始分类模型。基于初始分类模型对不带标签样本数据进行标签预测,可以理解的是,此时预测出的标签不一定是正确的,因此,还需要通过最大化步骤算法对初始分类模型进行优化。具体地,最大化步骤算法的公式如下:By determining the initial parameters of the Gaussian mixture model through the labeled sample data and hidden features, the initial classification model can be constructed. Based on the initial classification model to perform label prediction on the unlabeled sample data, it can be understood that the predicted label at this time may not be correct. Therefore, the initial classification model needs to be optimized through the maximization step algorithm. Specifically, the formula of the maximization step algorithm is as follows:
$$\mu_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i^{\text{new}})(X_j-\mu_i^{\text{new}})^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\pi_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$
According to the prediction results, the initial parameters of the initial classification model are updated based on the maximization step algorithm to form a new Gaussian mixture model, and label prediction is performed on the unlabeled sample data based on the new Gaussian mixture model; this alternation continues until the preset expectation step algorithm starts to converge, at which point the model training can be regarded as complete.
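By way of non-limiting example, the alternation of expectation and maximization steps, together with a convergence check on the log-likelihood, may be sketched as follows (the tolerance, iteration cap, and the small ridge term added to the covariances are illustrative assumptions):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_semi_supervised_gmm(Z, pi, mu, sigma, tol=1e-4, max_iter=100):
    """Alternate E- and M-steps over all hidden features Z (labeled and
    unlabeled together) until the log-likelihood improvement drops below tol."""
    N, d = Z.shape
    K = len(pi)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: weighted component densities and responsibilities
        dens = np.stack([pi[i] * multivariate_normal.pdf(Z, mu[i], sigma[i])
                         for i in range(K)])              # shape (K, N)
        gamma = dens / dens.sum(axis=0, keepdims=True)
        # M-step: re-estimate the mixture parameters
        Nk = gamma.sum(axis=1)                            # effective counts
        pi = Nk / N
        mu = (gamma @ Z) / Nk[:, None]
        for i in range(K):
            diff = Z - mu[i]
            sigma[i] = ((gamma[i, :, None] * diff).T @ diff) / Nk[i] \
                       + 1e-6 * np.eye(d)
        # convergence check on the observed-data log-likelihood
        ll = np.log(dens.sum(axis=0)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return pi, mu, sigma
```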
In this embodiment, the hidden features obtained through dimensionality reduction are input into the expectation-maximization algorithm, and the classification model is trained in a semi-supervised manner on both the labeled and the unlabeled sample data, which prevents the classification model from over-fitting or under-fitting and improves the generalization performance of the classification model.
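After convergence, classifying a sample reduces to taking the component with the largest posterior; a short usage sketch, reusing the illustrative expectation_step above:

```python
def predict_labels(Z_new, pi, mu, sigma):
    """Assign each sample to the Gaussian component with the highest posterior."""
    gamma = expectation_step(Z_new, pi, mu, sigma)
    return gamma.argmax(axis=0)
```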
Referring to FIG. 3, FIG. 3 is a schematic diagram of the functional modules of an embodiment of the classification model training apparatus of the present application.

In this embodiment, the classification model training apparatus includes:

a data acquisition module 10, configured to acquire sample data, where the sample data includes labeled sample data and unlabeled sample data;

a feature extraction module 20, configured to process the sample data based on a feature extraction algorithm to obtain the features corresponding to the sample data, where the features of the sample data include discrete features and continuous features, the continuous features are in numerical form, and the discrete features are in non-numerical form;

a feature conversion module 30, configured to process the discrete features based on a feature conversion method and convert the discrete features into continuous features;

a feature dimensionality reduction module 40, configured to input the continuous features and the continuous features obtained by converting the discrete features into an auto-encoding algorithm for dimensionality reduction, to obtain the hidden features corresponding to the sample data;

a label prediction module 50, configured to construct an initial classification model based on the labeled sample data and the hidden features, and to perform label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm;

a model optimization module 60, configured to optimize the initial classification model according to the prediction results in combination with a preset maximization step algorithm; and

a model saving module 70, configured to confirm, when it is detected that the preset expectation step algorithm starts to converge, that the training of the initial classification model is complete, and to save the trained initial classification model.
Further, referring to FIG. 4, the feature conversion module 30 includes:

a quantization processing unit 301, configured to quantize the discrete features and convert them into continuous features if the discrete features have an ordinal relationship;

an encoding processing unit 302, configured to process the discrete features based on one-hot encoding and convert them into continuous features if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is less than or equal to a preset number; and

a derivation processing unit 303, configured to perform derivation processing on the discrete features and convert them into continuous features if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is greater than the preset number (a non-limiting sketch of these three conversion branches is given below).
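By way of non-limiting example, the three conversion branches may be sketched in Python with pandas; the threshold name MAX_ONE_HOT stands in for the preset number, and frequency encoding is used as one possible form of derivation processing (both are illustrative assumptions):

```python
import pandas as pd

MAX_ONE_HOT = 10  # stands in for the preset number of discrete values

def convert_discrete(df, col, order=None):
    """Convert one discrete column into continuous form along the three
    branches above: ordinal quantization, one-hot encoding, or a derived
    numeric feature for high-cardinality columns."""
    if order is not None:                       # ordinal: map ranks to numbers
        return df[col].map({v: i for i, v in enumerate(order)}).to_frame(col)
    if df[col].nunique() <= MAX_ONE_HOT:        # low cardinality: one-hot
        return pd.get_dummies(df[col], prefix=col)
    # high cardinality: derivation, here frequency encoding as one example
    freq = df[col].value_counts(normalize=True)
    return df[col].map(freq).to_frame(col)
```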
Further, referring to FIG. 5, the label prediction module 50 includes:

a model construction unit 501, configured to determine initial parameters π_i, μ_i, and Σ_i of the initial classification model based on the labeled sample data and the hidden features, and to construct the initial classification model based on the initial parameters, where the formulas for calculating the initial values of π_i, μ_i, and Σ_i are as follows:
$$\pi_i = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$

$$\mu_i = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i)(X_j-\mu_i)^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$
where Σ is the covariance matrix, X_j is the sample data, N is the total number of samples, and γ_ij is the posterior probability defined over the hidden features; and
a label prediction unit 502, configured to perform, in the initial classification model, label prediction on the unlabeled sample data through the preset expectation step algorithm, where the formula of the preset expectation step algorithm is as follows:
$$\gamma_{ij} = \frac{\pi_i\,\mathcal{N}(X_j \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(X_j \mid \mu_k, \Sigma_k)}$$
where π_i is the mixing coefficient.
Further, referring to FIG. 6, the model optimization module 60 includes:

a model optimization unit 601, configured to obtain the formulas of the preset maximization step algorithm, which are as follows:
$$\mu_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i^{\text{new}})(X_j-\mu_i^{\text{new}})^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\pi_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$
and, according to the prediction results, to update the initial parameters of the initial classification model based on these formulas.
Further, the feature dimensionality reduction module 40 is specifically configured to:

input the continuous features into the auto-encoding algorithm for dimensionality reduction to obtain initial hidden features; and

decode the initial hidden features to obtain the hidden features (a non-limiting auto-encoder sketch is given below).
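By way of non-limiting example, a minimal auto-encoder for this dimensionality reduction step may be sketched with PyTorch as follows; the framework choice, hidden dimension, epoch count, and learning rate are illustrative assumptions. In this sketch the encoder output plays the role of the initial hidden features, and the decoder drives the reconstruction training:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress the concatenated continuous features into a
    lower-dimensional hidden representation."""
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def forward(self, x):
        z = self.encoder(x)            # initial hidden features
        return self.decoder(z), z      # reconstruction and hidden code

def reduce_dimension(X, hidden_dim=16, epochs=50, lr=1e-3):
    """Train by minimizing reconstruction error, then return the codes."""
    model = AutoEncoder(X.shape[1], hidden_dim)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    X = torch.as_tensor(X, dtype=torch.float32)
    for _ in range(epochs):
        recon, _ = model(X)
        loss = nn.functional.mse_loss(recon, X)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        _, Z = model(X)
    return Z.numpy()
```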
The present application further provides a classification model training device, including a memory and at least one processor, where instructions are stored in the memory, and the memory and the at least one processor are interconnected through a line; the at least one processor invokes the instructions in the memory to cause the classification model training device to execute the steps of the classification model training method described above.
The present application further provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions that, when run on a computer, cause the computer to execute the following steps:
acquiring sample data, where the sample data includes labeled sample data and unlabeled sample data;

processing the sample data based on a feature extraction algorithm to obtain the features corresponding to the sample data, where the features of the sample data include discrete features and continuous features, the continuous features are in numerical form, and the discrete features are in non-numerical form;

processing the discrete features based on a feature conversion method, and converting the discrete features into continuous features;

inputting the continuous features and the continuous features obtained by converting the discrete features into an auto-encoding algorithm for dimensionality reduction, to obtain the hidden features corresponding to the sample data;

constructing an initial classification model based on the labeled sample data and the hidden features, and performing label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm;

optimizing the initial classification model according to the prediction results in combination with a preset maximization step algorithm; and

when it is detected that the preset expectation step algorithm starts to converge, confirming that the training of the initial classification model is complete, and saving the trained initial classification model.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

1. A classification model training method, comprising the following steps:

acquiring sample data, where the sample data includes labeled sample data and unlabeled sample data;

processing the sample data based on a feature extraction algorithm to obtain features corresponding to the sample data, where the features of the sample data include discrete features and continuous features, the continuous features are in numerical form, and the discrete features are in non-numerical form;

processing the discrete features based on a feature conversion method, and converting the discrete features into continuous features;

inputting the continuous features and the continuous features obtained by converting the discrete features into an auto-encoding algorithm for dimensionality reduction, to obtain hidden features corresponding to the sample data;

constructing an initial classification model based on the labeled sample data and the hidden features, and performing label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm;

optimizing the initial classification model according to the prediction results in combination with a preset maximization step algorithm; and

when it is detected that the preset expectation step algorithm starts to converge, confirming that the training of the initial classification model is complete, and saving the trained initial classification model.
2. The classification model training method according to claim 1, wherein processing the discrete features based on the feature conversion method and converting the discrete features into continuous features comprises:

if the discrete features have an ordinal relationship, quantizing the discrete features and converting the discrete features into continuous features;

if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is less than or equal to a preset number, processing the discrete features based on one-hot encoding and converting the discrete features into continuous features; and

if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is greater than the preset number, performing derivation processing on the discrete features and converting the discrete features into continuous features.
3. The classification model training method according to claim 1, wherein constructing the initial classification model based on the labeled sample data and the hidden features, and performing label prediction on the unlabeled sample data based on the initial classification model and the preset expectation step algorithm, comprises:

determining initial parameters π_i, μ_i, and Σ_i of the initial classification model based on the labeled sample data and the hidden features, and constructing the initial classification model based on the initial parameters, where the formulas for calculating the initial values of π_i, μ_i, and Σ_i are as follows:
$$\pi_i = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$

$$\mu_i = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i)(X_j-\mu_i)^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$
where Σ is the covariance matrix, X_j is the sample data, N is the total number of samples, and γ_ij is the posterior probability defined over the hidden features; and

in the initial classification model, performing label prediction on the unlabeled sample data through the preset expectation step algorithm, where the formula of the preset expectation step algorithm is as follows:
$$\gamma_{ij} = \frac{\pi_i\,\mathcal{N}(X_j \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(X_j \mid \mu_k, \Sigma_k)}$$
where π_i is the mixing coefficient.
4. The classification model training method according to claim 3, wherein optimizing the initial classification model according to the prediction results in combination with the preset maximization step algorithm comprises:

obtaining the formulas of the preset maximization step algorithm, which are as follows:
$$\mu_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i^{\text{new}})(X_j-\mu_i^{\text{new}})^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\pi_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$
and, according to the prediction results, updating the initial parameters of the initial classification model based on the formulas.
5. The classification model training method according to claim 1, wherein inputting the continuous features and the continuous features obtained by converting the discrete features into the auto-encoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data comprises:

inputting the continuous features into the auto-encoding algorithm for dimensionality reduction to obtain initial hidden features; and

decoding the initial hidden features to obtain the hidden features.
6. A classification model training apparatus, comprising:

a data acquisition module, configured to acquire sample data, where the sample data includes labeled sample data and unlabeled sample data;

a feature extraction module, configured to process the sample data based on a feature extraction algorithm to obtain features corresponding to the sample data, where the features of the sample data include discrete features and continuous features, the continuous features are in numerical form, and the discrete features are in non-numerical form;

a feature conversion module, configured to process the discrete features based on a feature conversion method and convert the discrete features into continuous features;

a feature dimensionality reduction module, configured to input the continuous features and the continuous features obtained by converting the discrete features into an auto-encoding algorithm for dimensionality reduction, to obtain hidden features corresponding to the sample data;

a label prediction module, configured to construct an initial classification model based on the labeled sample data and the hidden features, and to perform label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm;

a model optimization module, configured to optimize the initial classification model according to the prediction results in combination with a preset maximization step algorithm; and

a model saving module, configured to confirm, when it is detected that the preset expectation step algorithm starts to converge, that the training of the initial classification model is complete, and to save the trained initial classification model.
7. The classification model training apparatus according to claim 6, wherein the feature conversion module comprises:

a quantization processing unit, configured to quantize the discrete features and convert the discrete features into continuous features if the discrete features have an ordinal relationship;

an encoding processing unit, configured to process the discrete features based on one-hot encoding and convert the discrete features into continuous features if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is less than or equal to a preset number; and

a derivation processing unit, configured to perform derivation processing on the discrete features and convert the discrete features into continuous features if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is greater than the preset number.
8. The classification model training apparatus according to claim 6, wherein the label prediction module comprises:

a model construction unit, configured to determine initial parameters π_i, μ_i, and Σ_i of the initial classification model based on the labeled sample data and the hidden features, and to construct the initial classification model based on the initial parameters, where the formulas for calculating the initial values of π_i, μ_i, and Σ_i are as follows:
$$\pi_i = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$

$$\mu_i = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i)(X_j-\mu_i)^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$
where Σ is the covariance matrix, X_j is the sample data, N is the total number of samples, and γ_ij is the posterior probability defined over the hidden features; and

a label prediction unit, configured to perform, in the initial classification model, label prediction on the unlabeled sample data through the preset expectation step algorithm, where the formula of the preset expectation step algorithm is as follows:
$$\gamma_{ij} = \frac{\pi_i\,\mathcal{N}(X_j \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(X_j \mid \mu_k, \Sigma_k)}$$
where π_i is the mixing coefficient.
9. The classification model training apparatus according to claim 8, wherein the model optimization module comprises:

a model optimization unit, configured to obtain the formulas of the preset maximization step algorithm, which are as follows:
$$\mu_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i^{\text{new}})(X_j-\mu_i^{\text{new}})^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\pi_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$
and, according to the prediction results, to update the initial parameters of the initial classification model based on the formulas.
10. The classification model training apparatus according to claim 6, wherein the feature dimensionality reduction module is specifically configured to:

input the continuous features into the auto-encoding algorithm for dimensionality reduction to obtain initial hidden features; and

decode the initial hidden features to obtain the hidden features.
11. A classification model training device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the following steps:

acquiring sample data, where the sample data includes labeled sample data and unlabeled sample data;

processing the sample data based on a feature extraction algorithm to obtain features corresponding to the sample data, where the features of the sample data include discrete features and continuous features, the continuous features are in numerical form, and the discrete features are in non-numerical form;

processing the discrete features based on a feature conversion method, and converting the discrete features into continuous features;

inputting the continuous features and the continuous features obtained by converting the discrete features into an auto-encoding algorithm for dimensionality reduction, to obtain hidden features corresponding to the sample data;

constructing an initial classification model based on the labeled sample data and the hidden features, and performing label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm;

optimizing the initial classification model according to the prediction results in combination with a preset maximization step algorithm; and

when it is detected that the preset expectation step algorithm starts to converge, confirming that the training of the initial classification model is complete, and saving the trained initial classification model.
12. The classification model training device according to claim 11, wherein, when executing the computer program to implement processing the discrete features based on the feature conversion method and converting the discrete features into continuous features, the processor implements the following steps:

if the discrete features have an ordinal relationship, quantizing the discrete features and converting the discrete features into continuous features;

if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is less than or equal to a preset number, processing the discrete features based on one-hot encoding and converting the discrete features into continuous features; and

if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is greater than the preset number, performing derivation processing on the discrete features and converting the discrete features into continuous features.
13. The classification model training device according to claim 11, wherein, when executing the computer program to implement constructing the initial classification model based on the labeled sample data and the hidden features and performing label prediction on the unlabeled sample data based on the initial classification model and the preset expectation step algorithm, the processor implements the following steps:

determining initial parameters π_i, μ_i, and Σ_i of the initial classification model based on the labeled sample data and the hidden features, and constructing the initial classification model based on the initial parameters, where the formulas for calculating the initial values of π_i, μ_i, and Σ_i are as follows:
$$\pi_i = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$

$$\mu_i = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i)(X_j-\mu_i)^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$
where Σ is the covariance matrix, X_j is the sample data, N is the total number of samples, and γ_ij is the posterior probability defined over the hidden features; and

in the initial classification model, performing label prediction on the unlabeled sample data through the preset expectation step algorithm, where the formula of the preset expectation step algorithm is as follows:
$$\gamma_{ij} = \frac{\pi_i\,\mathcal{N}(X_j \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(X_j \mid \mu_k, \Sigma_k)}$$
where π_i is the mixing coefficient.
14. The classification model training device according to claim 13, wherein, when executing the computer program to implement optimizing the initial classification model according to the prediction results in combination with the preset maximization step algorithm, the processor implements the following steps:

obtaining the formulas of the preset maximization step algorithm, which are as follows:
$$\mu_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i^{\text{new}})(X_j-\mu_i^{\text{new}})^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\pi_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$
and, according to the prediction results, updating the initial parameters of the initial classification model based on the formulas.
15. The classification model training device according to claim 11, wherein, when executing the computer program to implement inputting the continuous features and the continuous features obtained by converting the discrete features into the auto-encoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data, the processor implements the following steps:

inputting the continuous features into the auto-encoding algorithm for dimensionality reduction to obtain initial hidden features; and

decoding the initial hidden features to obtain the hidden features.
16. A computer-readable storage medium storing computer instructions, wherein, when the computer instructions are run on a computer, the computer is caused to execute the following steps:

acquiring sample data, where the sample data includes labeled sample data and unlabeled sample data;

processing the sample data based on a feature extraction algorithm to obtain features corresponding to the sample data, where the features of the sample data include discrete features and continuous features, the continuous features are in numerical form, and the discrete features are in non-numerical form;

processing the discrete features based on a feature conversion method, and converting the discrete features into continuous features;

inputting the continuous features and the continuous features obtained by converting the discrete features into an auto-encoding algorithm for dimensionality reduction, to obtain hidden features corresponding to the sample data;

constructing an initial classification model based on the labeled sample data and the hidden features, and performing label prediction on the unlabeled sample data based on the initial classification model and a preset expectation step algorithm;

optimizing the initial classification model according to the prediction results in combination with a preset maximization step algorithm; and

when it is detected that the preset expectation step algorithm starts to converge, confirming that the training of the initial classification model is complete, and saving the trained initial classification model.
17. The computer-readable storage medium according to claim 16, wherein, when the computer instructions are run on the computer to process the discrete features based on the feature conversion method and convert the discrete features into continuous features, the computer is caused to execute the following steps:

if the discrete features have an ordinal relationship, quantizing the discrete features and converting the discrete features into continuous features;

if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is less than or equal to a preset number, processing the discrete features based on one-hot encoding and converting the discrete features into continuous features; and

if the discrete features have a non-ordinal relationship and the number of discrete values of the discrete features is greater than the preset number, performing derivation processing on the discrete features and converting the discrete features into continuous features.
18. The computer-readable storage medium according to claim 16, wherein, when the computer instructions are run on the computer to construct the initial classification model based on the labeled sample data and the hidden features and to perform label prediction on the unlabeled sample data based on the initial classification model and the preset expectation step algorithm, the computer is caused to execute the following steps:

determining initial parameters π_i, μ_i, and Σ_i of the initial classification model based on the labeled sample data and the hidden features, and constructing the initial classification model based on the initial parameters, where the formulas for calculating the initial values of π_i, μ_i, and Σ_i are as follows:
$$\pi_i = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$

$$\mu_i = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i)(X_j-\mu_i)^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$
where Σ is the covariance matrix, X_j is the sample data, N is the total number of samples, and γ_ij is the posterior probability defined over the hidden features; and

in the initial classification model, performing label prediction on the unlabeled sample data through the preset expectation step algorithm, where the formula of the preset expectation step algorithm is as follows:
$$\gamma_{ij} = \frac{\pi_i\,\mathcal{N}(X_j \mid \mu_i, \Sigma_i)}{\sum_{k=1}^{K}\pi_k\,\mathcal{N}(X_j \mid \mu_k, \Sigma_k)}$$
where π_i is the mixing coefficient.
19. The computer-readable storage medium according to claim 18, wherein, when the computer instructions are run on the computer to optimize the initial classification model according to the prediction results in combination with the preset maximization step algorithm, the computer is caused to execute the following steps:

obtaining the formulas of the preset maximization step algorithm, which are as follows:
$$\mu_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}X_j}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\Sigma_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}(X_j-\mu_i^{\text{new}})(X_j-\mu_i^{\text{new}})^{\mathsf{T}}}{\sum_{j=1}^{N}\gamma_{ij}}$$

$$\pi_i^{\text{new}} = \frac{\sum_{j=1}^{N}\gamma_{ij}}{N}$$
and, according to the prediction results, updating the initial parameters of the initial classification model based on the formulas.
20. The computer-readable storage medium according to claim 16, wherein, when the computer instructions are run on the computer to input the continuous features and the continuous features obtained by converting the discrete features into the auto-encoding algorithm for dimensionality reduction to obtain the hidden features corresponding to the sample data, the computer is caused to execute the following steps:

inputting the continuous features into the auto-encoding algorithm for dimensionality reduction to obtain initial hidden features; and

decoding the initial hidden features to obtain the hidden features.
PCT/CN2019/118247 2019-09-03 2019-11-14 Classification model training method, apparatus and device, and computer-readable storage medium WO2021042556A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910826406.8A CN110705592B (en) 2019-09-03 2019-09-03 Classification model training method, device, equipment and computer readable storage medium
CN201910826406.8 2019-09-03

Publications (1)

Publication Number Publication Date
WO2021042556A1 (en)

Family

ID=69193385

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/118247 WO2021042556A1 (en) 2019-09-03 2019-11-14 Classification model training method, apparatus and device, and computer-readable storage medium

Country Status (2)

Country Link
CN (1) CN110705592B (en)
WO (1) WO2021042556A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113112346A (en) * 2021-04-30 2021-07-13 平安普惠企业管理有限公司 User classification method and device, electronic equipment and storage medium
CN113569067A (en) * 2021-07-27 2021-10-29 深圳Tcl新技术有限公司 Label classification method and device, electronic equipment and computer readable storage medium
CN113642635A (en) * 2021-08-12 2021-11-12 百度在线网络技术(北京)有限公司 Model training method and device, electronic device and medium
CN113743464A (en) * 2021-08-02 2021-12-03 昆明理工大学 Continuous characteristic discretization loss information compensation method and application thereof
CN114722943A (en) * 2022-04-11 2022-07-08 深圳市人工智能与机器人研究院 Data processing method, device and equipment
CN114742291A (en) * 2022-03-30 2022-07-12 阿里巴巴(中国)有限公司 Yaw rate prediction method, device, apparatus, readable storage medium, and program product

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111583015A (en) * 2020-04-09 2020-08-25 上海淇毓信息科技有限公司 Credit application classification method and device and electronic equipment
CN113626469B (en) * 2020-05-08 2023-10-13 中国电信股份有限公司 Internet of things equipment matching method and device
CN111611388A (en) * 2020-05-29 2020-09-01 北京学之途网络科技有限公司 Account classification method, device and equipment
CN111898738A (en) * 2020-07-30 2020-11-06 北京智能工场科技有限公司 Mobile terminal user gender prediction method and system based on full-connection neural network
CN113326889A (en) * 2021-06-16 2021-08-31 北京百度网讯科技有限公司 Method and apparatus for training a model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8086549B2 (en) * 2007-11-09 2011-12-27 Microsoft Corporation Multi-label active learning
CN105930934A (en) * 2016-04-27 2016-09-07 北京物思创想科技有限公司 Prediction model demonstration method and device and prediction model adjustment method and device
CN107808246A (en) * 2017-10-26 2018-03-16 上海维信荟智金融科技有限公司 The intelligent evaluation method and system of collage-credit data
CN109492093A (en) * 2018-09-30 2019-03-19 平安科技(深圳)有限公司 File classification method and electronic device based on gauss hybrid models and EM algorithm
CN109902662A (en) * 2019-03-20 2019-06-18 中山大学 A kind of pedestrian recognition methods, system, device and storage medium again
US10354205B1 (en) * 2018-11-29 2019-07-16 Capital One Services, Llc Machine learning system and apparatus for sampling labelled data
CN110166454A (en) * 2019-05-21 2019-08-23 重庆邮电大学 A kind of composite character selection intrusion detection method based on self-adapted genetic algorithm
CN110163261A (en) * 2019-04-28 2019-08-23 平安科技(深圳)有限公司 Unbalanced data disaggregated model training method, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104346372B (en) * 2013-07-31 2018-03-27 国际商业机器公司 Method and apparatus for assessment prediction model


Also Published As

Publication number Publication date
CN110705592B (en) 2024-05-14
CN110705592A (en) 2020-01-17


Legal Events

Code Description

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19944218; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 19944218; Country of ref document: EP; Kind code of ref document: A1)
Kind code of ref document: A1