CN114997322A - Semi-supervised model-based classification prediction method, equipment and storage medium - Google Patents

Info

Publication number
CN114997322A
Authority
CN
China
Prior art keywords
sample set
sample
trained
semi
labeled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210688575.1A
Other languages
Chinese (zh)
Inventor
萧梓健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Life Insurance Company of China Ltd
Original Assignee
Ping An Life Insurance Company of China Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Life Insurance Company of China Ltd filed Critical Ping An Life Insurance Company of China Ltd
Priority to CN202210688575.1A priority Critical patent/CN114997322A/en
Publication of CN114997322A publication Critical patent/CN114997322A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a classification prediction method, equipment and a storage medium based on a semi-supervised model, relating to the technical field of computers. The method comprises: setting, according to a preset label setting algorithm, a pseudo label for each unlabeled sample in the unlabeled sample set of a first sample set to obtain a second labeled sample set; combining the second labeled sample set and the first labeled sample set of the first sample set into a second sample set; dividing the second sample set multiple times and combining the samples to be trained obtained by random sampling in each division to obtain a sample set to be trained; performing classification training on a preset semi-supervised model based on the sample set to be trained to obtain a trained classification prediction model; and performing classification prediction on acquired platform service data through the classification prediction model to obtain a classification score corresponding to the platform service data. With the method, the equipment and the storage medium, the classification prediction model provided by the embodiment of the invention has better compatibility and higher applicability in actual application scenarios.

Description

Semi-supervised model based classification prediction method, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a classification prediction method, equipment and a storage medium based on a semi-supervised model.
Background
Most existing machine learning models are supervised models trained on labeled data sets, but real business data often contains a large amount of unlabeled data, from which supervised models cannot learn effectively. For such cases, the related art performs semi-supervised training so that the model is improved by also using the unlabeled data set. However, semi-supervised models generally depend on specific types of data or are strongly affected by bias in the introduced data, so that the trained semi-supervised model is difficult to use in a real production environment.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the invention provides a classification prediction method, equipment and a storage medium based on a semi-supervised model, which can improve the applicability of the classification prediction model in the actual application scene.
In a first aspect, an embodiment of the present invention provides a classification prediction method based on a semi-supervised model, including:
acquiring a first sample set, wherein the first sample set comprises a first labeled sample set and an unlabeled sample set;
according to a preset label setting algorithm, setting a pseudo label for each unlabeled sample in the unlabeled sample set to obtain a second labeled sample set;
combining the first labeled sample set and the second labeled sample set into a second sample set;
dividing the second sample set for multiple times, and combining multiple samples to be trained obtained by random sampling in each division processing to obtain a sample set to be trained;
performing classification training on a preset semi-supervised model based on the sample set to be trained to obtain a trained classification prediction model;
and carrying out classification prediction on the acquired platform service data through the classification prediction model to obtain a classification score corresponding to the platform service data.
According to some embodiments of the first aspect of the present invention, the setting, according to a preset label setting algorithm, a pseudo label for each unlabeled sample in the unlabeled sample set to obtain a second labeled sample set includes:
determining an initial label of the unlabeled sample set according to the label information of the first labeled sample set;
and setting a pseudo label for each unlabeled sample according to the initial label to obtain a second labeled sample set.
According to some embodiments of the first aspect of the present invention, after obtaining the second set of annotated samples, the method further comprises:
acquiring service associated data of each second labeling sample in the second labeling sample set;
and adjusting the pseudo label of the second labeled sample according to the business correlation of the business correlation data and a preset business target, so that the value of the pseudo label of the second labeled sample with high business correlation is larger than the value of the pseudo label of the second labeled sample with low business correlation.
According to some embodiments of the first aspect of the present invention, the adjusting the pseudo label of the second labeled sample according to the service correlation between the service related data and a preset service target includes:
performing correlation analysis on the service correlation data and a preset service target to obtain service correlation degrees corresponding to the second labeled samples;
sequencing the second labeled samples according to the descending numerical value of the business relevance of the second labeled samples;
and adjusting the value of the pseudo label of each second labeled sample according to the sequence of the second labeled sample.
According to some embodiments of the first aspect of the present invention, the obtaining the business related data of each second labeled sample in the second labeled sample set includes:
acquiring service data corresponding to each second labeled sample in a preset time period;
and summarizing the service data to serve as service associated data of the corresponding second labeling sample.
According to some embodiments of the first aspect of the present invention, the dividing the second sample set for multiple times, and combining multiple samples to be trained obtained by random sampling at each dividing process to obtain a sample set to be trained includes:
determining a value of a random seed used to partition the second sample set to be K;
selecting K sample data from the second sample set as a sample to be trained respectively until the second sample set is divided into a preset number of samples to be trained;
changing the value of the random seed, and re-dividing the second sample set by the changed random seed until the number of the samples to be trained meets a preset sample number;
and combining a plurality of samples to be trained obtained by dividing the second sample set to obtain a sample set to be trained.
According to some embodiments of the first aspect of the present invention, before obtaining the sample set to be trained, the classification prediction method further includes:
performing multi-level target division processing on the first labeled sample set, so that each first labeled sample in the first labeled sample set corresponds to a level target; wherein a higher value of the hierarchical target indicates a higher correlation with a preset business target;
correspondingly, the classification training of the preset semi-supervised model based on the sample set to be trained to obtain the trained classification prediction model comprises:
and carrying out classification training on a preset semi-supervised model based on the sample set to be trained and the hierarchical target corresponding to each first labeling sample to obtain a trained classification prediction model.
According to some embodiments of the first aspect of the present invention, before obtaining the sample set to be trained, the classification prediction method further includes:
performing multi-level target division processing on the second labeled sample set, so that each second labeled sample in the second labeled sample set corresponds to a predicted level target;
correspondingly, the classification training is performed on a preset semi-supervised model based on the sample set to be trained and the level target corresponding to each labeled sample, so as to obtain a trained classification prediction model, and the method comprises the following steps:
and carrying out classification training on a preset semi-supervised model based on the sample set to be trained, the hierarchical target corresponding to each first labeling sample and the hierarchical target corresponding to each second labeling sample to obtain a trained classification prediction model.
In a second aspect, an embodiment of the present invention further provides an electronic device, including: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions for execution by the at least one processor to cause the at least one processor, when executing the instructions, to implement the semi-supervised model based classification prediction method as recited in any one of the first aspects.
In a third aspect, an embodiment of the present invention further provides a computer-readable storage medium, which stores computer-executable instructions, where the computer-executable instructions are configured to execute the semi-supervised model based classification prediction method described in any one of the first aspects.
The above embodiment of the invention has at least the following beneficial effects: by randomly sampling the second sample set during multiple division processing, the proportion of the first labeled sample of each sample to be trained in the sample set to be trained to the unlabeled sample provided with the pseudo label is random, so that the sample set to be trained for training is more diverse, and the dependency of the semi-supervised model on specific data in the training process is reduced. Meanwhile, by setting the pseudo labels of the unlabeled samples, the semi-supervised model can manage each second labeled sample in the second sample set through the label value, the training efficiency is improved, and then more sample sets to be trained can be adopted for training, so that the classification prediction model obtained by training based on the embodiment of the invention has better compatibility, can be deployed in a real production environment, and is more suitable for classification prediction of an actual application scene.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the example serve to explain the principles of the invention and not to limit the invention.
FIG. 1 is a flow chart of a semi-supervised model based classification prediction method according to an embodiment of the present invention;
FIG. 2 is a schematic composition diagram of a second sample set according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of fine setting of pseudo labels in a classification prediction method based on a semi-supervised model according to an embodiment of the present invention;
FIG. 4 is a flow chart illustrating the sub-steps of step S400 in the classification prediction method based on semi-supervised model according to the embodiment of the present invention;
FIG. 5 is a schematic composition diagram of a sample set to be trained in the classification prediction method based on the semi-supervised model according to the embodiment of the present invention;
FIG. 6 is a block diagram of an apparatus corresponding to a classification prediction method based on a semi-supervised model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the embodiments of the disclosure can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and the like. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
In each embodiment of the present invention, when data related to the identity or the characteristic of the user, such as user information, user behavior data, user history data, and user location information, is processed, permission or consent of the user is obtained, and the data collection, use, and processing, etc., comply with relevant laws and regulations and standards of relevant countries and regions. In addition, when the embodiment of the present invention needs to acquire sensitive personal information of a user, individual permission or individual consent of the user is obtained through a pop-up window or a jump to a confirmation page, and after the individual permission or individual consent of the user is definitely obtained, necessary user-related data for enabling the embodiment of the present invention to normally operate is acquired.
The following is an explanation of some terms used in the present invention.
Semi-supervised learning: semi-supervised learning lies between supervised and unsupervised learning. Semi-supervised models aim to use a small amount of labeled training data together with a large amount of unlabeled training data. It is typically used where labeled data is expensive to obtain and/or data arrives as a constant stream.
Supervised training: an optimal model is obtained by training on existing training samples (namely, known data and their corresponding outputs); the model belongs to a certain set of functions, and "optimal" means optimal under a certain evaluation criterion. The model is then used to map all inputs to corresponding outputs, and a simple judgment on the outputs achieves classification, which gives the model the ability to classify unknown data. Typical examples are KNN and SVM.
Unsupervised training: this is the other common kind of training. It differs from supervised training in that no training samples are available in advance, so the data must be modeled directly. A typical example of unsupervised training is clustering.
Most existing machine learning models are supervised models trained on labeled data sets, but real business data often contains a large amount of unlabeled data, from which supervised models cannot learn effectively. For such cases, the related art performs semi-supervised training so that the model is improved by also using the unlabeled data set. However, semi-supervised models generally depend on specific types of data or are strongly affected by bias in the introduced data, so that the trained semi-supervised model is difficult to use in a real production environment. Based on this, the embodiment of the invention provides a classification prediction method, equipment and a storage medium based on a semi-supervised model, which can improve the applicability of the classification prediction model in actual application scenarios.
In a first aspect, referring to fig. 1, an embodiment of the present invention provides a classification prediction method based on a semi-supervised model, including:
Step S100, a first sample set is obtained, wherein the first sample set comprises a first labeled sample set and an unlabeled sample set.
It should be noted that the first labeled sample set includes a plurality of first labeled samples whose label values have been set, and the unlabeled sample set includes a plurality of unlabeled samples whose label values have not been set.
Step S200, setting a pseudo label for each unlabeled sample in the unlabeled sample set according to a preset label setting algorithm to obtain a second labeled sample set.
It should be noted that by setting the pseudo label, the unlabeled sample set can be used to train the semi-supervised model, and the trained classification prediction model can classify unlabeled service data.
It should be noted that the label setting algorithm may be a neural network-based model, a clustering algorithm, or the like; the embodiments of the present application are not limited thereto. It should also be noted that the pseudo label value of an unlabeled sample is different from the label value of a labeled sample.
Step S300, combining the first labeled sample set and the second labeled sample set into a second sample set.
It should be noted that the first labeled samples in the first labeled sample set and the second labeled samples (corresponding to the unlabeled samples) in the second labeled sample set may be combined in order of label value to obtain the second sample set, or the first labeled samples and the second labeled samples may be randomly mixed to obtain a second sample set combined in an out-of-order manner. The embodiments of the present invention are not limited in this respect, and details are not described herein.
Illustratively, the composition of the second sample set is shown with reference to FIG. 2.
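The two ways of forming the second sample set described for step S300 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_second_sample_set`, the `mode` and `seed` parameters, and the representation of a sample as a `(features, label)` tuple are all assumptions made for the example.

```python
import random


def build_second_sample_set(first_labeled, second_labeled, mode="by_label", seed=0):
    """Merge both labeled sets into one second sample set.

    Each sample is assumed to be a (features, label) tuple; this layout is
    hypothetical and chosen only for illustration.
    """
    merged = list(first_labeled) + list(second_labeled)
    if mode == "by_label":
        # Combine according to the size of the label value.
        return sorted(merged, key=lambda sample: sample[1])
    if mode == "shuffled":
        # Randomly mix the samples to get an out-of-order combination.
        rng = random.Random(seed)
        rng.shuffle(merged)
        return merged
    raise ValueError("mode must be 'by_label' or 'shuffled'")


first = [("a", 2), ("b", 1)]          # first labeled samples
second = [("u1", 4), ("u2", 3)]       # pseudo-labeled samples
ordered = build_second_sample_set(first, second, mode="by_label")
mixed = build_second_sample_set(first, second, mode="shuffled", seed=7)
```

Either result contains exactly the same samples; only their order differs.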
Step S400, dividing the second sample set multiple times, and combining the samples to be trained obtained by random sampling in each division to obtain a sample set to be trained.
It should be noted that a division means that the second sample set is randomly sampled repeatedly until the samples remaining in the second sample set can no longer be combined into a new sample to be trained, or until the number of samples to be trained reaches a preset number. Because each sample to be trained is obtained by random sampling, the proportion of unlabeled samples to labeled samples differs from one sample to be trained to another, and so do the label values it contains. Dividing the second sample set multiple times therefore acts as sample enhancement, so that the sample set to be trained is both larger and more diverse.
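The repeated division in step S400 can be sketched as below. This is only one possible reading under stated assumptions: the group size and the random seed are kept as separate, hypothetical parameters (`group_size`, `first_seed`), the seed is changed by incrementing it, and leftover samples that cannot fill a whole group are dropped.

```python
import random


def divide_once(second_sample_set, group_size, seed):
    """One division: shuffle with the given seed and cut into full groups."""
    rng = random.Random(seed)
    shuffled = list(second_sample_set)
    rng.shuffle(shuffled)
    groups = [shuffled[i:i + group_size]
              for i in range(0, len(shuffled), group_size)]
    # Remaining samples that cannot form a complete group are discarded.
    return [g for g in groups if len(g) == group_size]


def build_training_set(second_sample_set, group_size, target_groups, first_seed=0):
    """Repeat the division with a changed seed until enough groups exist."""
    if not 0 < group_size <= len(second_sample_set):
        raise ValueError("group_size must be between 1 and the set size")
    training_set = []
    seed = first_seed
    while len(training_set) < target_groups:
        training_set.extend(divide_once(second_sample_set, group_size, seed))
        seed += 1  # assumption: the seed is changed by simple increment
    return training_set[:target_groups]


samples = list(range(10))
training_set = build_training_set(samples, group_size=3, target_groups=7)
```

Because every division reshuffles the whole second sample set, the mix of labeled and pseudo-labeled samples varies from group to group.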
Step S500, performing classification training on a preset semi-supervised model based on the sample set to be trained to obtain a trained classification prediction model.
It should be noted that the embodiment of the present invention does not limit the choice of semi-supervised model. A person skilled in the art can choose a suitable semi-supervised model for training based on common general knowledge, such as a pairwise-based model; preferably, the LambdaRank model is adopted in the embodiment of the present invention.
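To illustrate why ordered label values matter for a pairwise-based model, the following sketch computes a plain pairwise logistic loss over one sample to be trained. It is an assumption-laden simplification: this is the RankNet-style pairwise loss, not LambdaRank itself (which additionally weights each pair by the change in a ranking metric), and the function name `pairwise_logistic_loss` is hypothetical.

```python
import math


def pairwise_logistic_loss(scores, labels):
    """Average logistic loss over all pairs (i, j) with labels[i] > labels[j].

    The loss is small when the model scores the higher-labeled sample above
    the lower-labeled one. Illustrative simplification only.
    """
    total, pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                diff = scores[i] - scores[j]
                # Numerically stable form of log(1 + exp(-diff)).
                total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
                pairs += 1
    return total / pairs if pairs else 0.0


labels = [3, 2, 1]
good = pairwise_logistic_loss([2.0, 1.0, 0.0], labels)  # scores agree with labels
bad = pairwise_logistic_loss([0.0, 1.0, 2.0], labels)   # scores reversed
```

Under this loss, only the relative order of label values within a group matters, which is consistent with using integer labels and pseudo labels as ranks.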
Step S600, performing classification prediction on the collected platform service data through the classification prediction model to obtain a classification score corresponding to the platform service data.
It should be noted that the platform business data is the business data, generated on the system to which the classification prediction model is applied, that is used for prediction and evaluation. The classification score is the basis for binary classification of the business data. When more than one object needs to be classified, multiple classification scores are output; when only one object needs to be classified, one classification score is output. For example, suppose the classification prediction model predicts the type of students and the business target is how excellent a student is. For students A to E to be classified, the platform business data comprises the business data generated by students A to E, and the classification prediction model outputs the classification score of each student: A: 10, B: 5, C: 4, D: 6 and E: 8. The students can then be divided into two classes based on the distribution of the classification scores: if a classification score of 6 or more indicates an excellent student, then students A, D and E are excellent students and the others are general students.
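The student example can be reproduced with a few lines of code. This is a minimal sketch: the function name `split_by_score` and the dictionary layout are assumptions, while the scores and the threshold of 6 are taken from the example above.

```python
def split_by_score(scores, threshold):
    """Binary division of objects by classification score."""
    excellent = [name for name, score in scores.items() if score >= threshold]
    general = [name for name, score in scores.items() if score < threshold]
    return excellent, general


# Classification scores output by the model for students A to E.
scores = {"A": 10, "B": 5, "C": 4, "D": 6, "E": 8}
excellent, general = split_by_score(scores, threshold=6)
print(excellent)  # ['A', 'D', 'E']
print(general)    # ['B', 'C']
```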
Therefore, by randomly sampling the second sample set during multiple division processing, the proportion of the first labeled sample of each sample to be trained in the sample set to be trained to the unlabeled sample provided with the pseudo label is random, so that the sample set to be trained for training is more diverse, and the dependency of the semi-supervised model on specific data in the training process is reduced. Meanwhile, by setting the pseudo labels of the unlabeled samples, the semi-supervised model can manage each second sample in the second sample set through the label value, the training efficiency is improved, and then more sample sets to be trained can be adopted for training, so that the classification prediction model obtained by training based on the embodiment of the invention has better compatibility, can be deployed in a real production environment, and is more suitable for classification prediction of an actual application scene.
It can be understood that, in step S200, setting a pseudo label for each unlabeled sample in the unlabeled sample set according to a preset label setting algorithm to obtain a second labeled sample set includes: determining an initial label of the unlabeled sample set according to the label information of the first labeled sample set; and setting a pseudo label for each unlabeled sample according to the initial label to obtain the second labeled sample set.
It should be noted that the label value of each first labeled sample in the first labeled sample set is a positive integer, and each pseudo label value is also set to a positive integer; the pseudo label values differ from the label values of the first labeled samples, and the pseudo label value of each unlabeled sample is different. Setting the label values of both the first labeled samples and the unlabeled samples to positive integers facilitates management. It should be noted that the initial label of the unlabeled sample set may be determined according to the maximum label value of the first labeled samples (i.e., the label information corresponding to the first labeled sample set).
It should be noted that the value of the pseudo label may be a continuously changing positive integer or a regularly changing positive integer, and different unlabeled samples are distinguished by different values of the pseudo label. The embodiments of the present invention are not limited thereto.
For example, taking the value of the pseudo label as a continuously changing positive integer, assume that the lower-limit label value for the unlabeled samples, determined according to the label information of the first labeled sample set, is L_s. In this case, the initial label of the unlabeled samples is L_s + 1, so the pseudo label of the nth unlabeled sample in the unlabeled sample set is L_s + n. Accordingly, following the order of the unlabeled samples in the unlabeled sample set, a second labeled sample set with label values L_s + 1 to L_s + m can be initialized, where m is the number of unlabeled samples.
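The consecutive pseudo-label assignment in this example can be sketched as follows. The sketch assumes, as one option mentioned above, that the lower limit is taken to be the maximum label value of the first labeled samples; the function name `assign_pseudo_labels` is hypothetical.

```python
def assign_pseudo_labels(first_labels, unlabeled_samples):
    """Give the nth unlabeled sample the pseudo label l_s + n.

    Assumption: l_s is the maximum label value among the first labeled
    samples, so every pseudo label differs from every real label.
    """
    l_s = max(first_labels)
    return [(sample, l_s + n)
            for n, sample in enumerate(unlabeled_samples, start=1)]


first_labels = [1, 2, 3]
second_labeled = assign_pseudo_labels(first_labels, ["u1", "u2", "u3"])
print(second_labeled)  # [('u1', 4), ('u2', 5), ('u3', 6)]
```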
It can be understood that, referring to fig. 3, after step S200, the semi-supervised model based classification prediction method further includes:
Step S700, obtaining service associated data of each second labeled sample in the second labeled sample set.
It should be noted that the service associated data is data information related to user behavior on the application system. For example, an agent case tracking system may generate data related to an agent's case writing and data related to the agent's identity information. Case-writing data includes, for example, the case writing amount, return amount and return time; agent identity information includes, for example, certificate status. The case writing amount, return time and the like over a period of time are then summarized as the service associated data. When the set business target is agent grade classification, the semi-supervised model can be trained based on the service associated data, and the classification grade corresponding to each agent can then be continuously updated and predicted by the classification prediction model, so that the agents are classified.
Step S800, according to the business correlation between the business correlation data and the preset business target, the pseudo label of the second labeled sample is adjusted, so that the value of the pseudo label of the second labeled sample with high business correlation is larger than the value of the pseudo label of the second labeled sample with low business correlation.
It should be noted that the service correlation corresponds to a service correlation interval of the service target, and the service data is divided according to these intervals. For example, a service correlation in the 30% to 40% interval is lower, and a service correlation in the 50% to 60% interval is higher. In the embodiment of the present invention, the value of the pseudo label corresponding to the 50% to 60% correlation interval is greater than the value of the pseudo label corresponding to the 30% to 40% correlation interval.
It should be noted that, in the embodiment of the present invention, the value of the pseudo tag of the second labeled sample is associated with the business relevance, so as to implement fine setting of the pseudo tag, thereby reducing interference of the business data with low relevance to the business target, and implementing more accurate prediction classification while optimizing the training efficiency of the semi-supervised model.
It can be understood that, in step S800, the adjusting the pseudo tag of the second labeled sample according to the business correlation between the business related data and the preset business target includes: performing correlation analysis on the service correlation data and a preset service target to obtain service correlation degrees corresponding to the second labeled samples; sequencing the second labeled samples according to the descending numerical value of the business relevance of the second labeled samples; and adjusting the value of the pseudo label of each second labeled sample according to the sequence of each second labeled sample.
It should be noted that the correlation analysis may be a clustering algorithm or a neural network model to obtain the association degree with the preset service target. For example, based on certain sample data, the preset neural network model is trained for service correlation, so that the input service correlation data can be divided for service correlation. In other embodiments, the service-related data may also be clustered to obtain multiple classified data, and the relevance of the multiple classified data is set. The embodiments of the present invention are not limited thereto.
It should be noted that after the second labeled samples are sorted according to business correlation, the business correlation interval corresponding to each second labeled sample can be quickly determined, and from that it can be determined which second labeled samples need their pseudo label values adjusted. It should be noted that the value of the pseudo label may be continuous or discrete, so either of two approaches may be used during adjustment. The pseudo label value of a second labeled sample that needs adjustment may be directly set to a larger value, so that pseudo label values with low correlation are smaller than those with high correlation. Alternatively, the pseudo label value of a second labeled sample with high business correlation may be directly exchanged with that of a second labeled sample with low business correlation; after multiple rounds of exchange, the pseudo label values of second labeled samples with high business correlation are higher than those of second labeled samples with low business correlation.
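The sort-and-adjust procedure described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `adjust_pseudo_labels`, the dictionary keys `relevance` and `pseudo_label`, and the choice to redistribute the existing label values in sorted order (one possible way of realizing the exchange approach) are all hypothetical.

```python
from typing import Dict, List


def adjust_pseudo_labels(samples: List[Dict[str, float]]) -> List[Dict[str, float]]:
    """Return copies of the samples, sorted by descending business correlation,
    whose pseudo label values are non-increasing along that order.

    Hypothetical helper: the existing pseudo label values are only
    redistributed, which mimics repeated pairwise exchange.
    """
    # Sort the second labeled samples by descending business correlation.
    ranked = sorted(samples, key=lambda s: s["relevance"], reverse=True)
    # Sort the existing pseudo label values from largest to smallest.
    values = sorted((s["pseudo_label"] for s in samples), reverse=True)
    # Give the largest value to the most relevant sample, and so on.
    return [{**s, "pseudo_label": v} for s, v in zip(ranked, values)]


samples = [
    {"relevance": 0.35, "pseudo_label": 0.9},
    {"relevance": 0.55, "pseudo_label": 0.2},
    {"relevance": 0.10, "pseudo_label": 0.5},
]
adjusted = adjust_pseudo_labels(samples)
```

In this sketch the sample with correlation 0.55 ends up with the largest pseudo label value, so the ordering of pseudo label values follows the ordering of business correlation.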
Illustratively, still taking the agent management system as an example, among agents who are all retained within 3 months, an agent who was converted to regular status within 3 months is more relevant to the business target of agent rating than an agent who was not converted within 3 months. Therefore, the pseudo labels T_s1 of agents converted to regular status within 3 months are all greater than the pseudo labels T_s2 of agents not converted within 3 months.
It can be understood that the obtaining of the service related data of each second labeled sample in the second labeled sample set includes: acquiring service data corresponding to each second labeled sample in a preset time period; and summarizing the service data to be used as service associated data of the corresponding second labeling sample.
It should be noted that the service data acquired within the preset time period is service data generated in the actual application process, so the semi-supervised model can be trained on the actual application scenario through this service data, and the training result is closer to the real situation.
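Acquiring the service data of each sample within a preset time period and summarizing it can be sketched as below. This is a hedged illustration only: the record layout `(sample id, date, value)`, the function name `summarize_service_data`, and the use of a plain sum as the summary are assumptions, since the embodiment does not specify how the data is summarized.

```python
from datetime import date
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (sample id, event date, numeric service value).
Record = Tuple[str, date, float]


def summarize_service_data(records: Iterable[Record], start: date, end: date) -> Dict[str, float]:
    """Sum each sample's service values whose date lies in [start, end]."""
    totals: Dict[str, float] = {}
    for sample_id, day, value in records:
        # Keep only the records inside the preset time period.
        if start <= day <= end:
            totals[sample_id] = totals.get(sample_id, 0.0) + value
    return totals


records = [
    ("agent_a", date(2022, 1, 10), 3.0),
    ("agent_a", date(2022, 2, 5), 2.0),
    ("agent_b", date(2022, 1, 20), 4.0),
    ("agent_a", date(2022, 6, 1), 9.0),  # outside the window, ignored
]
summary = summarize_service_data(records, date(2022, 1, 1), date(2022, 3, 31))
```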
It should be noted that, for any system, the first labeled sample set may be generated by labeling a small amount of the generated service data, and the first sample set is then obtained by combining it with the remaining unlabeled service data.
It should be noted that, taking the client system as an example, the service data generated by a newly added user can be regarded as unlabeled data, and corresponding unlabeled samples are generated from it.
It should be noted that the preset time period may be dynamically set by a user, or may be an observation period set according to a service data generation amount in an actual system; the embodiments of the present invention are not limited thereto.
It should be noted that, when the trained model is applied to a system, the service data within the preset time period is collected, pseudo labels are set for it, and it is input into the classification prediction model. The model predicts the hierarchical information corresponding to the service data, and the binary classification information corresponding to the service data is then judged according to the hierarchical information.
It can be understood that, referring to fig. 4, in step S400, performing multiple division processes on the second sample set, and combining multiple samples to be trained obtained by random sampling in each division process to obtain a sample set to be trained, the method includes:
Step S410, determining the value of the random seed used for dividing the second sample set to be K.
Step S420, selecting K sample data at a time from the second sample set as one sample to be trained, until the second sample set is divided into a preset number of samples to be trained.
Step S430, changing the value of the random seed, and re-dividing the second sample set with the changed random seed until the number of samples to be trained reaches the preset sample number.
Step S440, combining the plurality of samples to be trained obtained by dividing the second sample set to obtain a sample set to be trained.
It should be noted that the preset number may be set by a user, or may be set as the ratio of the number of samples in the second sample set to K. For example, if the number of samples in the second sample set is 100 and K is 7, the preset number is 15, where each sample to be trained contains 7 sample data. Preferably, in the embodiment of the present invention, the ratio of the number of samples to K is rounded up to an integer (100/7 ≈ 14.3, which gives 15).
It should be noted that, when N divisions need to be performed, the second sample set is copied N times to obtain N second sample sets. Each second sample set then corresponds to one random seed K, and the values of the random seeds differ from one another. For example, referring to fig. 5, the samples to be trained obtained from the N divisions are combined to obtain the sample set to be trained.
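Steps S410 to S440 can be sketched as follows. This is a minimal illustration under assumptions, not the embodiment's implementation: the function names are hypothetical, samples are represented by integers, each copy is shuffled with Python's `random.Random` seeded by K, and the last group of a division may hold fewer than K samples when the set size is not a multiple of K.

```python
import math
import random
from typing import List, Sequence


def divide_once(samples: Sequence[int], k: int) -> List[List[int]]:
    """Shuffle a copy of the samples with seed k and cut it into groups of k."""
    rng = random.Random(k)      # the seed value K also serves as the group size
    shuffled = list(samples)    # work on a copy of the second sample set
    rng.shuffle(shuffled)
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]


def build_training_set(samples: Sequence[int], seeds: Sequence[int]) -> List[List[int]]:
    """Repeat the division once per seed and combine all resulting groups."""
    groups: List[List[int]] = []
    for k in seeds:
        groups.extend(divide_once(samples, k))
    return groups


second_sample_set = list(range(100))
# 100 samples with K = 7 give ceil(100 / 7) = 15 groups in one division.
expected_groups = math.ceil(len(second_sample_set) / 7)
groups = build_training_set(second_sample_set, seeds=[7, 5, 10])
```

Because each division uses a different seed, the same sample appears in differently composed groups across divisions, which is the data-enhancement effect the repeated division aims at.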
It can be understood that, before obtaining the sample set to be trained, the classification prediction method further includes:
performing multi-level target division processing on the first labeled sample set to enable each first labeled sample in the first labeled sample set to correspond to a level target; wherein, the higher the value of the hierarchical target, the higher the correlation with the preset business target;
correspondingly, step S500, performing classification training on a preset semi-supervised model based on a sample set to be trained to obtain a trained classification prediction model, including:
and carrying out classification training on the preset semi-supervised model based on the sample set to be trained and the hierarchical target corresponding to each first labeling sample to obtain a trained classification prediction model.
It should be noted that the first labeled sample set is provided with a plurality of hierarchical targets; the first labeled samples are set at a fine granularity so that first labeled samples highly correlated with the business target are emphasized. Therefore, when the semi-supervised model is trained, the business correlation of the second labeled samples can be quickly adjusted according to the hierarchical targets and the first labeled samples, which determines the level division of the second labeled samples and thereby realizes classification.
It should be noted that the hierarchical target represents the correlation between a sample and the business target and corresponds to a business correlation interval of the business target. Its division may be the same as or different from the division of the business correlation intervals of the pseudo label; for example, the hierarchical target may be provided with 5 intervals while the pseudo label is finely set according to 6 intervals. In that case, because the correlation intervals of the pseudo label are more finely divided, changes in the pseudo label value can still be mapped onto the hierarchical targets, so the pseudo label values can be set while the training efficiency of the semi-supervised model is improved.
It should be noted that the number of hierarchical targets is not limited in the embodiment of the present application. Taking the example in which the hierarchical target is set to 5 levels, the hierarchical target is divided into five sections of high, higher, middle, lower and low, and the corresponding correlation intervals are [95%, 100%], [85%, 95%), [65%, 85%), [45%, 65%), [15%, 45%) and [0, 15%), respectively. Preferably, in the embodiment of the present invention, the correlation intervals set for the hierarchical target and for the pseudo label are kept consistent.
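Mapping a correlation degree to a hierarchical level through fixed intervals can be sketched as below. This is an illustrative assumption rather than the embodiment's method: it uses the interval boundaries listed above (15%, 45%, 65%, 85%, 95%), and the function name `level_of` and the numbering of levels from 1 (lowest interval) upward are hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the correlation intervals above the lowest one, in percent.
BOUNDARIES = [15, 45, 65, 85, 95]


def level_of(correlation_percent: float) -> int:
    """Map a correlation in [0, 100] to a level; a higher level means a higher correlation."""
    if not 0 <= correlation_percent <= 100:
        raise ValueError("correlation must lie in [0, 100]")
    # bisect_right counts how many boundaries are <= the value, so a value
    # equal to a boundary falls into the upper (left-closed) interval.
    return bisect_right(BOUNDARIES, correlation_percent) + 1


levels = [level_of(c) for c in (0, 14.9, 15, 50, 94.9, 95, 100)]
```

A sorted boundary list with binary search keeps the interval definition in one place, so the same boundaries can be reused for both the hierarchical target and the pseudo label when the two are kept consistent.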
It can be understood that, before obtaining the sample set to be trained, the classification prediction method further includes:
performing multi-level target division processing on the second labeled sample set, so that each second labeled sample in the second labeled sample set corresponds to a predicted level target;
correspondingly, step S500, based on the sample set to be trained and the level target corresponding to each labeled sample, performing classification training on a preset semi-supervised model to obtain a trained classification prediction model, including:
and carrying out classification training on the preset semi-supervised model based on the sample set to be trained, the hierarchical target corresponding to each first labeled sample and the hierarchical target corresponding to each second labeled sample to obtain a trained classification prediction model.
It should be noted that the second labeled sample set, which is derived from the unlabeled sample set, is divided into levels and then used as input of the semi-supervised model. This improves the training efficiency of the semi-supervised model, and the semi-supervised model can adjust the hierarchical target of each second labeled sample based on the input hierarchical targets, so as to realize efficient classification. In practical applications, the output hierarchical target corresponds to the classification score in step S600.
It should be noted that the multi-level target division of the second labeled sample set is performed with reference to the first labeled sample set. Because both sets follow the same multi-level target division principle, the semi-supervised model can adjust the hierarchical target of a second labeled sample more accurately by referring to the hierarchical targets of the first labeled samples, which makes the classification of each unlabeled sample more accurate.
In summary, the semi-supervised model based classification prediction method in the embodiment of the present invention is based on multi-level pseudo labels and a semi-supervised model architecture, and can implement the construction of input data of the semi-supervised model by simply modifying a training set. Compared with other semi-supervised models, the training method is simple, does not depend on specific types of data, reduces risks caused by data deviation, and is suitable for on-line use in actual production.
Specifically, taking the smart client system as an example, the training process is as follows:
First, a training set is constructed. Referring to step S100, a first sample set is obtained from the smart client system. Multi-level targets are then divided for the first labeled samples in the first sample set; specifically, the number of hierarchical targets is set to 6, which yields a first labeled sample set divided into 6 levels. Next, referring to the substeps of step S200 and to steps S700 to S800, pseudo labels are set for the unlabeled sample set in the first sample set: the pseudo label value of each unlabeled sample is set in descending order according to the correlation interval corresponding to its business correlation, giving a second labeled sample set. The second labeled sample set is then divided into multi-level targets, so that a hierarchical target is set for each second labeled sample, and, referring to step S300, the first labeled sample set and the second labeled sample set are combined to obtain the second sample set. At this point, each second labeled sample in the second sample set corresponds to a label value and a hierarchical target. Then, referring to step S400 and its substeps, sample grouping and data enhancement are performed on the second sample set. Specifically, the second sample set is copied N times, a different random seed is set for each copy, and the random seed determines the number of samples randomly drawn from the corresponding copy each time, so that g samples to be trained are obtained for each copy, where g = Num/k, Num is the number of samples in the second sample set, and k is the value of the random seed. A preset number of samples to be trained is obtained in this manner. Finally, the semi-supervised model is trained according to step S500 to obtain the trained classification prediction model.
The classification prediction model is then published on the intelligent customer service system. When a user's data is updated, the platform service data corresponding to the user within the observation period is acquired and input into the classification prediction model, and the classification information of the user is continuously updated, so that continuous classification prediction and classification tracking are performed for the user.
At this time, the smart client system may push relevant information to the user or administrator based on the predicted classification information.
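The continuous update of a user's classification after publication can be sketched as follows. This is a hedged illustration only: `update_classification` and `toy_model` are hypothetical names, the trained classification prediction model is replaced by a placeholder that averages its inputs, and the platform service data is assumed to be a short list of numbers.

```python
from typing import Callable, Dict, Sequence

# Stand-in type for the trained classification prediction model.
Model = Callable[[Sequence[float]], float]


def update_classification(
    scores: Dict[str, float],
    user_id: str,
    platform_data: Sequence[float],
    model: Model,
) -> float:
    """Re-score one user after a data update and record the latest score."""
    score = model(platform_data)
    scores[user_id] = score  # the newest score replaces the previous one
    return score


def toy_model(features: Sequence[float]) -> float:
    """Placeholder scorer: the mean of the features (not a real trained model)."""
    return sum(features) / len(features)


scores: Dict[str, float] = {}
update_classification(scores, "user_1", [0.2, 0.4], toy_model)
update_classification(scores, "user_1", [0.8, 1.0], toy_model)  # later update overwrites
```

Keeping only the latest score per user matches the idea of classification tracking; a history of scores could be kept instead if the trend over observation periods matters.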
It should be appreciated that the present invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It can be understood that, referring to the embodiment shown in fig. 6, the embodiment of the present invention further provides an electronic device, which includes:
an acquisition module 100, configured to acquire a first sample set, where the first sample set includes a first labeled sample set and an unlabeled sample set;
a pseudo label setting module 200, configured to set a pseudo label for each unlabeled sample in the unlabeled sample set according to a preset label setting algorithm, so as to obtain a second labeled sample set;
a sample set module 300, configured to combine the first labeled sample set and the second labeled sample set into a second sample set;
a training sample obtaining module 400, configured to perform multiple division processing on the second sample set, and combine the multiple samples to be trained obtained by random sampling during each division processing to obtain a sample set to be trained;
a training module 500, configured to perform classification training on a preset semi-supervised model based on the sample set to be trained to obtain a trained classification prediction model;
and a prediction module 600, configured to perform classification prediction on the acquired platform service data through the classification prediction model to obtain a classification score corresponding to the platform service data.
It should be noted that, in some embodiments, the acquisition module 100 collects the business data of the actual application system. The pseudo label setting module 200, the sample set module 300, the training sample obtaining module 400, and the training module 500 may be remotely connected to the acquisition module 100. When the training module 500 completes training, the classification prediction model is published to the system platform where the acquisition module 100 is located, so as to perform classification prediction in practical applications.
It should be noted that, in some embodiments, the electronic device further includes a storage module configured to store the training sample set used to train the semi-supervised model. In some embodiments, the electronic device further includes a publishing module configured to publish the trained classification prediction model to a specified system platform, where real-time classification prediction is then performed through the classification prediction model.
Since the electronic device executes the semi-supervised model based classification prediction method of the first aspect, all the advantages of the semi-supervised model based classification prediction method of the first aspect are achieved.
An embodiment of the present invention further provides an electronic device, including:
at least one processor, and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions that are executed by the at least one processor to cause the at least one processor, when executing the instructions, to implement the semi-supervised model based classification prediction method as in the above-described embodiments of the present invention.
The hardware structure of the electronic device is described in detail below with reference to fig. 7. The electronic device includes: a processor 710, a memory 720, an input/output interface 730, a communication interface 740, and a bus 750.
The processor 710 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute a relevant program to implement the technical solution provided by the embodiments of the present disclosure;
the memory 720 may be implemented in the form of a ROM (Read Only Memory), a static storage device, a dynamic storage device, or a RAM (Random Access Memory). The memory 720 may store an operating system and other application programs. When the technical solution provided by the embodiments of the present disclosure is implemented by software or firmware, the relevant program code is stored in the memory 720 and called by the processor 710 to execute the semi-supervised model based classification prediction method of the embodiments of the present disclosure;
the input/output interface 730 is used for implementing information input and output;
the communication interface 740 is configured to implement communication interaction between the device and other devices, and may communicate in a wired manner (e.g., USB, network cable, etc.) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth, etc.);
and the bus 750 transfers information between the components of the device (e.g., the processor 710, the memory 720, the input/output interface 730, and the communication interface 740).
The processor 710, the memory 720, the input/output interface 730, and the communication interface 740 are communicatively coupled to each other within the device via the bus 750.
Specifically, referring to the embodiment shown in fig. 1, the processor 710 performs steps S100 to S600 to carry out the classification prediction. The processor 710 further executes the sub-steps of step S200 to initialize the pseudo labels and, referring to the embodiment shown in fig. 3, further executes steps S700 to S800 to perform fine setting of the pseudo labels.
The memory 720, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory 720 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It can be understood that the present invention also provides a computer-readable storage medium storing computer-executable instructions for performing the semi-supervised model based classification prediction method described above.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embody computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and include any information delivery media, as is known to those skilled in the art.
It should be noted that, since the computer storage medium stores instructions for performing the semi-supervised model based classification prediction method of the first aspect, it has all the advantages of that method.
Specifically, the computer storage medium may store program instructions corresponding to steps S100 to S600 shown in fig. 1; in some embodiments, it further stores program instructions corresponding to the sub-steps of step S200, and in some embodiments, program instructions corresponding to steps S700 to S800. When a terminal device loads the program instructions, training can be performed with the input first sample set to obtain a trained classification prediction model, which is then published on a system for continuously updated classification prediction.
The embodiments described herein are intended to illustrate the technical solutions of the embodiments of the present invention more clearly and do not limit them. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided in the embodiments of the present invention are equally applicable to similar technical problems.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "comprises," "comprising," and any other variation thereof, in the description of the present invention are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It is to be understood that, in the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships are possible; for example, "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the associated objects before and after it are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (10)

1. A classification prediction method based on a semi-supervised model is characterized by comprising the following steps:
acquiring a first sample set, wherein the first sample set comprises a first labeled sample set and an unlabeled sample set;
according to a preset label setting algorithm, setting a pseudo label for each unlabeled sample in the unlabeled sample set to obtain a second labeled sample set;
combining the first labeled sample set and the second labeled sample set into a second sample set;
dividing the second sample set for multiple times, and combining multiple samples to be trained obtained by random sampling in each division processing to obtain a sample set to be trained;
performing classification training on a preset semi-supervised model based on the sample set to be trained to obtain a trained classification prediction model;
and carrying out classification prediction on the collected platform service data through the classification prediction model to obtain a classification score corresponding to the platform service data.
2. The semi-supervised model based classification prediction method of claim 1, wherein the step of setting a pseudo label for each unlabeled sample in the unlabeled sample set according to a preset label setting algorithm to obtain a second labeled sample set comprises:
determining an initial label of the unlabeled sample set according to the label information of the first labeled sample set;
and setting a pseudo label for each unlabeled sample according to the initial label to obtain a second labeled sample set.
3. The semi-supervised model-based classification prediction method of claim 2, wherein after obtaining the second labeled sample set, the method further comprises:
acquiring service associated data of each second labeled sample in the second labeled sample set;
and adjusting the pseudo label of the second labeled sample according to the business correlation of the business correlation data and a preset business target, so that the value of the pseudo label of the second labeled sample with high business correlation is larger than the value of the pseudo label of the second labeled sample with low business correlation.
4. The semi-supervised model-based classification prediction method of claim 3, wherein the adjusting the pseudo label of the second labeling sample according to the business correlation between the business correlation data and a preset business target comprises:
performing correlation analysis on the service correlation data and a preset service target to obtain service correlation degrees corresponding to the second labeled samples;
sorting the second labeled samples in descending order of their business correlation;
and adjusting the value of the pseudo label of each second labeled sample according to the sequence of the second labeled sample.
5. The semi-supervised model-based classification prediction method of claim 3, wherein the obtaining of the service related data of each second labeled sample in the second labeled sample set comprises:
acquiring service data corresponding to each second labeled sample in a preset time period;
and summarizing the service data to serve as service associated data of the corresponding second labeling sample.
6. The semi-supervised model-based classification and prediction method of claim 1, wherein the dividing the second sample set for multiple times and combining multiple samples to be trained, which are randomly sampled at each dividing process, to obtain a sample set to be trained, comprises:
determining a value of a random seed used to partition the second sample set to be K;
selecting K sample data from the second sample set as a sample to be trained respectively until the second sample set is divided into a preset number of samples to be trained;
changing the value of the random seed, and re-dividing the second sample set by the changed random seed until the number of the samples to be trained meets a preset sample number;
and combining a plurality of samples to be trained obtained by dividing the second sample set to obtain a sample set to be trained.
7. The semi-supervised model-based classification prediction method of claim 1, wherein before obtaining the sample set to be trained, the classification prediction method further comprises:
performing multi-level target division processing on the first labeling sample set to enable each first labeling sample in the first labeling sample set to correspond to a level target; wherein a higher value of the hierarchical target indicates a higher correlation with a preset business target;
correspondingly, the classification training of the preset semi-supervised model based on the sample set to be trained to obtain the trained classification prediction model comprises:
and carrying out classification training on a preset semi-supervised model based on the sample set to be trained and the hierarchical target corresponding to each first labeling sample to obtain a trained classification prediction model.
8. The semi-supervised model-based classification prediction method of claim 7, wherein before obtaining the sample set to be trained, the classification prediction method further comprises:
performing multi-level target division processing on the second labeled sample set, so that each second labeled sample in the second labeled sample set corresponds to a predicted level target;
correspondingly, the classifying and training a preset semi-supervised model based on the sample set to be trained and the level target corresponding to each labeled sample to obtain a trained classification prediction model, including:
and carrying out classification training on a preset semi-supervised model based on the sample set to be trained, the hierarchical target corresponding to each first labeled sample and the hierarchical target corresponding to each second labeled sample to obtain a trained classification prediction model.
9. An electronic device, comprising: at least one processor, and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions for execution by the at least one processor to cause the at least one processor, when executing the instructions, to implement the semi-supervised model based classification prediction method of any one of claims 1 to 8.
10. A computer-readable storage medium having stored thereon computer-executable instructions for performing at least the semi-supervised model-based classification prediction method of any one of claims 1 to 8.
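The multi-level target division recited in claims 7 and 8 can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the claims: the event names and their level values in EVENT_LEVELS, the Sample type, the helper functions, and the use of -1 to mark samples that have no level target. The claims do not state how the sample set to be trained is assembled, so the merge step below is only one plausible reading.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical level scheme (not from the claims): a higher value means the
# observed event is more strongly correlated with the business target.
EVENT_LEVELS: Dict[str, int] = {
    "no_response": 0,
    "clicked": 1,
    "consulted": 2,
    "purchased": 3,
}

UNLABELED = -1  # assumed marker for samples without a level target


@dataclass(frozen=True)
class Sample:
    features: Tuple[float, ...]
    event: Optional[str] = None  # deepest observed event; None if unlabeled
    level: int = UNLABELED


def divide_into_levels(samples: Sequence[Sample],
                       levels: Dict[str, int] = EVENT_LEVELS) -> List[Sample]:
    """Attach a level target to every labeled sample."""
    divided = []
    for s in samples:
        if s.event is None:
            raise ValueError("labeled sample is missing its event")
        # replace() returns a copy, so the input samples are not mutated.
        divided.append(replace(s, level=levels[s.event]))
    return divided


def build_training_arrays(first_labeled: Sequence[Sample],
                          second_labeled: Sequence[Sample],
                          unlabeled: Sequence[Sample]):
    """Merge the three sets into feature rows X and level targets y."""
    merged = (divide_into_levels(first_labeled)
              + divide_into_levels(second_labeled)
              + list(unlabeled))
    X = [list(s.features) for s in merged]
    y = [s.level for s in merged]
    return X, y


first = [Sample((0.9, 0.1), "purchased"), Sample((0.2, 0.7), "no_response")]
second = [Sample((0.6, 0.3), "clicked")]  # event assumed to be model-predicted
rest = [Sample((0.5, 0.5))]               # no label, keeps level -1

X, y = build_training_arrays(first, second, rest)
print(y)  # [3, 0, 1, -1]
```

The resulting pair (X, y) has the shape that a semi-supervised classifier typically consumes, with y holding the level targets. Marking unlabeled rows with -1 is a convention borrowed from common semi-supervised libraries, not something the claims specify.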
CN202210688575.1A 2022-06-17 2022-06-17 Semi-supervised model-based classification prediction method, equipment and storage medium Pending CN114997322A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210688575.1A CN114997322A (en) 2022-06-17 2022-06-17 Semi-supervised model-based classification prediction method, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210688575.1A CN114997322A (en) 2022-06-17 2022-06-17 Semi-supervised model-based classification prediction method, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114997322A (en) 2022-09-02

Family

ID=83034703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210688575.1A Pending CN114997322A (en) 2022-06-17 2022-06-17 Semi-supervised model-based classification prediction method, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114997322A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147420A (en) * 2022-09-05 2022-10-04 北方健康医疗大数据科技有限公司 Inter-slice correlation detection model training method, detection method and electronic equipment
CN117497064A (en) * 2023-12-04 2024-02-02 电子科技大学 Single-cell three-dimensional genome data analysis method based on semi-supervised learning

Similar Documents

Publication Publication Date Title
Sculley et al. Detecting adversarial advertisements in the wild
CN113822494A (en) Risk prediction method, device, equipment and storage medium
CN114997322A (en) Semi-supervised model-based classification prediction method, equipment and storage medium
CN110019790A (en) Text identification, text monitoring, data object identification, data processing method
Lawrence et al. Data generating process to evaluate causal discovery techniques for time series data
CN111160959A (en) User click conversion estimation method and device
CN113449012A (en) Internet service mining method based on big data prediction and big data prediction system
CN110708285A (en) Flow monitoring method, device, medium and electronic equipment
Hassanat et al. Magnetic force classifier: a Novel Method for Big Data classification
CN114610475A (en) Training method of intelligent resource arrangement model
Moniz et al. A framework for recommendation of highly popular news lacking social feedback
CN116523622A (en) Object risk prediction method and device, electronic equipment and storage medium
CN116707859A (en) Feature rule extraction method and device, and network intrusion detection method and device
Olorunnimbe et al. Intelligent adaptive ensembles for data stream mining: a high return on investment approach
Liapis et al. A multivariate ensemble learning method for medium-term energy forecasting
CN115099344A (en) Model training method and device, user portrait generation method and device, and equipment
Vasti et al. Classification and analysis of real-world earthquake data using various machine learning algorithms
CN111784069B (en) User preference prediction method, device, equipment and storage medium
CN111159397B (en) Text classification method and device and server
Walkowiak et al. Utilizing local outlier factor for open-set classification in high-dimensional data-case study applied for text documents
Gómez-Boix et al. Consumer segmentation through multi-instance clustering time-series energy data from smart meters
Upadhyay et al. A machine learning approach in 5G user prediction
Pal et al. Comparing various classifier techniques for efficient mining of data
CN111552827A (en) Labeling method and device, and behavior willingness prediction model training method and device
Meyer et al. Categorizing Learning Objects Based On Wikipedia as Substitute Corpus.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination