CN114565030A - Feature screening method and device, electronic equipment and storage medium

Feature screening method and device, electronic equipment and storage medium

Info

Publication number
CN114565030A
CN114565030A
Authority
CN
China
Prior art keywords
samples
sample
target
pseudo
prediction accuracy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210146248.3A
Other languages
Chinese (zh)
Other versions
CN114565030B (en)
Inventor
李硕
张巨岩
许韩晨玺
许海洋
岳洪达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202210146248.3A priority Critical patent/CN114565030B/en
Publication of CN114565030A publication Critical patent/CN114565030A/en
Application granted granted Critical
Publication of CN114565030B publication Critical patent/CN114565030B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211 — Selection of the most significant subset of features
    • G06F18/2113 — Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides a feature screening method and apparatus, an electronic device, and a storage medium, and relates to the field of artificial intelligence, in particular to deep learning and financial wind control. The specific implementation scheme is as follows: obtaining a plurality of first samples, where each first sample includes feature values corresponding to a plurality of candidate features and has a corresponding real label; obtaining at least one second sample corresponding to each first sample, where a second sample has the same feature values as its corresponding first sample; generating pseudo labels corresponding to the second samples based on the first samples and the real labels; determining the importance of the candidate features in the corresponding first samples based on the second samples and the pseudo labels; and screening the candidate features according to the importance to obtain the target features. In this way, target features that strongly affect the performance of the wind control model are screened out of the candidate features, and target features suited to different scenarios can be obtained.

Description

Feature screening method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to deep learning and financial wind control (risk control), and more particularly to a feature screening method and apparatus, an electronic device, and a storage medium.
Background
As machine learning technology advances, machine learning models are applied ever more widely in the field of financial wind control. Screening out, from a large number of features, those that strongly affect the performance of a wind control model, and then training the model with the screened features, is of great significance for improving the accuracy of the model's predictions.
Disclosure of Invention
The disclosure provides a method, an apparatus, an electronic device and a storage medium for feature screening.
According to an aspect of the present disclosure, there is provided a feature screening method, the method including: obtaining a plurality of first samples, wherein the first samples comprise feature values corresponding to a plurality of candidate features, and the first samples have corresponding real labels; obtaining at least one second sample corresponding to the plurality of first samples respectively, wherein the second sample and the corresponding first sample have the same characteristic value; generating a pseudo label corresponding to a second sample based on a plurality of the first samples and the real labels; determining the importance of a plurality of candidate features in the corresponding first sample based on a plurality of the second samples and the pseudo labels; and screening the candidate characteristics according to the importance degree to obtain target characteristics.
According to another aspect of the present disclosure, there is provided a feature screening apparatus, the apparatus including: a first obtaining module, configured to obtain a plurality of first samples, where each first sample includes feature values corresponding to a plurality of candidate features and has a corresponding real label; a second obtaining module, configured to obtain at least one second sample corresponding to each of the plurality of first samples, where a second sample has the same feature values as its corresponding first sample; a generating module, configured to generate pseudo labels corresponding to the second samples based on the first samples and the real labels; a determining module, configured to determine the importance of the candidate features in the corresponding first samples based on the second samples and the pseudo labels; and a screening module, configured to screen the candidate features according to the importance to obtain the target features.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the feature screening method of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the feature screening method disclosed in embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of the feature screening method of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
Fig. 1 is a schematic flow chart of a feature screening method according to a first embodiment of the present disclosure;
Fig. 2 is a schematic flow chart of a feature screening method according to a second embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a feature screening apparatus according to a third embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a feature screening apparatus according to a fourth embodiment of the present disclosure;
Fig. 5 is a block diagram of an electronic device for implementing a feature screening method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of embodiments of the present disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the technical solution of the present disclosure, the acquisition, storage, and application of users' personal information comply with the relevant laws and regulations and do not violate public order or good custom.
With the continuous progress of machine learning technology, machine learning models are increasingly used in the field of financial wind control. For example, a wind control model can be used for assessing financial risk. A bank can formulate a corresponding wind control strategy according to the model's scoring results, realizing risk control over various credit products. The accuracy of the wind control model's predictions is therefore crucial to formulating a reasonable wind control strategy.
Take an external joint-modeling scenario as an example. Because of data privacy constraints in the financial wind control field, neither party to the joint modeling can obtain the other party's real labels. To realize external joint modeling, one party therefore usually provides features and the other party provides labels, and sample modeling is performed in an encryption environment using the features from one party and the labels from the other. The encryption environment limits the data size: because its computing memory is limited, the model can be trained with only a small subset of features. It is thus necessary to select, from the many features, a small number that strongly affect model performance and train the model with the selected features to improve its performance. Here, model performance means the accuracy of the model's risk-assessment results.
In the related art, features are generally screened in an unsupervised manner using certain statistical indicators of the features. For example, a common approach is to determine the coverage rate of each feature and select the features with higher coverage for training the wind control model. Because a feature's coverage rate is fixed regardless of the scenario, this approach cannot screen out different features for different scenarios.
The present disclosure provides a feature screening method, apparatus, electronic device, non-transitory computer-readable storage medium, and computer program product for risk assessment, which screen target features from a plurality of candidate features based on data enhancement. A plurality of first samples are obtained, each including feature values corresponding to the candidate features and having a corresponding real label. At least one second sample is obtained for each first sample, each second sample having the same feature values as its corresponding first sample. Pseudo labels corresponding to the second samples are generated based on the first samples and the real labels; the importance of the candidate features in the corresponding first samples is determined based on the second samples and the pseudo labels; and the candidate features are screened by importance to obtain the target features. In this way, the target features that strongly affect the performance of the wind control model are screened out of the candidate features, so that training the wind control model with the target features improves its performance. Moreover, target features suited to different scenarios can be screened out, so that the wind control model for each scenario is trained with the features best suited to that scenario, further improving the model's performance in that scenario.
The disclosure provides a feature screening method, a feature screening device, an electronic device, a non-transitory computer readable storage medium and a computer program product, which relate to the technical field of artificial intelligence, in particular to the technical field of deep learning and financial wind control.
Artificial intelligence is the discipline of making computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.
Financial wind control means that financial risk managers adopt various measures and methods to reduce or eliminate the risk events that may arise in the financial transaction process, or to reduce the losses those events cause. Financial wind control is an important link in the financial transaction process.
The feature screening method, apparatus, electronic device, non-transitory computer-readable storage medium, and computer program product of the embodiments of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of a feature screening method according to a first embodiment of the present disclosure. It should be noted that the execution body of the feature screening method of this embodiment is a feature screening apparatus, which may be implemented by software and/or hardware and may be configured in an electronic device. The electronic device may include, but is not limited to, terminal devices such as smartphones and computers, servers, and the like; this embodiment does not specifically limit the electronic device.
As shown in fig. 1, the feature screening method may include:
step 101, a plurality of first samples are obtained, wherein the first samples comprise feature values corresponding to a plurality of candidate features, and the first samples have corresponding real tags.
The candidate features are the features from which the target features are to be screened. A feature is an attribute referred to for risk assessment and may indicate an attribute of the user, such as age, gender, whether a certain application is installed on the terminal device used by the user, or whether the user purchased goods at a certain store. Accordingly, the candidate features may include various features of these kinds.
The feature value corresponding to a candidate feature may represent an attribute value of the user. For example, for the candidate feature age, the feature value is the user's age; for gender, it is whether the user is male or female; for whether a certain application is installed on the user's terminal device, it is whether that application is installed; and for whether the user purchased a commodity at a certain store, it is whether the user made that purchase.
The real label marks the category of a first sample. It should be noted that the categories of the real labels correspond to the usage scenario of the wind control model. For example, if the wind control model is used to predict whether a user will repay on time, the real label's categories may be repays on time and does not repay on time; if it is used to predict whether debt collection is required for a user, the categories may be collect before repayment is due and do not collect before repayment is due.
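To make the data layout concrete, the following minimal Python sketch (purely illustrative — the patent prescribes no particular representation, and all feature names and the label encoding are hypothetical) shows first samples as feature-value records with binary real labels:

```python
# Illustrative only: a possible in-memory layout for first samples.
# Feature names and the label encoding (1 = does not repay on time,
# 0 = repays on time) are hypothetical, not taken from the patent.
first_samples = [
    {"age": 35, "gender": "male", "app_installed": 1, "bought_at_store": 0},
    {"age": 28, "gender": "female", "app_installed": 0, "bought_at_store": 1},
]
real_labels = [1, 0]  # one real label per first sample

# Every first sample carries a value for every candidate feature.
candidate_features = {"age", "gender", "app_installed", "bought_at_store"}
assert all(set(s) == candidate_features for s in first_samples)
assert len(real_labels) == len(first_samples)
```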
It should be noted that the feature screening apparatus in the embodiment of the present disclosure may obtain the first sample in various public and legal compliance manners, for example, the first sample may be obtained from a public data set, or the first sample may also be obtained from a user after authorization of the user, which is not limited in this disclosure. It should be noted that the attribute value of the user and the real tag of the first sample in this embodiment are not the attribute value and the real tag for a specific user, and cannot reflect the personal information of a specific user.
And 102, acquiring at least one second sample corresponding to the plurality of first samples respectively, wherein the second sample and the corresponding first sample have the same characteristic value.
The plurality of first samples may be all the first samples obtained by the feature screening apparatus, or may be part of the first samples obtained by the feature screening apparatus, which is not limited in the present disclosure.
In the embodiment of the present disclosure, taking the plurality of first samples to be all first samples obtained by the feature screening apparatus as an example, for each first sample, the feature values corresponding to its candidate features may be copied at least once to generate at least one second sample corresponding to that first sample.
For example, assuming that the feature screening apparatus obtains 100 first samples, each of which includes feature values corresponding to a plurality of candidate features, for each first sample, 10 copies of the feature values corresponding to the plurality of candidate features included in the first sample may be performed, so as to generate, for each first sample, 10 second samples having the same feature value as that of the first sample, thereby obtaining 1000 second samples.
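The copying step above can be sketched in a few lines of Python (an illustrative sketch only; `make_second_samples` is a hypothetical helper name, not from the patent):

```python
import copy

def make_second_samples(first_samples, copies_per_sample=10):
    """Generate second samples by copying each first sample's feature values."""
    second, parent_index = [], []
    for i, sample in enumerate(first_samples):
        for _ in range(copies_per_sample):
            second.append(copy.deepcopy(sample))  # same feature values as parent
            parent_index.append(i)                # remember the originating first sample
    return second, parent_index

# 100 first samples, 10 copies each -> 1000 second samples, as in the example above.
firsts = [{"f1": i, "f2": i * 2} for i in range(100)]
seconds, parents = make_second_samples(firsts, 10)
assert len(seconds) == 1000
```

Keeping the `parent_index` alongside the copies is one simple way to preserve the correspondence between each second sample and its first sample, which later steps rely on.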
Step 103, generating a pseudo label corresponding to the second sample based on the plurality of first samples and the real labels.
The plurality of first samples may be all the first samples obtained by the feature screening apparatus, or may be a part of the first samples obtained by the feature screening apparatus, which is not limited in the present disclosure.
And the pseudo label is used for marking the category of the second sample. Note that the type of the pseudo label also corresponds to the usage scenario of the wind control model.
Take an external joint-modeling scenario in the financial wind control field as an example. Because of data privacy constraints, the feature screening apparatus cannot directly obtain the real labels corresponding to the first samples and use them to screen the candidate features; the real labels can, however, be accessed within the encryption environment. In the embodiment of the disclosure, therefore, the pseudo labels corresponding to the second samples can be generated based on the first samples and the real labels in combination with the encryption environment, and the candidate features are then screened using those pseudo labels through the following steps.
And 104, determining the importance of the candidate features in the corresponding first sample based on the second samples and the pseudo labels.
The plurality of second samples may be all of the second samples obtained in step 102, or may be part of the second samples obtained in step 102, which is not limited in this disclosure. The plurality of candidate features may be all candidate features in step 101 or may be part of candidate features, which is not limited in this disclosure.
And 105, screening a plurality of candidate features according to the importance degree to obtain the target feature.
The importance characterizes how strongly a candidate feature influences the prediction for the corresponding sample.
In one embodiment of the disclosure, after the pseudo labels corresponding to the plurality of second samples are generated, the importance of the candidate features in the corresponding first samples is determined based on the second samples and the pseudo labels, and the candidate features whose importance exceeds a preset importance threshold are screened out as the target features. The preset importance threshold may be set arbitrarily as required; the disclosure is not limited in this respect.
In another embodiment, after the pseudo labels of the second samples are generated, the importance of the candidate features in the corresponding first samples is determined based on each second sample and its pseudo label, the candidate features are sorted in descending order of importance, and the top preset number of candidate features are taken as the target features. The preset number may likewise be set arbitrarily as required; the disclosure is not limited in this respect.
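Both screening strategies — thresholding and top-k ranking — can be sketched as follows (illustrative Python; the importance scores, feature names, and function names are hypothetical):

```python
def screen_by_threshold(importance, threshold):
    """Keep candidate features whose importance exceeds the preset threshold."""
    return [f for f, score in importance.items() if score > threshold]

def screen_top_k(importance, k):
    """Keep the k candidate features with the highest importance."""
    return sorted(importance, key=importance.get, reverse=True)[:k]

# Hypothetical importance scores for four candidate features.
importance = {"age": 0.42, "gender": 0.05, "app_installed": 0.31, "store_purchase": 0.12}

assert screen_by_threshold(importance, 0.2) == ["age", "app_installed"]
assert screen_top_k(importance, 2) == ["age", "app_installed"]
```

With these scores the two strategies happen to agree; in general, thresholding yields a variable number of target features while top-k fixes the count in advance.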
Through the above process, the target features are screened from the candidate features according to their respective importance. Because the importance of the candidate features in a first sample is determined based on the corresponding second samples and pseudo labels, a target feature with higher importance is one that strongly influences how the second samples' pseudo labels are determined. And because a second sample's pseudo label is generated based on the corresponding first sample and its real label, with a high positive correlation between the two, the target feature also strongly influences how the first sample's real label is determined. Since the categories of the real labels correspond to the usage scenario of the wind control model, the target features screened from the candidate features by importance are exactly the features with high influence on the model's performance, and training the wind control model with the target features can improve that performance. Furthermore, because the real labels of the first samples differ across scenarios, the technical scheme of the embodiments of the disclosure can screen out target features suited to each scenario, so that the wind control model for a scenario is trained with the features best suited to it, further improving the model's performance in that scenario.
In summary, the feature screening method of the embodiments of the present disclosure obtains a plurality of first samples, each including feature values corresponding to a plurality of candidate features and having a corresponding real label; obtains at least one second sample corresponding to each first sample, each second sample having the same feature values as its corresponding first sample; generates pseudo labels corresponding to the second samples based on the first samples and the real labels; determines the importance of the candidate features in the corresponding first samples based on the second samples and the pseudo labels; and screens the candidate features by importance to obtain the target features. The method thus screens, from the candidate features, the target features that strongly affect the performance of the wind control model, so that training the model with the target features improves its performance. It can also screen out target features suited to different scenarios, so that the wind control model for each scenario is trained with the features best suited to it, further improving the model's performance in that scenario.
With reference to fig. 2, the following further describes how the feature screening method of the present disclosure generates the pseudo labels corresponding to the second samples based on the first samples and the real labels, and determines the importance of the candidate features in the corresponding first samples based on the second samples and the pseudo labels.
Fig. 2 is a schematic flow chart diagram of a feature screening method according to a second embodiment of the present disclosure. As shown in fig. 2, the feature screening method may include the following steps:
step 201, obtaining a plurality of first samples, where the first samples include feature values corresponding to a plurality of candidate features, and the first samples have corresponding real tags.
Step 202, at least one second sample corresponding to each of the plurality of first samples is obtained, wherein the second sample and the corresponding first sample have the same characteristic value.
The specific implementation process and principle of step 201-202 may refer to the description of the foregoing embodiments, and are not described herein again.
Step 203, determining the probability of generating the pseudo label corresponding to the first sample based on the plurality of first samples and the corresponding real labels.
The probability that a first sample generates a pseudo label can be understood as the probability that its pseudo label equals a preset pseudo label. Taking a binary classification wind control model as an example, suppose its two prediction categories are labeled with pseudo labels 1 and 0 and the preset pseudo label is 1; the probability that a first sample generates a pseudo label is then the probability that its pseudo label is 1.
For example, if the wind control model is used to predict whether a user will repay on time, the pseudo labels may be 1 and 0, where 1 indicates the user does not repay on time and 0 indicates the user repays on time. The probability that a first sample generates a pseudo label is then the probability that its pseudo label is 1, i.e., the probability that the corresponding user does not repay on time.
Again taking the external joint-modeling scenario in the financial wind control field as an example: because of data privacy constraints, the feature screening apparatus cannot directly obtain the real labels corresponding to the first samples for subsequent processing, but the real labels can be accessed within the encryption environment. In the embodiment of the disclosure, therefore, the probability that each first sample generates a pseudo label can be determined based on the first samples and the real labels in combination with the encryption environment.
In an embodiment of the present disclosure, the probability that each of the plurality of first samples generates a pseudo label may be determined as follows: dividing the first samples into a training set, a verification set, and a test set, and determining the number ratio of first samples across the three sets; training a first initial model in the encryption environment using the feature values of at least one first feature included in the first samples of the training set and the corresponding real labels, to obtain a trained first target model, where the first features are screened from the candidate features; inputting the first samples into the first target model to obtain the confidence of the category to which each first sample belongs; for each of the training set, verification set, and test set, determining the prediction accuracy of the first target model for the categories of that set's samples according to the confidences and the corresponding real labels; and determining the probability that each first sample generates a pseudo label according to the confidences of the categories to which the first samples belong, the prediction accuracies, and the number ratio.
The quantity proportion is the ratio of the numbers of first samples in the training set, the verification set and the test set after the plurality of first samples are divided among them. The plurality of first samples may be all the first samples obtained by the feature screening apparatus. For example, if the feature screening apparatus obtains 100 first samples, of which 80 are divided into the training set, 10 into the verification set and 10 into the test set, the quantity proportion of the first samples in the training set, the verification set and the test set is 8:1:1.
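The splitting step above can be sketched as follows. This is a minimal illustration; the function name and the positional splitting rule are assumptions for the example, not part of the disclosed method.

```python
# Sketch: divide the first samples into train/verification/test sets and
# record the quantity proportion. Names are illustrative only.

def split_first_samples(samples, n_train, n_valid):
    """Split a list of first samples into three subsets and return them
    together with their quantity proportion (fractions summing to 1)."""
    train = samples[:n_train]
    valid = samples[n_train:n_train + n_valid]
    test = samples[n_train + n_valid:]
    total = len(samples)
    proportion = (len(train) / total, len(valid) / total, len(test) / total)
    return train, valid, test, proportion

# 100 first samples split 80/10/10 gives the 8:1:1 quantity proportion
# described in the example above.
train, valid, test, prop = split_first_samples(list(range(100)), 80, 10)
```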
The first initial model may be any model capable of performing category prediction on the first sample, which is not limited in this disclosure.
In the embodiment of the present disclosure, any feature screening method in the related art may be adopted to screen the at least one first feature from the plurality of candidate features, and then the feature values corresponding to the at least one first feature included in the first samples of the training set, together with the real labels of those first samples, are used to train the first initial model in the encryption environment, so as to obtain the trained first target model. After the trained first target model is obtained, the plurality of first samples may be respectively input into the first target model, so that the first target model performs category prediction on each first sample and outputs the confidence of the category to which that first sample belongs. The first samples input into the first target model may be part or all of the first samples obtained by the feature screening apparatus, which is not limited by the present disclosure. Likewise, the first samples of the training set used for training may be part of the first samples in the training set, or all of them in order to improve the prediction accuracy of the first target model, which is not limited by the present disclosure.
When the first initial model is trained, deep learning may be used, for example, since deep learning tends to perform better on large data sets than other machine learning methods. The process of training the first initial model to obtain the trained first target model may refer to the related art, and is not described here again.
Furthermore, the prediction accuracy of the first target model to the class to which the samples of the corresponding set belong can be respectively determined according to the confidence coefficient and the corresponding real label of the class to which at least one first sample belongs included in any one of the training set, the verification set and the test set, and then the probability of generating the pseudo label corresponding to the first sample is determined according to the confidence coefficient, the prediction accuracy and the quantity proportion of the classes to which the plurality of first samples belong.
That is, the prediction accuracy of the first target model for the categories to which the samples of the training set belong may be determined from the real labels of the first samples in the training set and the confidences of the categories to which those first samples belong; the prediction accuracies for the verification set and the test set may be determined in the same way from the first samples of the respective set.
It should be noted that, taking the wind control model as a two-class model in which the categories labeled by both the real labels and the pseudo labels include category one and category two, the prediction accuracy of the first target model for the categories to which the samples of any one of the training set, the verification set and the test set belong includes both its prediction accuracy for category one and its prediction accuracy for category two within that set.

Under the same assumption, the process of determining the prediction accuracy of the first target model for the categories to which the samples of the training set belong is described below.
Assume that when the confidence of the category to which a first sample belongs, as output by the first target model, is greater than 0.5, the first sample belongs to category one and is marked as 1; and when that confidence is not greater than 0.5, the first sample belongs to category two and is marked as 0. Then the proportion of first samples whose real label is 1 among all first samples marked as 1 in the training set is taken as the prediction accuracy of the first target model for the training-set samples belonging to category one; and the proportion of first samples whose real label is 1 among all first samples marked as 0 in the training set is taken as the prediction accuracy for the training-set samples belonging to category two. For example, if the training set includes 30 first samples marked as 1, of which 20 have a real label of 1, the prediction accuracy of the first target model for the training-set samples belonging to category one is 2/3.
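The per-mark "prediction accuracy" just described can be sketched as below. The function name and the synthetic confidences/labels are illustrative assumptions; the 0.5 threshold and the worked 2/3 example come from the description above.

```python
# Sketch of the per-mark "prediction accuracy": among the training-set
# samples carrying a given mark, the fraction whose real label is 1.

def prediction_accuracy(confidences, true_labels, mark):
    """P(label = 1 | mark), where mark is 1 when confidence > 0.5, else 0."""
    marks = [1 if c > 0.5 else 0 for c in confidences]
    selected = [y for m, y in zip(marks, true_labels) if m == mark]
    if not selected:
        return 0.0
    return sum(1 for y in selected if y == 1) / len(selected)

# 30 samples marked 1, of which 20 have real label 1 -> accuracy 2/3,
# matching the worked example above.
confs = [0.9] * 30 + [0.1] * 10          # 30 samples marked 1, 10 marked 0
labels = [1] * 20 + [0] * 10 + [0] * 10  # 20 of the marked-1 samples are truly 1
acc_one = prediction_accuracy(confs, labels, mark=1)
```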
It should be noted that, in the embodiment of the present disclosure, for any first sample there may be two cases: first, it can be determined which of the training set, the verification set and the test set the first sample belongs to; second, this cannot be determined. The following describes, for both cases, the process of determining the probability that a first sample generates a pseudo label from the confidences of the categories to which the plurality of first samples belong, the prediction accuracies and the quantity proportion, again taking the wind control model as a two-class model in which the categories labeled by both the real labels and the pseudo labels include category one and category two. The plurality of first samples here may be all or part of the first samples obtained by the feature screening apparatus, which is not limited in the present disclosure.
In the embodiment of the present disclosure, when it can be determined which of the training set, the verification set and the test set each of the plurality of first samples belongs to, the probability that each first sample generates a pseudo label may be determined according to the confidences, the prediction accuracies and the quantity proportion as follows: for the first samples included in any one of the training set, the verification set and the test set, the prediction accuracy of the first target model for the categories to which the samples of that set belong is taken as the prediction accuracy for the categories to which those first samples belong; and the probability that each such first sample generates a pseudo label is determined from that prediction accuracy, the confidence and the quantity proportion.
Taking a first sample in the training set as an example, the prediction accuracy of the first target model for the categories to which the training-set samples belong may be taken as the prediction accuracy for the category to which that first sample belongs; the probability that the first sample belongs to the training set is determined from the quantity proportion; and the probability that the first sample generates a pseudo label is then determined from this prediction accuracy, the confidence of the category to which the first sample belongs, and the probability that the first sample belongs to the training set. Here, the prediction accuracy for the categories of the training-set samples includes both the prediction accuracy for category one and the prediction accuracy for category two.
Specifically, the probability of generating the pseudo label for the nth first sample in the training set can be determined by the following formula (1).
P(label_n = 1 | S_n ∈ train) = Σ_i P(label_n = 1 | S_n ∈ train, pred_n = l_i) · P(pred_n = l_i | S_n ∈ train)    (1)
Wherein label_n denotes the real label of the nth first sample; S_n denotes the nth first sample; l_i denotes a mark value (0 or 1); and pred_n denotes the mark derived from the confidence of the category to which the nth first sample belongs. P(label_n = 1 | S_n ∈ train) is the probability that the nth first sample in the training set generates a pseudo label. P(label_n = 1 | S_n ∈ train, pred_n = l_i) is the proportion of first samples with real label 1 among the training-set samples marked l_i, i.e. the prediction accuracy of the first target model for the corresponding category of the training set. P(pred_n = l_i | S_n ∈ train) is derived from the confidence of the category to which the nth first sample belongs: when the confidence output by the first target model is the confidence that the sample belongs to category one, P(pred_n = 1 | S_n ∈ train) equals that confidence, and P(pred_n = 0 | S_n ∈ train) equals 1 − P(pred_n = 1 | S_n ∈ train).
It should be noted that, when it can be determined which of the training set, the verification set and the test set the nth first sample belongs to, the probability that the nth first sample belongs to that set is 1; therefore this probability does not appear explicitly in formula (1).
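Formula (1) can be evaluated numerically as below. The function name and the example values (confidence 0.8, accuracies 0.9 and 0.2) are assumptions for illustration; the summation over the two mark values follows the formula.

```python
# Numerical sketch of formula (1): the probability that the nth first
# sample in the training set generates a pseudo label, obtained by
# summing over the two mark values l_i in {0, 1}.

def pseudo_label_probability(conf_one, acc_by_mark):
    """conf_one: the model's confidence that the sample belongs to
    category one, i.e. P(pred_n = 1 | S_n in train).
    acc_by_mark[l]: the training-set prediction accuracy
    P(label_n = 1 | S_n in train, pred_n = l)."""
    p_pred = {1: conf_one, 0: 1.0 - conf_one}
    return sum(acc_by_mark[l] * p_pred[l] for l in (0, 1))

# Example: confidence 0.8 for category one; accuracies 0.9 (marked 1)
# and 0.2 (marked 0). The result is 0.9*0.8 + 0.2*0.2 = 0.76.
p = pseudo_label_probability(0.8, {1: 0.9, 0: 0.2})
```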
For the first samples in the verification set and the test set, the probability of generating the corresponding pseudo label is determined from the confidence, the prediction accuracy and the quantity proportion in a manner similar to that for the first samples in the training set, and is not repeated here.
Through the above process, the probabilities that the plurality of first samples generate pseudo labels are accurately determined when it can be determined which of the training set, the verification set and the test set each first sample belongs to.
In the embodiment of the present disclosure, when it cannot be determined which of the training set, the verification set and the test set each of the plurality of first samples belongs to, the probability that each first sample generates a pseudo label may be determined according to the confidences, the prediction accuracies and the quantity proportion as follows: for each first sample, the prediction accuracy of the first target model for the categories of the training-set samples is taken as the first prediction accuracy of that first sample in the training set; the prediction accuracy for the verification set is taken as its second prediction accuracy in the verification set; the prediction accuracy for the test set is taken as its third prediction accuracy in the test set; and the probability that the first sample generates a pseudo label is determined from the first, second and third prediction accuracies, the confidence and the quantity proportion.
Taking one first sample as an example, the prediction accuracy of the first target model for the categories of the training-set samples may be taken as the first prediction accuracy of the category to which the first sample belongs in the training set, the prediction accuracy for the verification set as the second prediction accuracy, and the prediction accuracy for the test set as the third prediction accuracy. The probabilities that the first sample belongs to the training set, the verification set and the test set are determined from the quantity proportion. The probability that the first sample generates a pseudo label is then determined from the first, second and third prediction accuracies, the confidence of the category to which the first sample belongs, and the probabilities that the first sample belongs to the training set, the verification set and the test set.
Specifically, the probability that the nth first sample generates a pseudo label can be determined by the following formula (2).
P(label_n = 1) = Σ_{set_i ∈ {train, valid, oot}} Σ_i P(label_n = 1 | S_n ∈ set_i, pred_n = l_i) · P(pred_n = l_i | S_n ∈ set_i) · P(S_n ∈ set_i)    (2)
Wherein label_n denotes the real label of the nth first sample; S_n denotes the nth first sample; l_i denotes a mark value; pred_n denotes the mark derived from the confidence of the category to which the nth first sample belongs; and P(label_n = 1) denotes the probability that the nth first sample generates a pseudo label. train denotes the training set, valid the verification set, and oot the test set. When set_i is the training set, P(label_n = 1 | S_n ∈ set_i, pred_n = l_i) denotes the proportion of first samples with real label 1 among the training-set samples marked l_i, i.e. the prediction accuracy of the first target model for the corresponding category of the training set; when set_i is the verification set or the test set, it denotes the corresponding proportion, i.e. the prediction accuracy of the first target model for the categories of that set.
P(pred_n = l_i | S_n ∈ set_i) denotes the confidence-derived mark probability of the nth first sample in the training set, verification set or test set. When the confidence output by the first target model is the confidence that the sample belongs to category one, P(pred_n = 1 | S_n ∈ set_i) equals that confidence, and P(pred_n = 0 | S_n ∈ set_i) equals 1 − P(pred_n = 1 | S_n ∈ set_i).
P(S_n ∈ set_i) denotes the probability that the first sample belongs to the training set, the verification set or the test set. Taking the quantity proportion 8:1:1 as an example, the probability that the first sample belongs to the training set is 8/10, and the probabilities that it belongs to the verification set and the test set are each 1/10.
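Formula (2) can likewise be evaluated numerically, now weighting each set by its membership probability from the quantity proportion. All concrete values below are illustrative assumptions; with identical per-set confidences and accuracies, the result collapses to the same value as formula (1) would give.

```python
# Numerical sketch of formula (2): set membership is unknown, so the sum
# runs over set_i in {train, valid, oot} and marks l_i in {0, 1},
# weighted by P(S_n in set_i) taken from the quantity proportion.

def pseudo_label_probability_unknown(conf_by_set, acc_by_set_mark, set_probs):
    """conf_by_set[s]: P(pred_n = 1 | S_n in s);
    acc_by_set_mark[s][l]: P(label_n = 1 | S_n in s, pred_n = l);
    set_probs[s]: P(S_n in s)."""
    total = 0.0
    for s, p_set in set_probs.items():
        p_pred = {1: conf_by_set[s], 0: 1.0 - conf_by_set[s]}
        total += p_set * sum(acc_by_set_mark[s][l] * p_pred[l] for l in (0, 1))
    return total

set_probs = {"train": 0.8, "valid": 0.1, "oot": 0.1}  # 8:1:1 proportion
conf = {s: 0.8 for s in set_probs}                    # same confidence per set
acc = {s: {1: 0.9, 0: 0.2} for s in set_probs}        # same accuracies per set
p = pseudo_label_probability_unknown(conf, acc, set_probs)
```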
Through the above process, the probabilities that the plurality of first samples generate pseudo labels are accurately determined even when it cannot be determined which of the training set, the verification set and the test set each first sample belongs to.
By determining the quantity proportion of the first samples in the training set, the verification set and the test set, training the first initial model in the encryption environment to obtain the trained first target model, using the first target model to obtain the confidences of the categories to which the plurality of first samples belong, determining for each of the three sets the prediction accuracy of the first target model for the categories to which its samples belong according to those confidences and the corresponding real labels, and determining the probability that each first sample generates a pseudo label from the confidences, the prediction accuracies and the quantity proportion, the probabilities that the plurality of first samples generate pseudo labels are accurately determined. Moreover, since these probabilities are determined based on the confidences of the categories to which the first samples belong, and those confidences are in turn grounded in the real labels of the first samples, generating the pseudo labels of the corresponding second samples from these probabilities yields pseudo labels with a high positive correlation to the real labels of the corresponding first samples.
And step 204, generating a pseudo label corresponding to the second sample based on the probability of generating the pseudo label by the plurality of first samples.
The plurality of first samples may be all the first samples obtained by the feature screening apparatus, or part of them, which is not limited in the present disclosure.
The following describes the process of generating the pseudo labels of the at least one second sample corresponding to each of the plurality of first samples, again taking the wind control model as a two-class model in which the categories labeled by both the real labels and the pseudo labels include category one and category two, with pseudo label 1 and pseudo label 0 respectively marking the two prediction categories of the wind control model.
In an embodiment of the present disclosure, for a given first sample, when the number of second samples corresponding to that first sample is 1: if the probability that the first sample generates a pseudo label is greater than a preset threshold, the pseudo label of the corresponding second sample may be set to a preset pseudo label; otherwise, it may be set to the other pseudo label. The preset pseudo label may be 1, and the preset threshold may be set as required.
For a given first sample, when multiple second samples correspond to that first sample, the number of those second samples whose pseudo label is the preset pseudo label and the number whose pseudo label is the other pseudo label may be determined according to the probability that the first sample generates a pseudo label, and the corresponding numbers of second samples may then be randomly assigned the preset pseudo label or the other pseudo label. The proportion of second samples carrying the preset pseudo label among all the second samples of that first sample equals the probability that the first sample generates a pseudo label. The preset pseudo label may be 1.
For example, assuming that the probability of generating the pseudo label for a certain first sample is 0.7, the number of the second samples corresponding to the first sample is 10, and the preset pseudo label is 1, it may be determined that, of the 10 second samples, the number of the second samples with the pseudo label of 1 is 7, and the number of the second samples with the pseudo label of 0 is 3, so that the pseudo labels of the random 7 second samples in the 10 second samples may be set to 1, and the pseudo labels of the other 3 second samples may be set to 0.
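The count-based assignment in the example above can be sketched as follows. The function name, the `round` rule for the count, and the fixed seed are assumptions for the sketch; the 0.7 / 10-sample case reproduces the worked example.

```python
import random

# Sketch of count-based pseudo-label generation for a first sample with
# several second samples: p * n second samples receive the preset pseudo
# label 1 and the remainder receive 0, at random positions.

def assign_pseudo_labels(n_second, p_pseudo, rng=None):
    rng = rng or random.Random(0)
    n_ones = round(p_pseudo * n_second)          # count of preset labels
    labels = [1] * n_ones + [0] * (n_second - n_ones)
    rng.shuffle(labels)                          # random placement
    return labels

# Probability 0.7 with 10 second samples -> 7 labeled 1 and 3 labeled 0.
labels = assign_pseudo_labels(10, 0.7)
```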
For a given first sample, when multiple second samples correspond to that first sample, the pseudo labels of those second samples may also be generated based on the probability that the first sample generates a pseudo label as follows. A first sample corresponding to multiple second samples is referred to as a first target sample, and a random number is generated for each of its second samples, the random numbers of the second samples corresponding to the same first target sample being uniformly distributed. The pseudo label of each second target sample, i.e. each second sample whose random number is not greater than the target probability, is set to the first pseudo label, where the target probability is the probability that the first target sample generates a pseudo label; and the pseudo label of each third target sample, i.e. each second sample whose random number is greater than the target probability, is set to the second pseudo label.

The first pseudo label may be understood as the aforementioned preset pseudo label, and the second pseudo label as the aforementioned other pseudo label.

For example, assume the probability that a first target sample generates a pseudo label is 0.7, the number of its second samples is 10, the first pseudo label is 1 and the second pseudo label is 0. Random numbers are generated for the 10 second samples; among them, the second target samples whose random numbers are not greater than 0.7 receive pseudo label 1, and the third target samples whose random numbers are greater than 0.7 receive pseudo label 0, so that the expected proportion of second samples labeled 1 equals 0.7.
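The random-number method can be sketched as below. The function name and fixed seed are assumptions; the comparison convention used here (random number not greater than the target probability maps to the preset pseudo label 1) is the one consistent with the stated property that the proportion of preset pseudo labels equals the target probability.

```python
import random

# Sketch of the random-number method: draw one uniform random number per
# second sample; samples whose random number does not exceed the target
# probability receive the first (preset) pseudo label 1, the rest 0.

def assign_by_random_numbers(n_second, target_prob, rng=None):
    rng = rng or random.Random(42)
    randoms = [rng.random() for _ in range(n_second)]  # uniform on [0, 1)
    return [1 if r <= target_prob else 0 for r in randoms]

# For a large number of second samples, the share of pseudo label 1
# concentrates around the target probability 0.7.
pseudo = assign_by_random_numbers(10000, 0.7)
share_ones = sum(pseudo) / len(pseudo)
```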
In this way, the pseudo labels of the second samples are generated according to the probability that their first target sample generates a pseudo label, and the proportion of preset pseudo labels among the pseudo labels of the multiple second samples equals that probability.
Thus, the probability that each first sample generates a pseudo label is determined based on the plurality of first samples and the real labels, and the pseudo labels of the corresponding second samples are generated based on those probabilities, so that pseudo labels with a high positive correlation to the real labels of the first samples are generated without the feature screening apparatus directly acquiring those real labels.
Step 205, determining the importance of the candidate features in the corresponding first sample based on the second samples and the pseudo labels.
In an embodiment of the present disclosure, the respective importance of the plurality of candidate features may be determined as follows: training a second initial model with the feature values corresponding to the plurality of candidate features included in the plurality of second samples and the pseudo labels of those second samples to obtain a trained second target model, wherein the second target model learns the importance of the plurality of candidate features during and/or after training; and inputting at least one second sample into the second target model to obtain the importance of the plurality of candidate features in the corresponding first samples.
The second initial model may be any model that can both predict the category of a sample and determine the importance of the features used during training, such as a tree model, which is not limited by the present disclosure.
In the embodiment of the present disclosure, the at least one second sample input into the second target model may be a sample of second samples used when the second initial model is trained, or may be a sample different from the second sample used when the second initial model is trained, which is not limited by the present disclosure.
Taking the case where the at least one second sample input into the second target model differs from the second samples used in training: in the embodiment of the present disclosure, all the second samples may be divided into a training set and a verification set, and the second initial model may be trained with the feature values of the plurality of candidate features included in each second sample of the training set and the pseudo label of each such second sample, to obtain the trained second target model. During training, the second initial model learns not only how to predict the category to which a second sample belongs, but also which candidate features contribute more to its predictions and which contribute less, so that the second target model learns the importance of the plurality of candidate features during and/or after training. When at least one second sample of the verification set is then input into the second target model, the model outputs, along with its prediction of the category to which each such second sample belongs, the importance scores of the candidate features, thereby yielding the importance of the candidate features in the first samples corresponding to the second samples.
When the second initial model is trained, deep learning may likewise be used, since deep learning tends to perform better on large data sets than other machine learning methods. The process of training the second initial model to obtain the trained second target model may refer to the related art, and is not described here again.
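As a stand-in for the second target model's learned importance scores, the sketch below replaces the tree model with per-feature decision stumps: a candidate feature's importance is the best accuracy a single threshold on that feature achieves against the pseudo labels. This is purely illustrative and is not the disclosed model; all names and data are assumptions.

```python
# Illustrative stand-in for feature importance: score each candidate
# feature by the best pseudo-label accuracy of a one-feature threshold.

def stump_importance(rows, pseudo_labels):
    """rows: list of per-second-sample feature-value lists;
    returns one importance score per candidate feature."""
    n_features = len(rows[0])
    scores = []
    for j in range(n_features):
        values = sorted({row[j] for row in rows})
        best = 0.0
        for t in values:  # try each observed value as a threshold
            preds = [1 if row[j] > t else 0 for row in rows]
            acc = sum(p == y for p, y in zip(preds, pseudo_labels)) / len(rows)
            best = max(best, acc, 1.0 - acc)  # allow either polarity
        scores.append(best)
    return scores

# Feature 0 separates the pseudo labels perfectly; feature 1 is noise,
# so feature 0 receives the higher importance score.
rows = [[0.1, 5.0], [0.2, 1.0], [0.9, 5.0], [0.8, 1.0]]
pseudo = [0, 0, 1, 1]
importance = stump_importance(rows, pseudo)
```

In practice the same interface is exposed by tree ensembles (e.g. a trained scikit-learn `RandomForestClassifier` has a `feature_importances_` attribute), which is closer to the tree model the disclosure mentions.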
Therefore, the importance of the candidate features can be accurately determined by adopting a model capable of sequencing the features based on the second samples and the pseudo labels.
And step 206, screening the candidate features according to the importance degree to obtain the target feature.
The specific implementation process and principle of step 206 may refer to the description of the foregoing embodiments, and are not described herein again.
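Step 206 can be sketched as a simple ranking-and-cut over the importance scores. The feature names, scores and the top-k rule are assumptions for illustration; a score threshold could serve equally well.

```python
# Sketch of step 206: keep the candidate features with the highest
# importance as the target features.

def select_target_features(candidate_names, importances, k):
    ranked = sorted(zip(candidate_names, importances),
                    key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]

# Keep the top 2 of three hypothetical candidate features.
target = select_target_features(["age", "income", "region"],
                                [0.15, 0.60, 0.25], k=2)
```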
The feature screening method of the embodiment of the present disclosure obtains a plurality of first samples, each including feature values corresponding to a plurality of candidate features and having a corresponding real label, and obtains at least one second sample corresponding to each first sample, where each second sample has the same feature values as its first sample. It determines, based on the plurality of first samples and the real labels, the probability that each first sample generates a pseudo label; generates the pseudo labels of the corresponding second samples based on those probabilities; and determines the importance of the plurality of candidate features in the corresponding first samples based on the plurality of second samples and the pseudo labels. Target features that strongly influence the performance of the wind control model are thereby screened from the plurality of candidate features, so that training the wind control model with the target features can improve its performance. Moreover, target features suited to different scenarios can be screened out, so that the wind control model of each scenario is trained with the target features best suited to that scenario, further improving its performance in that scenario.
The feature screening apparatus provided in the present disclosure is explained below with reference to fig. 3.
Fig. 3 is a schematic structural view of a feature screening apparatus according to a third embodiment of the present disclosure.
As shown in fig. 3, the present disclosure provides a feature screening apparatus 300, including: a first obtaining module 301, a second obtaining module 302, a generating module 303, a determining module 304, and a screening module 305.
The first obtaining module 301 is configured to obtain a plurality of first samples, where the first samples include feature values corresponding to a plurality of candidate features, and the first samples have corresponding real labels;
the second obtaining module 302 is configured to obtain at least one second sample corresponding to each of the plurality of first samples, where the second sample and the corresponding first sample have the same feature values;
the generating module 303 is configured to generate pseudo labels corresponding to the second samples based on the plurality of first samples and the real labels;
the determining module 304 is configured to determine the importance of the plurality of candidate features in the corresponding first samples based on the plurality of second samples and the pseudo labels;
the screening module 305 is configured to screen the plurality of candidate features according to the importance to obtain target features.
It should be noted that the feature screening apparatus 300 provided in this embodiment can perform the feature screening method of the foregoing embodiments. The feature screening apparatus 300 may be implemented by software and/or hardware, and may be configured in an electronic device, where the electronic device may include, but is not limited to, a terminal device such as a smartphone or a computer, a server, and the like; this embodiment does not specifically limit the electronic device.
It should be noted that the foregoing description of the embodiment of the feature screening method is also applicable to the feature screening apparatus provided in the present disclosure, and is not repeated herein.
The feature screening apparatus provided by the embodiment of the disclosure obtains a plurality of first samples, where each first sample includes feature values corresponding to a plurality of candidate features and has a corresponding real label; obtains at least one second sample corresponding to each of the plurality of first samples, where each second sample has the same feature values as its corresponding first sample; generates pseudo labels corresponding to the second samples based on the plurality of first samples and the real labels; determines the importance of the plurality of candidate features in the corresponding first samples based on the plurality of second samples and the pseudo labels; and screens the plurality of candidate features according to the importance to obtain target features. In this way, target features that strongly influence the performance of a risk control model are screened out of the plurality of candidate features, so that training the risk control model with the target features can improve its performance. Moreover, target features suited to different scenarios can be screened out, so that the risk control model of a given scenario is trained with the target features best suited to that scenario, further improving the performance of the risk control model in the corresponding scenario.
The feature screening apparatus provided in the present disclosure is further described below with reference to fig. 4.
Fig. 4 is a schematic structural view of a feature screening apparatus according to a fourth embodiment of the present disclosure.
As shown in fig. 4, the feature screening apparatus 400 may specifically include: a first obtaining module 401, a second obtaining module 402, a generating module 403, a determining module 404, and a screening module 405. The first obtaining module 401, the second obtaining module 402, the generating module 403, the determining module 404, and the screening module 405 in fig. 4 have the same functions and structures as the first obtaining module 301, the second obtaining module 302, the generating module 303, the determining module 304, and the screening module 305 in fig. 3.
In an embodiment of the present disclosure, the generating module 403 includes:
a determining submodule 4031, configured to determine, based on the plurality of first samples and the corresponding real labels, the probability that each corresponding first sample generates a pseudo label;
a generating submodule 4032 configured to generate a pseudo label corresponding to the second sample based on a probability that the plurality of first samples generate the pseudo label.
In an embodiment of the present disclosure, the determining sub-module 4031 includes:
a first processing unit, configured to divide the plurality of first samples into a training set, a verification set, and a test set, and determine the quantity proportion of the first samples in the training set, the verification set, and the test set;
a training unit, configured to train a first initial model in an encrypted environment by using feature values corresponding to at least one first feature included in the plurality of first samples in the training set and the real labels corresponding to the plurality of first samples, to obtain a trained first target model, where the first feature is screened from the plurality of candidate features;
a second processing unit, configured to input the plurality of first samples into the first target model respectively, to obtain the confidences of the classes to which the corresponding first samples belong;
a first determining unit, configured to determine, for any one of the training set, the verification set, and the test set, the prediction accuracy of the first target model for the classes to which the samples of that set belong, according to the confidences of at least one class to which the first samples in the set belong and the corresponding real labels;
and a second determining unit, configured to determine the probability that the corresponding first sample generates a pseudo label according to the confidences of the classes to which the plurality of first samples belong, the prediction accuracy, and the quantity proportion.
In an embodiment of the present disclosure, the second determining unit includes:
a first processing subunit, configured to, for at least one first sample included in any one of the training set, the verification set, or the test set, take the prediction accuracy of the first target model for the class to which the samples of that set belong as the prediction accuracy of the class to which the at least one first sample in the set belongs;
and a first determining subunit, configured to determine the probability that the corresponding first sample generates a pseudo label according to the prediction accuracy of the class to which the at least one first sample in the set belongs, the confidence, and the quantity proportion.
In an embodiment of the present disclosure, the second determining unit includes:
a second processing subunit, configured to, for the plurality of first samples, take the prediction accuracy of the first target model on the training set as the first prediction accuracy corresponding to the first samples in the training set;
a third processing subunit, configured to take the prediction accuracy of the first target model on the verification set as the second prediction accuracy corresponding to the first samples in the verification set;
a fourth processing subunit, configured to take the prediction accuracy of the first target model on the test set as the third prediction accuracy corresponding to the first samples in the test set;
and a second determining subunit, configured to determine the probability that the corresponding first sample generates a pseudo label according to the first prediction accuracy, the second prediction accuracy, the third prediction accuracy, the confidences, and the quantity proportion of the plurality of first samples.
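The determining sub-module described above (splitting the first samples into the three sets, training the first target model, and combining per-sample confidence with per-set prediction accuracy and quantity proportion) can be sketched as follows. The embodiment does not specify the first initial model or the exact combination formula, so the nearest-centroid classifier and the product `confidence × accuracy × proportion` below are illustrative assumptions only, not the patented implementation.

```python
import numpy as np

def pseudo_label_probabilities(X, y, seed=0):
    """Illustrative sketch: derive, for each first sample, a probability of
    generating a pseudo label from its confidence, its set's prediction
    accuracy, and the set's quantity proportion (hypothetical combination)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    order = rng.permutation(n)
    cut1, cut2 = int(0.6 * n), int(0.8 * n)  # 60/20/20 split of the first samples
    sets = {"train": order[:cut1], "val": order[cut1:cut2], "test": order[cut2:]}
    prop = {name: len(ids) / n for name, ids in sets.items()}  # quantity proportion
    # Stand-in "first target model": class centroids fitted on the training set only.
    classes = np.unique(y)
    X_tr, y_tr = X[sets["train"]], y[sets["train"]]
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    score = np.exp(-dist)
    conf = score.max(axis=1) / score.sum(axis=1)  # confidence of the predicted class
    pred = classes[np.argmin(dist, axis=1)]
    probs = np.empty(n)
    for name, ids in sets.items():
        acc = float((pred[ids] == y[ids]).mean())  # per-set prediction accuracy
        probs[ids] = conf[ids] * acc * prop[name]  # hypothetical combination formula
    return probs
```

Because confidence, accuracy, and proportion each lie in [0, 1], the combined value is itself a valid probability, which the generating sub-module can then compare against uniform random draws.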
In an embodiment of the present disclosure, a first target sample of the plurality of first samples corresponds to a plurality of second samples; the generating submodule includes:
a generating unit, configured to generate random numbers for the plurality of second samples corresponding to the first target sample, respectively, wherein the random numbers of the plurality of second samples corresponding to the same first target sample are uniformly distributed;
a third determining unit, configured to determine, as a first pseudo label, the pseudo label of a second target sample among the plurality of second samples corresponding to the first target sample, wherein the random number corresponding to the second target sample is greater than a target probability, and the target probability is the probability that the first target sample generates a pseudo label;
and a fourth determining unit, configured to determine, as a second pseudo label, the pseudo label of a third target sample among the plurality of second samples corresponding to the first target sample, wherein the random number corresponding to the third target sample is not greater than the target probability.
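The generating unit and the third and fourth determining units amount to a simple sampling rule over uniformly distributed draws. A minimal sketch, assuming binary pseudo-label values `1` and `0` for the first and second pseudo labels (the embodiment leaves the concrete label values open):

```python
import numpy as np

def assign_pseudo_labels(target_prob, n_second, first_label=1, second_label=0, seed=None):
    """For one first target sample, draw a uniform random number for each of
    its second samples; draws greater than the target probability (the
    probability that the first target sample generates a pseudo label) receive
    the first pseudo label, and the remaining draws receive the second."""
    rng = np.random.default_rng(seed)
    r = rng.uniform(0.0, 1.0, size=n_second)  # one uniform draw per second sample
    return np.where(r > target_prob, first_label, second_label)
```

For example, `assign_pseudo_labels(0.7, 5, seed=0)` labels each of five second samples by comparing its own draw against the target probability 0.7, so a higher target probability yields more second pseudo labels.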
In an embodiment of the present disclosure, the determining module 404 includes:
the training sub-module 4041 is configured to train the second initial model by using feature values corresponding to a plurality of candidate features included in the plurality of second samples and pseudo labels of the plurality of second samples, so as to obtain a trained second target model; wherein the second target model has learned the importance of the plurality of candidate features during and/or after training;
the processing sub-module 4042 is configured to input at least one second sample into the second target model to obtain importance of the plurality of candidate features in the corresponding first sample.
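The training and processing sub-modules fit a model that exposes learned feature importance on the second samples and their pseudo labels, then read the importance back out. Since the second initial model is not specified, the sketch below substitutes a plain logistic regression trained by gradient descent and uses the absolute learned weights as importance; both choices are illustrative assumptions, not the patented model.

```python
import numpy as np

def second_model_importance(X2, pseudo_y, epochs=300, lr=0.1):
    """Fit a stand-in "second target model" (logistic regression via gradient
    descent) on second samples X2 and their pseudo labels, and report the
    importance of each candidate feature as its absolute learned weight."""
    Xs = (X2 - X2.mean(axis=0)) / (X2.std(axis=0) + 1e-9)  # standardize features
    w = np.zeros(Xs.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w + b)))  # sigmoid predictions
        w -= lr * Xs.T @ (p - pseudo_y) / len(Xs)
        b -= lr * float(np.mean(p - pseudo_y))
    return np.abs(w)  # importance learned during training

def screen_features(importance, k):
    """Screening module: keep the indices of the top-k features by importance."""
    return np.argsort(importance)[::-1][:k]
```

With a pseudo label driven mainly by one candidate feature, that feature's weight dominates after training, and `screen_features(importance, k)` returns it among the target features.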
It should be noted that the foregoing description of the embodiment of the feature screening method is also applicable to the feature screening apparatus provided in the present disclosure, and is not repeated herein.
The feature screening apparatus provided by the embodiment of the disclosure obtains a plurality of first samples, where each first sample includes feature values corresponding to a plurality of candidate features and has a corresponding real label; obtains at least one second sample corresponding to each of the plurality of first samples, where each second sample has the same feature values as its corresponding first sample; generates pseudo labels corresponding to the second samples based on the plurality of first samples and the real labels; determines the importance of the plurality of candidate features in the corresponding first samples based on the plurality of second samples and the pseudo labels; and screens the plurality of candidate features according to the importance to obtain target features. In this way, target features that strongly influence the performance of a risk control model are screened out of the plurality of candidate features, so that training the risk control model with the target features can improve its performance. Moreover, target features suited to different scenarios can be screened out, so that the risk control model of a given scenario is trained with the target features best suited to that scenario, further improving the performance of the risk control model in the corresponding scenario.
Based on the above embodiment, the present disclosure also provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the feature screening method of the present disclosure.
Based on the above embodiments, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the feature screening method disclosed in the embodiments of the present disclosure.
Based on the above embodiments, the present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the feature screening method of the present disclosure.
According to an embodiment of the present disclosure, an electronic device and a readable storage medium and a computer program product are also provided.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the electronic device 500 may include a computing unit 501 that may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 502 or a computer program loaded from a storage unit 508 into a random access memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the ROM 502, and the RAM 503 are connected to one another by a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
A number of components in the device 500 are connected to the I/O interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 501 performs the various methods and processes described above, such as the feature screening method. For example, in some embodiments, the feature screening method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 500 via the ROM 502 and/or the communication unit 509. When the computer program is loaded into the RAM 503 and executed by the computing unit 501, one or more steps of the feature screening method described above may be performed. Alternatively, in other embodiments, the computing unit 501 may be configured to perform the feature screening method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server (also called a cloud computing server or cloud host), a host product in a cloud computing service system that remedies the drawbacks of high management difficulty and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A method of feature screening, comprising:
obtaining a plurality of first samples, wherein the first samples comprise feature values corresponding to a plurality of candidate features, and the first samples have corresponding real labels;
obtaining at least one second sample corresponding to the plurality of first samples respectively, wherein the second sample and the corresponding first sample have the same characteristic value;
generating a pseudo label corresponding to a second sample based on a plurality of the first samples and the real labels;
determining the importance of a plurality of candidate features in the corresponding first sample based on a plurality of the second samples and the pseudo labels;
and screening the candidate characteristics according to the importance degree to obtain target characteristics.
2. The method of claim 1, wherein said generating a pseudo label corresponding to a second exemplar based on a plurality of said first exemplars and said real labels comprises:
determining a probability that a corresponding first sample generates a pseudo label based on a plurality of the first samples and the corresponding real labels;
and generating a pseudo label corresponding to a second sample based on the probability of generating the pseudo label by a plurality of first samples.
3. The method of claim 2, wherein said determining, based on a plurality of said first exemplars and said real label, a probability that a corresponding first exemplar generates a pseudo label comprises:
dividing a plurality of first samples into a training set, a verification set and a test set, and determining the quantity proportion of the first samples in the training set, the verification set and the test set;
training a first initial model in an encrypted environment by using a feature value corresponding to at least one first feature included in a plurality of first samples in the training set and the real labels corresponding to the plurality of first samples to obtain a trained first target model, wherein the first feature is screened from a plurality of candidate features;
respectively inputting the plurality of first samples into the first target model to obtain the confidence degrees of the corresponding first samples;
respectively determining the prediction accuracy of the first target model to the class to which the samples of the corresponding set belong according to the confidence degree of at least one class to which the first sample belongs and the corresponding real label included in any one of the training set, the verification set and the testing set;
and determining the probability of generating the pseudo label corresponding to the first sample according to the confidence coefficient of the category to which the plurality of first samples belong, the prediction accuracy and the quantity proportion.
4. The method of claim 3, wherein the determining the probability of generating the pseudo label for the first sample according to the confidence levels of the classes to which the plurality of first samples belong, the prediction accuracy and the quantity ratio comprises:
for at least one first sample included in any one of the training set, the verification set and the test set, taking the prediction accuracy of the first target model to the class to which the samples of the set belong as the prediction accuracy of at least one first sample in the corresponding set;
and determining the probability of generating the pseudo label corresponding to the first sample according to the prediction accuracy, the confidence coefficient and the quantity proportion of the category to which at least one first sample in the set belongs.
5. The method of claim 3, wherein the determining the probability of generating the pseudo label for the first sample according to the confidence levels of the classes to which the plurality of first samples belong, the prediction accuracy and the quantity ratio comprises:
for a plurality of first samples, taking the prediction accuracy of the first target model to the training set as a first prediction accuracy of the corresponding first sample in the training set;
taking the prediction accuracy of the first target model to the verification set as a second prediction accuracy of a corresponding first sample in the verification set;
taking the prediction accuracy of the first target model to the test set as a third prediction accuracy of a corresponding first sample in the test set;
and determining the probability of generating the pseudo label corresponding to the first sample according to the first prediction accuracy, the second prediction accuracy, the third prediction accuracy, the confidence coefficient and the quantity proportion of the plurality of first samples.
6. The method of claim 2, wherein a first target sample of the plurality of first samples corresponds to the plurality of second samples; the generating a pseudo label corresponding to a second sample based on the probability of generating a pseudo label for a plurality of the first samples comprises:
respectively generating random numbers of a plurality of second samples corresponding to the first target sample; wherein the random numbers of the plurality of second samples corresponding to the same first target sample are uniformly distributed;
determining a pseudo label of a second target sample in a plurality of second samples corresponding to the first target sample as a first pseudo label, wherein a random number corresponding to the second target sample is greater than a target probability, and the target probability is a probability of generating the pseudo label by the first target sample;
and determining a pseudo label of a third target sample in the plurality of second samples corresponding to the first target sample as a second pseudo label, wherein the random number corresponding to the third target sample is not greater than the target probability.
7. The method of any of claims 1-6, wherein said determining the importance of the plurality of candidate features in the corresponding first sample based on the plurality of second samples and the pseudo label comprises:
training a second initial model by using feature values corresponding to the candidate features included in the second samples and pseudo labels of the second samples to obtain a trained second target model; wherein the second target model has learned the importance of a plurality of the candidate features during and/or after training;
and inputting at least one second sample into the second target model to obtain the importance of a plurality of candidate features in the corresponding first sample.
8. A feature screening apparatus comprising:
the device comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a plurality of first samples, the first samples comprise characteristic values corresponding to a plurality of candidate characteristics, and the first samples have corresponding real labels;
the second acquisition module is used for acquiring at least one second sample corresponding to the plurality of first samples respectively, wherein the second sample and the corresponding first sample have the same characteristic value;
a generating module, configured to generate a pseudo label corresponding to a second sample based on the plurality of first samples and the real labels;
a determining module, configured to determine importance of the candidate features in the corresponding first sample based on the plurality of second samples and the pseudo label;
and the screening module is used for screening the candidate features according to the importance degree so as to obtain the target features.
9. The apparatus of claim 8, wherein the generating means comprises:
a determining submodule, configured to determine, based on the plurality of first samples and the corresponding real labels, a probability that a corresponding first sample generates a pseudo label;
and the generation submodule is used for generating a pseudo label corresponding to the second sample based on the probability of generating the pseudo label by the plurality of first samples.
10. The apparatus of claim 9, wherein the determination submodule comprises:
the first processing unit is used for dividing a plurality of first samples into a training set, a verification set and a test set and determining the quantity proportion of the first samples in the training set, the verification set and the test set;
a training unit, configured to train a first initial model in an encrypted environment by using a feature value corresponding to at least one first feature included in a plurality of first samples in the training set and the real tags corresponding to the plurality of first samples, so as to obtain a trained first target model, where the first feature is screened from the plurality of candidate features;
the second processing unit is used for respectively inputting the plurality of first samples into the first target model so as to obtain the confidence degrees of the corresponding first samples;
a first determining unit, configured to respectively determine, according to a confidence of at least one class to which the first samples belong and the corresponding real labels included in any one of the training set, the verification set, and the test set, the prediction accuracy of the first target model for the class to which the samples of the corresponding set belong;
and the second determining unit is used for determining the probability of generating the pseudo label corresponding to the first sample according to the confidence degrees of the types of the plurality of first samples, the prediction accuracy and the quantity proportion.
11. The apparatus of claim 10, wherein the second determining unit comprises:
a first processing subunit, configured to, for at least one first sample included in any one of the training set, the verification set, or the test set, use a prediction accuracy of the first target model for a class to which samples of the set belong as a prediction accuracy of a class to which at least one first sample in a corresponding set belongs;
and the first determining subunit is used for determining the probability of generating the pseudo label corresponding to the first sample according to the prediction accuracy, the confidence coefficient and the quantity proportion of the category to which at least one first sample in the set belongs.
12. The apparatus of claim 10, wherein the second determining unit comprises:
the second processing subunit is configured to, for a plurality of first samples, use the prediction accuracy of the first target model to the training set as a first prediction accuracy of a corresponding first sample in the training set;
the third processing subunit is configured to use the prediction accuracy of the first target model to the verification set as a second prediction accuracy of a corresponding first sample in the verification set;
a fourth processing subunit, configured to use the prediction accuracy of the first target model on the test set as a third prediction accuracy of a corresponding first sample in the test set;
and the second determining subunit is used for determining the probability of generating the pseudo label corresponding to the first sample according to the first prediction accuracy, the second prediction accuracy, the third prediction accuracy, the confidence coefficient and the quantity proportion of the plurality of first samples.
13. The apparatus of claim 9, wherein a first target sample of the plurality of first samples corresponds to a plurality of second samples, and the generating submodule comprises:
a generating unit, configured to generate a random number for each of the plurality of second samples corresponding to the first target sample, wherein the random numbers of the plurality of second samples corresponding to the same first target sample obey a uniform distribution;
a third determining unit, configured to determine, as a first pseudo label, the pseudo label of a second target sample, among the plurality of second samples corresponding to the first target sample, whose random number is greater than a target probability, the target probability being the probability that the first target sample generates a pseudo label;
and a fourth determining unit, configured to determine, as a second pseudo label, the pseudo label of a third target sample, among the plurality of second samples corresponding to the first target sample, whose random number is not greater than the target probability.
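The random-number comparison described in claim 13 can be sketched as follows. This is a non-authoritative illustration: the function name, parameters, and label values are hypothetical, and only the comparison rule (uniform random number versus target probability) comes from the claim.

```python
import random

# Hypothetical sketch of claim 13's generating/determining units: for one
# first target sample, draw a uniform random number per corresponding second
# sample; samples whose number exceeds the target probability receive the
# first pseudo label, the remainder the second pseudo label.

def assign_pseudo_labels(num_second_samples: int, target_probability: float,
                         first_label, second_label, seed=None):
    rng = random.Random(seed)  # seed is illustrative, for reproducibility
    labels = []
    for _ in range(num_second_samples):
        r = rng.uniform(0.0, 1.0)  # uniformly distributed random number
        labels.append(first_label if r > target_probability else second_label)
    return labels
```

With this rule, a higher target probability for a first target sample makes its second samples more likely to receive the second pseudo label.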
14. The apparatus of any one of claims 8-13, wherein the means for determining comprises:
the training submodule is configured to train a second initial model using the feature values of the candidate features included in the plurality of second samples and the pseudo labels of the plurality of second samples, to obtain a trained second target model, wherein the second target model learns the importance of the plurality of candidate features during and/or after training;
and the processing submodule is configured to input at least one second sample into the second target model to obtain the importance of the candidate features in the corresponding first sample.
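Claim 14 does not name the second model, so the sketch below substitutes a deliberately simple stand-in: the absolute correlation between each candidate feature and the pseudo labels serves as the learned "importance". The function and its interface are assumptions for illustration, not the patented model.

```python
# Stand-in sketch for claim 14's importance computation. In practice the
# second target model could be any model exposing feature importances
# (e.g., a gradient-boosted tree); here, absolute Pearson correlation
# between each feature column and the 0/1 pseudo labels is used instead.

def feature_importance(samples, pseudo_labels):
    """samples: list of equal-length feature-value lists.
    pseudo_labels: list of 0/1 pseudo labels, one per sample.
    Returns one importance score in [0, 1] per candidate feature."""
    n = len(samples)
    n_features = len(samples[0])
    mean_y = sum(pseudo_labels) / n
    var_y = sum((y - mean_y) ** 2 for y in pseudo_labels)
    importances = []
    for j in range(n_features):
        col = [s[j] for s in samples]
        mean_x = sum(col) / n
        cov = sum((x - mean_x) * (y - mean_y)
                  for x, y in zip(col, pseudo_labels))
        var_x = sum((x - mean_x) ** 2 for x in col)
        denom = (var_x * var_y) ** 0.5
        importances.append(abs(cov / denom) if denom > 0 else 0.0)
    return importances
```

A feature that perfectly tracks the pseudo labels scores 1.0; one that is independent of them scores near 0.0, which is the ordering a feature-screening step would exploit.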
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method of any one of claims 1-7.
CN202210146248.3A 2022-02-17 2022-02-17 Feature screening method and device, electronic equipment and storage medium Active CN114565030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210146248.3A CN114565030B (en) 2022-02-17 2022-02-17 Feature screening method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114565030A true CN114565030A (en) 2022-05-31
CN114565030B CN114565030B (en) 2022-12-20

Family

ID=81714427

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210146248.3A Active CN114565030B (en) 2022-02-17 2022-02-17 Feature screening method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114565030B (en)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190163666A1 (en) * 2017-11-29 2019-05-30 International Business Machines Corporation Assessment of machine learning performance with limited test data
CN110732139A (en) * 2019-10-25 2020-01-31 腾讯科技(深圳)有限公司 Training method of detection model and detection method and device of user data
CN110738527A (en) * 2019-10-17 2020-01-31 中国建设银行股份有限公司 feature importance ranking method, device, equipment and storage medium
CN113191824A (en) * 2021-05-24 2021-07-30 北京大米科技有限公司 Data processing method and device, electronic equipment and readable storage medium
CN113326900A (en) * 2021-06-30 2021-08-31 深圳前海微众银行股份有限公司 Data processing method and device of federal learning model and storage medium
CN113361603A (en) * 2021-06-04 2021-09-07 北京百度网讯科技有限公司 Training method, class recognition device, electronic device and storage medium
CN113408582A (en) * 2021-05-17 2021-09-17 支付宝(杭州)信息技术有限公司 Training method and device of feature evaluation model
CN113657465A (en) * 2021-07-29 2021-11-16 北京百度网讯科技有限公司 Pre-training model generation method and device, electronic equipment and storage medium
CN113705716A (en) * 2021-09-03 2021-11-26 北京百度网讯科技有限公司 Image recognition model training method and device, cloud control platform and automatic driving vehicle
CN113705425A (en) * 2021-08-25 2021-11-26 北京百度网讯科技有限公司 Training method of living body detection model, and method, device and equipment for living body detection
CN113821668A (en) * 2021-06-24 2021-12-21 腾讯科技(深圳)有限公司 Data classification identification method, device, equipment and readable storage medium
CN113868497A (en) * 2021-09-28 2021-12-31 绿盟科技集团股份有限公司 Data classification method and device and storage medium
CN114021642A (en) * 2021-11-02 2022-02-08 北京百度网讯科技有限公司 Data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114565030B (en) 2022-12-20

Similar Documents

Publication Publication Date Title
CN110992169A (en) Risk assessment method, device, server and storage medium
CN111340616B (en) Method, device, equipment and medium for approving online loan
CN113360580B (en) Abnormal event detection method, device, equipment and medium based on knowledge graph
US11250433B2 (en) Using semi-supervised label procreation to train a risk determination model
CN110633991A (en) Risk identification method and device and electronic equipment
CN112580733B (en) Classification model training method, device, equipment and storage medium
CN112990294B (en) Training method and device of behavior discrimination model, electronic equipment and storage medium
CN114494784A (en) Deep learning model training method, image processing method and object recognition method
CN113657269A (en) Training method and device for face recognition model and computer program product
CN109034199B (en) Data processing method and device, storage medium and electronic equipment
CN113705362A (en) Training method and device of image detection model, electronic equipment and storage medium
CN113191261A (en) Image category identification method and device and electronic equipment
CN111815435A (en) Visualization method, device, equipment and storage medium for group risk characteristics
CN116468479A (en) Method for determining page quality evaluation dimension, and page quality evaluation method and device
CN114565030B (en) Feature screening method and device, electronic equipment and storage medium
CN110544166A (en) Sample generation method, device and storage medium
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN111859985B (en) AI customer service model test method and device, electronic equipment and storage medium
CN112541557B (en) Training method and device for generating countermeasure network and electronic equipment
CN115545088A (en) Model construction method, classification method and device and electronic equipment
CN110570301B (en) Risk identification method, device, equipment and medium
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium
CN113807391A (en) Task model training method and device, electronic equipment and storage medium
CN113033431A (en) Optical character recognition model training and recognition method, device, equipment and medium
CN114547448B (en) Data processing method, model training method, device, equipment, storage medium and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant