WO2021174723A1

WO2021174723A1 - Training sample expansion method and apparatus, electronic device, and storage medium

Info

Publication number: WO2021174723A1
Application number: PCT/CN2020/098246
Authority: WO
Inventors: 朱昭苇; 孙行智; 胡岗
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-03-02
Filing date: 2020-06-24
Publication date: 2021-09-10
Also published as: CN111461168A

Abstract

A training sample expansion method and apparatus, an electronic device, and a storage medium. The method comprises: determining, when the number of the samples of target disease types is less than a preset quantity threshold value, the samples as target samples; performing vector conversion on a disease name corresponding to each target sample so as to obtain a name vector; according to a pre-trained first disease classification model, based on the precision of the first disease classification model and the gradient change of a discrimination network, training a generation network to obtain a trained generation model; inputting the name vector into the trained generation model so as to obtain a generation sample data set; and if a plurality of generation samples in the generation sample data set may be used for model training, determining a real sample data set and the generation sample data set as a first training sample data set of an auxiliary diagnosis model. The method can expand the number of training samples, and improve the accuracy of the auxiliary diagnosis model.

Description

Training sample expansion method, device, electronic equipment and storage medium

This application claims priority to Chinese patent application No. 202010136917.X, filed with the Chinese Patent Office on March 2, 2020 and entitled "Training Sample Expansion Method, Device, Electronic Equipment, and Storage Medium", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of artificial intelligence technology, and in particular to a training sample expansion method, device, electronic equipment, and storage medium.

Background

At present, with the development of artificial intelligence technology, more and more auxiliary diagnosis models have emerged, and these models can provide great convenience for medical work. However, the inventor realizes that, because some types of diseases are relatively rare, the sample data set used to train an auxiliary diagnosis model may contain only a small number of disease symptom samples for a certain disease type. Training the auxiliary diagnosis model with such a small number of disease symptom samples leads to low accuracy of the trained model.

Therefore, how to expand the number of training samples to improve the accuracy of the auxiliary diagnosis model is a technical problem that needs to be solved urgently.

Summary of the invention

In view of the above, it is necessary to provide a training sample expansion method, device, electronic equipment, and storage medium, which can expand the number of training samples to improve the accuracy of the auxiliary diagnosis model.

The first aspect of the present application provides a training sample expansion method, the method includes:

When the auxiliary diagnosis model needs to be trained, a real sample data set is obtained, where the real sample data set is composed of samples of multiple disease types, and each of the disease types includes at least one disease symptom;

When the number of samples of the target disease type in the samples of the multiple disease types is less than the preset number threshold, determining the sample of the target disease type as the target sample;

Using a pre-trained conversion network, vector conversion is performed on the disease name corresponding to the target sample to obtain a name vector;

Training the generation network according to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain the trained generation model;

Inputting the name vector into the trained generation model to obtain a generated sample data set, the disease type of the multiple generated samples included in the generated sample data set is consistent with the target disease type;

Using the first disease classification model to determine whether multiple generated samples in the generated sample data set can be used for model training according to the accuracy of the first disease classification model;

If multiple generated samples in the generated sample data set can be used for model training, the real sample data set and the generated sample data set are determined as the first training sample data set of the auxiliary diagnosis model.

A second aspect of the present application provides an electronic device including a processor and a memory, and the processor is configured to execute computer-readable instructions stored in the memory to implement the steps of the training sample expansion method described in the first aspect.

A third aspect of the present application provides a computer-readable storage medium having at least one computer-readable instruction stored thereon, and the at least one computer-readable instruction is executed by a processor to implement the steps of the training sample expansion method described in the first aspect.

A fourth aspect of the present application provides a training sample expansion device, the device includes:

The obtaining module is used to obtain a real sample data set when it is necessary to train an auxiliary diagnosis model, wherein the real sample data set is composed of samples of multiple disease types, and each of the samples of the disease type includes at least one disease symptom;

The determining module is configured to determine the sample of the target disease type as the target sample when the number of samples of the target disease type in the samples of the multiple disease types is less than a preset number threshold;

The conversion module is used to perform vector conversion of the disease name corresponding to the target sample through a pre-trained conversion network to obtain a name vector;

The training module is used to train the generation network based on the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model;

The input module is configured to input the name vector into the trained generation model to obtain a generated sample data set, the disease type of the multiple generated samples included in the generated sample data set is consistent with the target disease type;

A judging module, configured to use the first disease classification model to determine whether multiple generated samples in the generated sample data set can be used for model training according to the accuracy of the first disease classification model;

The determining module is further configured to, if multiple generated samples in the generated sample data set can be used for model training, determine the real sample data set and the generated sample data set as the first training sample data set of the auxiliary diagnosis model.

As can be seen from the above technical solutions, in this application the target samples, which are few in number, can be determined first. The generation network is then trained according to the first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model. The generation model is used to generate multiple generated samples consistent with the target disease type, thereby increasing the number of samples of the target disease type, and the first disease classification model is used to determine whether the multiple generated samples can be used for model training. If they can, the multiple generated samples are added to the training sample data set, which expands the number of samples used to train the auxiliary diagnosis model and improves its accuracy. This application can be applied to digital medical fields such as smart medical care, precision medical care, and AI+ medical care, and can promote the development of digital medical care.

Description of the drawings

FIG. 1 is a flowchart of a preferred embodiment of a training sample expansion method disclosed in the present application.

FIG. 2 is a functional block diagram of a preferred embodiment of a training sample expansion device disclosed in the present application.

FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the training sample expansion method according to the present application.

Detailed description

The training sample expansion method of the embodiment of the present application is applied to an electronic device, and can also be applied to a hardware environment composed of an electronic device and a server connected to the electronic device through a network, and is executed by the server and the electronic device. Networks include, but are not limited to: wide area networks, metropolitan area networks, or local area networks.

Please refer to FIG. 1, which is a flowchart of a preferred embodiment of a training sample expansion method disclosed in the present application. Depending on requirements, the order of the steps in the flowchart can be changed, and some steps can be omitted.

S11. When the auxiliary diagnosis model needs to be trained, the electronic device obtains a real sample data set, where the real sample data set is composed of samples of multiple disease types, and each of the disease types includes at least one disease symptom.

Wherein, the auxiliary diagnosis model may be a model for auxiliary disease diagnosis (for example, a disease classification model, etc.).

Wherein, the real sample data set may be real case data, and the samples of each disease type may be composed of disease names and corresponding symptom combinations.

S12. When the number of samples of the target disease type in the samples of the multiple disease types is less than a preset number threshold, the electronic device determines the samples of the target disease type as the target sample.

In the embodiment of this application, a quantity threshold can be set in advance. When the number of samples of a certain disease type is smaller than this quantity threshold, there are not enough samples, so an auxiliary diagnosis model trained with the samples of that disease type may not be accurate. Therefore, the samples of the disease type need to be expanded to increase their number, which can improve the accuracy of the trained auxiliary diagnosis model.
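
As a minimal sketch of step S12, the selection can be expressed as a count per disease type followed by a threshold comparison. The sample layout (a pair of disease name and symptom list), the function name select_target_samples, and the threshold value are illustrative assumptions and are not taken from this application.

```python
from collections import Counter


def select_target_samples(samples, threshold):
    """Return the samples whose disease type has fewer than `threshold` samples.

    `samples` is assumed to be a list of (disease_name, symptoms) pairs.
    """
    # Count how many samples each disease type has.
    counts = Counter(disease for disease, _ in samples)
    # Disease types below the preset quantity threshold are the target types.
    rare = {disease for disease, n in counts.items() if n < threshold}
    return [sample for sample in samples if sample[0] in rare]


# Hypothetical real sample data set: three "flu" samples and one "gout" sample.
real_samples = [
    ("flu", ["fever", "cough"]),
    ("flu", ["fever", "headache"]),
    ("flu", ["cough"]),
    ("gout", ["joint pain", "swelling"]),
]
target_samples = select_target_samples(real_samples, threshold=2)
```

With a threshold of 2, only the single "gout" sample is selected as a target sample in this toy data set.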

S13. The electronic device performs vector conversion on the disease name corresponding to the target sample through a pre-trained conversion network to obtain a name vector.

Wherein, the conversion network can convert words into a set of vector representations, and the conversion network can be obtained by CBOW (continuous-bag-of-words) training.

In the embodiment of this application, a pre-trained conversion network can be used to perform vector conversion on the disease name corresponding to the target sample to obtain a vector of the disease name (the name vector). For example, after vector conversion, "gout" is represented as [-0.124, -0.871, 0.812, -1.290, ...].
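
For orientation, CBOW learns the vector of a word by predicting that word from its surrounding context words. The sketch below only builds the (context, target) training pairs that such a network would consume. The token list, the window size, and the function name cbow_pairs are hypothetical, and this application does not specify the conversion network beyond CBOW training.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) pairs for CBOW training."""
    pairs = []
    for i, target in enumerate(tokens):
        # Context = up to `window` tokens on each side of the target.
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs


# Hypothetical tokenized sentence; the disease name "gout" is one whole token.
tokens = ["patient", "with", "gout", "reports", "joint", "pain"]
pairs = cbow_pairs(tokens, window=1)
```

A CBOW network trained on such pairs yields one vector per token, and the vector learned for the disease name token would serve as the name vector.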

S14. The electronic device trains the generation network based on the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model.

Wherein, the first disease classification model can output the disease type to which the disease symptoms belong according to the input disease symptoms.

Wherein, the generation network and the discrimination network jointly form a generative adversarial network (GAN), a deep learning framework that trains a generative model through an adversarial process. The purpose of training the generative adversarial network is to make the distribution of the generated samples as close as possible to that of the real samples, so that the generated samples can explain the real data. During training, a generative model G is trained to generate realistic samples from random noise or latent variables, while a discriminant model D is trained to distinguish real samples (i.e., input samples) from generated samples. The generative model G and the discriminant model D are trained at the same time, over multiple rounds, until a Nash equilibrium is reached, at which point the samples generated by G are indistinguishable from the real samples and D can no longer correctly tell them apart.

As an optional implementation manner, after step S13, the method further includes:

Determining the dimension of the name vector as the dimension of the input array of the generating network;

The number of all symptoms in the disease symptom relation database corresponding to the name vector is determined as the dimensional size of the output array of the generating network, and the preset value is determined as the value of the element of the output array of the generating network.

According to the pre-trained first disease classification model, training the generation network based on the accuracy of the first disease classification model and the gradient change of the discrimination network, and obtaining the trained generation model includes:

According to the pre-trained first disease classification model, and based on the accuracy of the first disease classification model and the gradient change of the discrimination network, training the generation network according to the dimension of the input array of the generation network, the dimension of the output array, and the values of the elements of the output array, to obtain a trained generation model.

In this optional implementation, after vector conversion is performed on the disease name corresponding to the target sample to obtain the name vector, the generation network and the discrimination network of the generative adversarial network can be preprocessed. First, the dimension of the input array of the generation network is specified to be the same as the dimension of the name vector produced by the conversion network. The dimension of the output array of the generation network is the number of all symptoms in the disease symptom relation database, that is, the number of elements in the output array equals the number of all symptoms in that database. A preset value is determined as the value of the elements of the output array; for example, each element may be restricted to 0 or 1. The parameters of the generation network can then be randomly initialized.
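
The shape constraints described above can be illustrated with a deliberately tiny generator. This is only a sketch under stated assumptions: the single-layer architecture, the sigmoid with a 0.5 cut-off, the class name TinyGenerator, the count of 10 symptoms, and the four-dimensional toy name vector are all hypothetical; the application fixes only the input dimension, the output dimension, the 0/1 element values, and the random initialization.

```python
import math
import random


class TinyGenerator:
    """One-layer generator: name vector in, 0/1 symptom array out (illustrative only)."""

    def __init__(self, name_vector_dim, num_symptoms, seed=0):
        rng = random.Random(seed)
        # Randomly initialize the parameters of the generation network.
        self.weights = [
            [rng.uniform(-1.0, 1.0) for _ in range(name_vector_dim)]
            for _ in range(num_symptoms)
        ]
        self.bias = [rng.uniform(-1.0, 1.0) for _ in range(num_symptoms)]

    def forward(self, name_vector):
        if len(name_vector) != len(self.weights[0]):
            raise ValueError("input dimension must equal the name vector dimension")
        output = []
        for row, b in zip(self.weights, self.bias):
            score = sum(w * x for w, x in zip(row, name_vector)) + b
            prob = 1.0 / (1.0 + math.exp(-score))
            # Restrict every output element to the preset values 0 or 1.
            output.append(1 if prob >= 0.5 else 0)
        return output


# Toy name vector (hypothetical four-dimensional example).
name_vector = [-0.124, -0.871, 0.812, -1.290]
generator = TinyGenerator(name_vector_dim=len(name_vector), num_symptoms=10)
symptom_array = generator.forward(name_vector)
```

The output always has one element per symptom in the database, regardless of the random parameter values.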

Optionally, the generative adversarial network may be trained as follows. First, the generation network generates a batch of fake data labeled 0. The fake data and the real data (labeled 1) are mixed and fed into the discrimination network, and the parameters of the discrimination network are updated according to the results. Next, the discrimination network is fixed, and the generation network again generates fake data, this time labeled 1, which is fed into the discrimination network together with the real data; the parameters of the generation network are updated according to the output of the discrimination network. These steps are repeated until the generation network and the discrimination network reach a Nash equilibrium.

Wherein, the input of the discrimination network is the sequence string of a combination of symptoms and disease, together with the label of that sequence (the labels 0 and 1 indicate that the sequence comes from the generation network and from the real data, respectively). For example, suppose the current system has 10 symptoms and 2 diseases, so the sequence string has length 10+2=12. A real symptom combination includes three symptoms located at positions 1, 3, and 5, and the disease corresponding to the symptom combination is located at position 1 of the disease part. The sequence string is then [1,0,1,0,1,0,0,0,0,0,1,0], and the input of the discrimination network is {[1,0,1,0,1,0,0,0,0,0,1,0],1}.
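
The encoding in this example can be reproduced with a short sketch. The function name encode_sequence and the use of a Python tuple for the (sequence, label) pair are illustrative assumptions; positions are treated as 1-based, with the disease positions following the symptom positions, as in the example.

```python
def encode_sequence(symptom_positions, disease_position, num_symptoms, num_diseases):
    """Encode a symptom combination and its disease as one 0/1 sequence string.

    Positions are 1-based; disease positions are counted within the disease part.
    """
    sequence = [0] * (num_symptoms + num_diseases)
    for pos in symptom_positions:
        sequence[pos - 1] = 1
    # The disease part starts right after the symptom part.
    sequence[num_symptoms + disease_position - 1] = 1
    return sequence


REAL, GENERATED = 1, 0  # labels for real data and for data from the generation network

sequence = encode_sequence([1, 3, 5], 1, num_symptoms=10, num_diseases=2)
discriminator_input = (sequence, REAL)
print(discriminator_input)
# ([1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0], 1)
```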

Wherein, the conversion network is trained using complete entity nouns as input, and the output value of the discrimination network is a floating-point number within a preset numerical range, which is used to measure how likely the input of the discrimination network is to be fake data. A complete entity noun retains the full meaning of the disease entity, which prevents the entity from being split and the meaning of the word from being destroyed. The output of the discrimination network can be a floating-point number between 0 and 1; the smaller the value, the more likely the discrimination network considers the input to be fake data.

Specifically, according to the pre-trained first disease classification model, training the generation network based on the accuracy of the first disease classification model and the gradient change of the discrimination network, and obtaining the trained generation model includes:

Using a generation network to generate multiple fake samples with disease types consistent with the target disease types;

Determining the plurality of fake samples and the second training sample data set as a fourth training sample data set;

Training the fourth training sample data set to obtain a third disease classification model;

Determining the third accuracy of the third disease classification model;

According to the third accuracy and the gradient change of the discrimination network, the parameters of the generation network are updated to obtain a trained generation model.

In this optional implementation, the output of the generation network is a sequence string (array) representing fake disease symptom relationship data, that is, a fake sample. The fake samples and the second training sample data set can be determined as the fourth training sample data set, and the fourth training sample data set is used for training to obtain a third disease classification model. The same test data set can be used to determine the accuracy of the third disease classification model (the third accuracy). The parameters of the generation network are then updated according to the third accuracy and the gradient change of the discrimination network to obtain a trained generation model.

Specifically, updating the parameters of the generation network according to the third accuracy and the gradient change of the discrimination network to obtain a trained generation model includes:

Determine the accuracy change rate according to the third accuracy and the first accuracy;

Obtaining a second gradient change according to the accuracy change rate and the first gradient change of the discrimination network;

Through the back propagation algorithm, according to the second gradient change, the parameters of the generation network are updated to obtain a trained generation model.

In this optional implementation, the difference between the third accuracy and the first accuracy may be divided by the third accuracy to obtain the accuracy change rate. The accuracy change rate can then be combined with the first gradient change of the discrimination network to obtain the second gradient change. Here the accuracy change rate is denoted PR, the first gradient change of the discrimination network is denoted D,G, and the second gradient change is denoted D_new,G. In addition, arg min means finding the parameter that minimizes a value, ε is the expectation, z is the constant controlling the parameter distribution, q(z) is the parameter distribution, D(G(z)) is the output of the discrimination network when the generation network generates good data, and D(G_ng(z)) is the output of the discrimination network when the generation network generates bad data. According to the accuracy change rate and the first gradient change of the discrimination network, the second gradient change is obtained by the formula:

D_new,G = PR*log((D,G)) + (1-PR)*log(1-(D,G));

By incorporating the accuracy change rate, it can be determined whether the direction in which the network parameters are modified is correct, which speeds up the training of the generative adversarial network.
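
As a worked illustration of the formula above, the sketch below computes the accuracy change rate and the second gradient change for scalar inputs. Treating the first gradient change as a single scalar strictly between 0 and 1 (so that both logarithms are defined) is an assumption made here for illustration, as are the function names and the example accuracies.

```python
import math


def accuracy_change_rate(third_accuracy, first_accuracy):
    """PR = (third accuracy - first accuracy) / third accuracy."""
    return (third_accuracy - first_accuracy) / third_accuracy


def second_gradient_change(pr, first_gradient_change):
    """D_new,G = PR * log(D,G) + (1 - PR) * log(1 - D,G).

    Assumes the first gradient change is a scalar in the open interval (0, 1).
    """
    d = first_gradient_change
    if not 0.0 < d < 1.0:
        raise ValueError("first_gradient_change must lie in (0, 1)")
    return pr * math.log(d) + (1.0 - pr) * math.log(1.0 - d)


# Hypothetical accuracies: the third model reaches 0.8, the first model 0.6.
pr = accuracy_change_rate(third_accuracy=0.8, first_accuracy=0.6)
d_new = second_gradient_change(pr, first_gradient_change=0.5)
```

With these inputs PR is 0.25, and because log(0.5) appears in both terms, the second gradient change equals log(0.5).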

Wherein, the loss function of the discrimination network is a cross-entropy loss function.

The discrimination network is a supervised network whose loss function is the cross-entropy loss. During training, back propagation is applied to the current classification result, and the parameters of the discrimination network are updated along the gradient descent direction. The task of the generation network is to find the optimal parameters that describe the true distribution. Its parameters are also updated by back propagation, with the direction of the gradient change coming from the gradient passed back by the discrimination network. Here the Nash equilibrium is V(D,G), p_data(x) is the distribution of real sample data input to the discrimination network, and p_z(z) is the distribution of fake sample data input to the discrimination network. The overall optimization formula of the generation network and the discrimination network is:
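
The symbols defined above match the standard GAN minimax objective, so the formula presumably takes the following form. This is an assumption based on the usual formulation, in which p_z(z) is the distribution of the generator input z; it is not a formula taken from this text.

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```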

S15. The electronic device inputs the name vector to the trained generation model to obtain a generated sample data set. The disease types of the multiple generated samples included in the generated sample data set are consistent with the target disease types.

In the embodiment of the present application, the trained generation model may be used to generate multiple generated samples consistent with the target disease type.

S16. The electronic device uses the first disease classification model to determine, according to the accuracy of the first disease classification model, whether multiple generated samples in the generated sample data set can be used for model training. If so, step S17 is performed; if not, the process ends.

In the embodiment of the present application, in order to ensure that the multiple generated samples in the generated sample data set can be used for model training, the first disease classification model needs to be used to judge the multiple generated samples to ensure the validity of the generated sample data set.

Specifically, the first disease classification model is trained using a second training sample data set, and using the first disease classification model to determine, according to the accuracy of the first disease classification model, whether multiple generated samples in the generated sample data set can be used for model training includes:

Determining the first accuracy of the first disease classification model according to the test data set;

Determining the plurality of generated samples and the second training sample data set as a third training sample data set;

Training the third training sample data set to obtain a second disease classification model;

Determine the second accuracy of the second disease classification model according to the test data set;

Determine whether the second accuracy is greater than the first accuracy;

If the second accuracy is greater than the first accuracy, determining that the multiple generated samples can be used for model training; or

If the second accuracy is less than or equal to the first accuracy, it is determined that the multiple generated samples cannot be used for model training.

In this optional implementation, the first disease classification model can be used to classify the symptoms in the test data set to obtain disease classification results. The correct and incorrect disease classification results of the first disease classification model are counted, and the correct rate of disease classification of the first disease classification model (i.e., the first accuracy) is determined from these statistics. The multiple generated samples and the second training sample data set can be determined as a third training sample data set, which is used to train a second disease classification model. The second disease classification model is likewise used to classify the symptoms in the test data set, its correct and incorrect disease classification results are counted, and its correct rate (i.e., the second accuracy) is determined from these statistics. It is then determined whether the second accuracy is greater than the first accuracy. If the second accuracy is greater than the first accuracy, the second disease classification model, trained with the multiple generated samples added, is more accurate than the first disease classification model, trained without them, so the multiple generated samples can be used for model training. If the second accuracy is less than or equal to the first accuracy, the second disease classification model is no more accurate than the first disease classification model, and may even be less accurate, so the multiple generated samples cannot be used for model training.
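
The comparison of the first and second accuracy can be sketched as follows. The lookup-table classifier is a hypothetical stand-in chosen only so that the example runs without external libraries; the application does not specify the classifier, and the function names and the toy data are likewise assumptions.

```python
from collections import Counter, defaultdict


def train_lookup_classifier(samples):
    """Toy stand-in classifier: memorizes the most common disease per symptom array."""
    by_symptoms = defaultdict(Counter)
    overall = Counter()
    for symptoms, disease in samples:
        by_symptoms[tuple(symptoms)][disease] += 1
        overall[disease] += 1
    fallback = overall.most_common(1)[0][0]

    def model(symptoms):
        votes = by_symptoms.get(tuple(symptoms))
        return votes.most_common(1)[0][0] if votes else fallback

    return model


def accuracy(model, test_set):
    """Correct rate: correct classification results divided by all results."""
    correct = sum(1 for symptoms, disease in test_set if model(symptoms) == disease)
    return correct / len(test_set)


def generated_samples_usable(train, second_training_set, generated, test_set):
    """Return True only if adding the generated samples raises test accuracy."""
    first_accuracy = accuracy(train(second_training_set), test_set)
    third_training_set = second_training_set + generated
    second_accuracy = accuracy(train(third_training_set), test_set)
    return second_accuracy > first_accuracy


# Hypothetical data: "gout" is absent from the second training sample data set.
second_training_set = [((1, 0, 0), "flu"), ((1, 0, 0), "flu"), ((0, 1, 0), "cold")]
generated = [((0, 0, 1), "gout")]
test_set = [((1, 0, 0), "flu"), ((0, 0, 1), "gout")]
usable = generated_samples_usable(
    train_lookup_classifier, second_training_set, generated, test_set
)
```

In this toy data the generated sample lets the second model classify the "gout" test case correctly, so the second accuracy exceeds the first and the generated samples are accepted.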

S17. The electronic device determines the real sample data set and the generated sample data set as the first training sample data set of the auxiliary diagnosis model.

In the embodiment of the present application, if multiple generated samples in the generated sample data set can be used for model training, the generated sample data set and the real sample data set can be used together for model training, which ensures a sufficient number of samples and improves the accuracy of the trained model.

In the method flow described in FIG. 1, the target samples, which are few in number, can be determined first. The generation network is then trained according to the first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model. The generation model is used to generate multiple generated samples consistent with the target disease type, thereby increasing the number of samples of the target disease type, and the first disease classification model is used to determine whether the multiple generated samples can be used for model training. If they can, the multiple generated samples are added to the training sample data set, which expands the number of samples used to train the auxiliary diagnosis model and improves its accuracy.

This application can be applied to the technical fields of digital medical care such as smart medical care, precision medical care, and AI+ medical care, and can promote the development of digital medical care.

The above are only specific implementations of this application, but the scope of protection of this application is not limited thereto. Those of ordinary skill in the art can make improvements without departing from the inventive concept of this application, and such improvements all fall within the scope of protection of this application.

Please refer to FIG. 2. FIG. 2 is a functional module diagram of a preferred embodiment of a training sample expansion device disclosed in the present application.

In some embodiments, the training sample expansion device runs in an electronic device. The training sample expansion device may include multiple functional modules composed of program code segments, where a program is a series of computer-readable instruction codes. The program code of each program segment in the training sample expansion device can be stored in a memory and executed by at least one processor to perform some or all of the steps of the training sample expansion method described in FIG. 1. For details, refer to the related description of the method shown in FIG. 1, which is not repeated here.

In this embodiment, the training sample expansion device can be divided into multiple functional modules according to the functions it performs. The functional modules may include: an acquisition module 201, a determination module 202, a conversion module 203, a training module 204, an input module 205, and a judgment module 206. The module referred to in this application refers to a series of computer-readable instruction segments that can be executed by at least one processor and can complete fixed functions, and are stored in a memory.

The obtaining module 201 is configured to obtain a real sample data set when the auxiliary diagnosis model needs to be trained, wherein the real sample data set is composed of samples of multiple disease types, and each of the disease types includes at least one disease symptom.

The determining module 202 is configured to determine the sample of the target disease type as the target sample when the number of samples of the target disease type in the samples of the multiple disease types is less than a preset number threshold.

In the embodiment of this application, a quantity threshold can be set in advance. When the number of samples of a certain disease type is smaller than this quantity threshold, there are not enough samples, so an auxiliary diagnosis model trained with the samples of that disease type may not be accurate. Therefore, the samples of the disease type need to be expanded to increase their number, which can improve the accuracy of the trained auxiliary diagnosis model.

The conversion module 203 is configured to perform vector conversion of the disease name corresponding to the target sample through a pre-trained conversion network to obtain a name vector.

The training module 204 is configured to train the generation network based on the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model.

Wherein, the generation network and the discrimination network jointly form a generative adversarial network (GAN), a deep learning framework that trains a generative model through an adversarial process. The purpose of training the generative adversarial network is to make the distribution of the generated samples as close as possible to that of the real samples, so that the generated samples can explain the real data. During training, a generative model G is trained to generate realistic samples from random noise or latent variables, while a discriminant model D is trained to distinguish real samples (i.e., input samples) from generated samples. The generative model G and the discriminant model D are trained at the same time, over multiple rounds, until a Nash equilibrium is reached, at which point the samples generated by G are indistinguishable from the real samples and D can no longer correctly tell them apart.

The input module 205 is configured to input the name vector into the trained generation model to obtain a generated sample data set, and the generated sample data set includes the multiple generated samples whose disease types are consistent with the target disease types.

The judging module 206 is configured to use the first disease classification model to determine whether multiple generated samples in the generated sample data set can be used for model training according to the accuracy of the first disease classification model.

The determining module 202 is further configured to, if multiple generated samples in the generated sample data set can be used for model training, determine the real sample data set and the generated sample data set as the first training sample data set of the auxiliary diagnosis model.

As an optional implementation manner, the first disease classification model is trained using a second training sample data set, and the manner in which the judging module 206 uses the first disease classification model to determine, according to the accuracy of the first disease classification model, whether multiple generated samples in the generated sample data set can be used for model training is specifically:

Determine whether the second accuracy is greater than the first accuracy;
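The accuracy comparison can be sketched as below, following the steps recited in the claims: the first accuracy is measured on a test data set, a second model is trained on the second training sample data set plus the generated samples, and the generated samples are accepted only if the second accuracy is strictly greater. The symptom-overlap classifier, all function names, and the example data are assumptions made purely for illustration; they are not the disease classification model of this application.

```python
def train_overlap_classifier(samples):
    """Toy classifier: remember every symptom seen for each disease."""
    model = {}
    for disease, symptoms in samples:
        model.setdefault(disease, set()).update(symptoms)
    return model


def accuracy(model, test_set):
    """Fraction of test samples whose best-overlapping disease is correct."""
    correct = 0
    for disease, symptoms in test_set:
        predicted = max(model, key=lambda d: len(model[d] & set(symptoms)))
        correct += predicted == disease
    return correct / len(test_set)


def generated_samples_usable(second_training_set, generated_samples, test_set):
    """Return True only if adding the generated samples improves accuracy."""
    first_model = train_overlap_classifier(second_training_set)
    first_accuracy = accuracy(first_model, test_set)
    # Third training sample data set = second training set + generated samples.
    third_training_set = list(second_training_set) + list(generated_samples)
    second_model = train_overlap_classifier(third_training_set)
    second_accuracy = accuracy(second_model, test_set)
    return second_accuracy > first_accuracy


second_training_set = [("flu", ["fever", "cough"]), ("x", ["rash"])]
test_set = [("flu", ["fever"]), ("x", ["itching"]), ("x", ["rash", "itching"])]
generated = [("x", ["itching", "rash"])]

usable = generated_samples_usable(second_training_set, generated, test_set)
unchanged = generated_samples_usable(second_training_set, [], test_set)
```

In this toy data the generated sample teaches the classifier a new symptom of disease `x`, so the second accuracy rises and the sample is accepted; with no generated samples the accuracy is unchanged and the check fails.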

As an optional implementation manner, the determining module 202 is further configured to, after the conversion module 203 performs vector conversion of the disease name corresponding to the target sample through a pre-trained conversion network to obtain the name vector, determine the dimension of the name vector as the dimension of the input array of the generating network;

The determining module 202 is further configured to determine the number of all symptoms in the disease symptom relation database corresponding to the name vector as the dimension of the output array of the generating network, and to determine the preset value as the value of the elements of the output array of the generating network;
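A minimal sketch of how these array sizes might be derived is given below. It assumes that the disease symptom relation database can be represented as a mapping from disease name to a list of symptoms and that "all symptoms" means the distinct symptoms across that mapping; the function name `configure_generator_io`, the default preset value of 0.0, and the example data are hypothetical.

```python
def configure_generator_io(name_vector, symptom_database, preset_value=0.0):
    """Derive the input/output array sizes of the generating network.

    Returns (input_dim, output_dim, output_array), where output_array is
    initialised with the preset value.
    """
    # Input dimension equals the dimension of the name vector.
    input_dim = len(name_vector)
    # Output dimension equals the number of distinct symptoms in the database.
    all_symptoms = {s for symptoms in symptom_database.values() for s in symptoms}
    output_dim = len(all_symptoms)
    output_array = [preset_value] * output_dim
    return input_dim, output_dim, output_array


name_vector = [0.1, 0.2, 0.3]
symptom_database = {"flu": ["fever", "cough"], "x": ["rash", "fever"]}
input_dim, output_dim, output_array = configure_generator_io(
    name_vector, symptom_database
)
```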

The training module 204 trains the generation network based on the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, and the specific method for obtaining the trained generation model is as follows:

As an optional implementation manner, the training module 204 includes:

A generating sub-module is used to generate a plurality of fake samples whose disease type is consistent with the target disease type using the generating network;

A determining sub-module, configured to determine the plurality of fake samples and the second training sample data set as a fourth training sample data set;

The training sub-module is used to train the fourth training sample data set to obtain a third disease classification model;

The determining sub-module is also used to determine the third accuracy of the third disease classification model;

The update sub-module is used to update the parameters of the generation network according to the third accuracy and the gradient change of the discrimination network to obtain a trained generation model.

As an optional implementation manner, the update submodule updates the parameters of the generation network according to the third accuracy and the gradient change of the discrimination network, and the specific method for obtaining the trained generation model is as follows:

D_new,G = PR * log((D, G)) + (1 - PR) * log(1 - (D, G));
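The formula above can be evaluated numerically as in the following sketch. The reading of the symbols is an assumption based on the claims: PR is taken to be the accuracy change rate, (D, G) the first gradient change of the discrimination network, and D_new,G the second gradient change. The sketch further assumes that (D, G) is a scalar strictly between 0 and 1 so that both logarithms are defined; how PR is computed from the third accuracy and the first accuracy is not specified here, so it is passed in as a parameter.

```python
import math


def second_gradient_change(pr, first_gradient_change):
    """Evaluate PR * log(x) + (1 - PR) * log(1 - x) for x = (D, G).

    `pr` is the assumed accuracy change rate; `first_gradient_change` must lie
    strictly between 0 and 1 under the assumptions stated above.
    """
    x = first_gradient_change
    if not 0.0 < x < 1.0:
        raise ValueError("first_gradient_change must be in the open interval (0, 1)")
    return pr * math.log(x) + (1.0 - pr) * math.log(1.0 - x)


value = second_gradient_change(pr=0.5, first_gradient_change=0.5)
```

With PR = 0.5 and (D, G) = 0.5, both terms contribute equally and the result equals log(0.5).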

In the training sample expansion device described in FIG. 2, target samples that are few in number can be determined. The generation network is then trained according to the first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model. The generation model is used to generate multiple generated samples consistent with the target disease type, thereby increasing the number of samples of the target disease type, and the first disease classification model is used to judge whether the multiple generated samples can be used for model training. If the multiple generated samples can be used for model training, they can be added to the training sample data set, which expands the number of samples used to train the auxiliary diagnosis model and improves the accuracy of the auxiliary diagnosis model.

As shown in FIG. 3, FIG. 3 is a schematic structural diagram of an electronic device implementing a preferred embodiment of the training sample expansion method of the present application. The electronic device 3 includes a memory 31, at least one processor 32, computer readable instructions 33 stored in the memory 31 and executable on the at least one processor 32, and at least one communication bus 34.

Those skilled in the art can understand that the schematic diagram shown in FIG. 3 is only an example of the electronic device 3 and does not constitute a limitation on the electronic device 3. The electronic device 3 may include more or fewer components than those shown in the figure, may combine certain components, or may have different components. For example, the electronic device 3 may also include input and output devices, network access devices, and so on.

The electronic device 3 also includes, but is not limited to, any electronic product that can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a personal digital assistant (Personal Digital Assistant, PDA), a game console, an interactive network television (Internet Protocol Television, IPTV), a smart wearable device, etc.

The at least one processor 32 may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The processor 32 may be a microprocessor, or the processor 32 may be any conventional processor. The processor 32 is the control center of the electronic device 3 and connects the various parts of the entire electronic device 3 through various interfaces and lines.

The memory 31 may be used to store the computer-readable instructions 33 and/or modules/units. The processor 32 realizes various functions of the electronic device 3 by running or executing the computer-readable instructions and/or modules/units stored in the memory 31 and by calling the data stored in the memory 31. The memory 31 may mainly include a storage program area and a storage data area. The storage program area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), and the like; the storage data area may store data and the like created in accordance with the use of the electronic device 3. In addition, the memory 31 may include volatile memory such as high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another non-volatile solid state storage device.

With reference to FIG. 1, the memory 31 in the electronic device 3 stores multiple instructions to implement a training sample expansion method, and the processor 32 can execute the multiple instructions to achieve:

Specifically, for the specific implementation method of the above-mentioned instructions by the processor 32, reference may be made to the description of the relevant steps in the embodiment corresponding to FIG. 1, which will not be repeated here.

In the electronic device 3 described in FIG. 3, target samples that are few in number can be determined. The generation network is then trained according to the first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model. The generation model is used to generate multiple generated samples consistent with the target disease type, thereby increasing the number of samples of the target disease type, and the first disease classification model is used to judge whether the multiple generated samples can be used for model training. If the multiple generated samples can be used for model training, they can be added to the training sample data set, which expands the number of samples used to train the auxiliary diagnosis model and improves the accuracy of the auxiliary diagnosis model.

If the integrated module/unit of the electronic device 3 is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium, which may be a non-volatile storage medium or a volatile storage medium. Based on this understanding, all or part of the processes in the methods of the above-mentioned embodiments of this application can also be completed by instructing relevant hardware through computer-readable instructions, and the computer-readable instructions can be stored in a computer-readable storage medium. When the computer-readable instructions are executed by the processor, they can implement the steps of the foregoing method embodiments. Wherein, the computer-readable instructions include computer-readable instruction code, and the computer-readable instruction code may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer-readable instruction code, a recording medium, a U disk, a mobile hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), and a random access memory (RAM, Random Access Memory).

In the several embodiments provided in this application, it should be understood that the disclosed system, device, and method can be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of the modules is only a logical function division, and there may be other division methods in actual implementation.

The modules described as separate components may or may not be physically separated, and the components displayed as modules may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional modules in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional modules.

For those skilled in the art, it is obvious that the present application is not limited to the details of the foregoing exemplary embodiments, and the present application can be implemented in other specific forms without departing from the spirit or basic characteristics of the application. Therefore, from whichever point of view, the embodiments should be regarded as exemplary and non-limiting. The scope of this application is defined by the appended claims rather than by the above description, and therefore all changes that fall within the meaning and scope of equivalents of the claims are intended to be included in this application. Any reference signs in the claims should not be regarded as limiting the claims involved. In addition, it is obvious that the word "including" does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices stated in the system claims can also be implemented by one unit or device through software or hardware. Words such as "first" and "second" are used to indicate names and do not indicate any specific order.

Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the application and not to limit them. Although the application has been described in detail with reference to the preferred embodiments, those of ordinary skill in the art should understand that modifications or equivalent replacements can be made to the technical solutions of the application without departing from the spirit and scope of the technical solutions of the present application.

Claims

A training sample expansion method, wherein the training sample expansion method includes:

When the auxiliary diagnosis model needs to be trained, a real sample data set is obtained, where the real sample data set is composed of samples of multiple disease types, and each of the disease types includes at least one disease symptom;

When the number of samples of the target disease type in the samples of the multiple disease types is less than the preset number threshold, determining the sample of the target disease type as the target sample;

Using a pre-trained conversion network, vector conversion is performed on the disease name corresponding to the target sample to obtain a name vector;

Training the generation network according to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain the trained generation model;

Inputting the name vector into the trained generation model to obtain a generated sample data set, the disease type of the multiple generated samples included in the generated sample data set is consistent with the target disease type;

Using the first disease classification model to determine whether multiple generated samples in the generated sample data set can be used for model training according to the accuracy of the first disease classification model;

If multiple generated samples in the generated sample data set can be used for model training, the real sample data set and the generated sample data set are determined as the first training sample data set of the auxiliary diagnosis model.
The training sample expansion method according to claim 1, wherein the first disease classification model is trained using a second training sample data set, and the using the first disease classification model to determine, according to the accuracy of the first disease classification model, whether multiple generated samples in the generated sample data set can be used for model training includes:

Determining the first accuracy of the first disease classification model according to the test data set;

Determining the plurality of generated samples and the second training sample data set as a third training sample data set;

Training the third training sample data set to obtain a second disease classification model;

Determine the second accuracy of the second disease classification model according to the test data set;

Determine whether the second accuracy is greater than the first accuracy;

If the second accuracy is greater than the first accuracy, determining that the multiple generated samples can be used for model training; or

If the second accuracy is less than or equal to the first accuracy, it is determined that the multiple generated samples cannot be used for model training.
The training sample expansion method according to claim 1, wherein after the vector conversion is performed on the disease name corresponding to the target sample through the pre-trained conversion network to obtain the name vector, the training sample expansion method further comprises:

Determining the dimension of the name vector as the dimension of the input array of the generating network;

Determining the number of all symptoms in the disease symptom relation database corresponding to the name vector as the dimensional size of the output array of the generating network, and determining the preset value as the value of the element of the output array of the generating network;

According to the pre-trained first disease classification model, training the generation network based on the accuracy of the first disease classification model and the gradient change of the discrimination network, and obtaining the trained generation model includes:

According to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, training the generation network according to the dimension of the input array of the generation network, the dimension of the output array, and the values of the elements of the output array, to obtain a trained generation model.
The training sample expansion method according to claim 2, wherein the training the generation network according to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model includes:

Using a generation network to generate multiple fake samples with disease types consistent with the target disease types;

Determining the plurality of fake samples and the second training sample data set as a fourth training sample data set;

Training the fourth training sample data set to obtain a third disease classification model;

Determining the third accuracy of the third disease classification model;

According to the third accuracy and the gradient change of the discrimination network, the parameters of the generation network are updated to obtain a trained generation model.
The training sample expansion method according to claim 4, wherein the updating the parameters of the generation network according to the third accuracy and the gradient change of the discrimination network to obtain a trained generation model comprises:

Determine the accuracy change rate according to the third accuracy and the first accuracy;

Obtaining a second gradient change according to the accuracy change rate and the first gradient change of the discrimination network;

Through the back propagation algorithm, according to the second gradient change, the parameters of the generation network are updated to obtain a trained generation model.
The training sample expansion method according to any one of claims 1 to 5, wherein the loss function of the discriminant network is a cross-entropy loss function.
The training sample expansion method according to any one of claims 1 to 5, wherein the conversion network is trained using complete physical nouns as input, the output value of the discrimination network lies within a preset numerical range, and the output value is used to measure the probability that the input of the discrimination network is false data.
An electronic device, wherein the electronic device includes a processor and a memory, and the processor is configured to execute at least one computer-readable instruction stored in the memory to implement the following steps:

When the auxiliary diagnosis model needs to be trained, a real sample data set is obtained, where the real sample data set is composed of samples of multiple disease types, and each of the disease types includes at least one disease symptom;

When the number of samples of the target disease type in the samples of the multiple disease types is less than the preset number threshold, determining the sample of the target disease type as the target sample;

Using a pre-trained conversion network, vector conversion is performed on the disease name corresponding to the target sample to obtain a name vector;

Training the generation network according to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain the trained generation model;

Inputting the name vector into the trained generation model to obtain a generated sample data set, the disease type of the multiple generated samples included in the generated sample data set is consistent with the target disease type;

Using the first disease classification model to determine whether multiple generated samples in the generated sample data set can be used for model training according to the accuracy of the first disease classification model;

If multiple generated samples in the generated sample data set can be used for model training, the real sample data set and the generated sample data set are determined as the first training sample data set of the auxiliary diagnosis model.
The electronic device according to claim 8, wherein the first disease classification model is trained using a second training sample data set, and when using the first disease classification model to determine, according to the accuracy of the first disease classification model, whether multiple generated samples in the generated sample data set can be used for model training, the processor executes the at least one computer-readable instruction to implement the following steps:

Determining the first accuracy of the first disease classification model according to the test data set;

Determining the plurality of generated samples and the second training sample data set as a third training sample data set;

Training the third training sample data set to obtain a second disease classification model;

Determine the second accuracy of the second disease classification model according to the test data set;

Determine whether the second accuracy is greater than the first accuracy;

If the second accuracy is greater than the first accuracy, determining that the multiple generated samples can be used for model training; or

If the second accuracy is less than or equal to the first accuracy, it is determined that the multiple generated samples cannot be used for model training.
The electronic device according to claim 8, wherein, after the disease name corresponding to the target sample is vector-converted through a pre-trained conversion network to obtain a name vector, the processor executes the at least one computer-readable instruction to implement the following steps:

Determining the dimension of the name vector as the dimension of the input array of the generating network;

Determining the number of all symptoms in the disease symptom relation database corresponding to the name vector as the dimension size of the output array of the generating network, and determining the preset value as the value of the element of the output array of the generating network;

According to the pre-trained first disease classification model, training the generation network based on the accuracy of the first disease classification model and the gradient change of the discrimination network, and obtaining the trained generation model includes:

According to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, training the generation network according to the dimension of the input array of the generation network, the dimension of the output array, and the values of the elements of the output array, to obtain a trained generation model.
The electronic device according to claim 9, wherein, when training the generation network according to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model, the processor executes the at least one computer-readable instruction to implement the following steps:

Using a generation network to generate multiple fake samples with disease types consistent with the target disease types;

Determining the plurality of fake samples and the second training sample data set as a fourth training sample data set;

Training the fourth training sample data set to obtain a third disease classification model;

Determining the third accuracy of the third disease classification model;

According to the third accuracy and the gradient change of the discrimination network, the parameters of the generation network are updated to obtain a trained generation model.
The electronic device according to claim 11, wherein, when the parameters of the generation network are updated according to the third accuracy and the gradient change of the discrimination network to obtain a trained generation model, the processor executes the at least one computer-readable instruction to implement the following steps:

Determine the accuracy change rate according to the third accuracy and the first accuracy;

Obtaining a second gradient change according to the accuracy change rate and the first gradient change of the discrimination network;

Through the back propagation algorithm, according to the second gradient change, the parameters of the generation network are updated to obtain a trained generation model.
The electronic device according to any one of claims 8 to 12, wherein the loss function of the discrimination network is a cross-entropy loss function.
A computer-readable storage medium, wherein the computer-readable storage medium stores at least one computer-readable instruction, and when the at least one computer-readable instruction is executed by a processor, the following steps are implemented:

When the auxiliary diagnosis model needs to be trained, a real sample data set is obtained, where the real sample data set is composed of samples of multiple disease types, and each of the disease types includes at least one disease symptom;

When the number of samples of the target disease type in the samples of the multiple disease types is less than the preset number threshold, determining the sample of the target disease type as the target sample;

Using a pre-trained conversion network, vector conversion is performed on the disease name corresponding to the target sample to obtain a name vector;

Training the generation network according to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain the trained generation model;

Inputting the name vector into the trained generation model to obtain a generated sample data set, the disease type of the multiple generated samples included in the generated sample data set is consistent with the target disease type;

Using the first disease classification model to determine whether multiple generated samples in the generated sample data set can be used for model training according to the accuracy of the first disease classification model;

If multiple generated samples in the generated sample data set can be used for model training, the real sample data set and the generated sample data set are determined as the first training sample data set of the auxiliary diagnosis model.
The storage medium according to claim 14, wherein the first disease classification model is trained using a second training sample data set, and when using the first disease classification model to determine, according to the accuracy of the first disease classification model, whether multiple generated samples in the generated sample data set can be used for model training, the at least one computer-readable instruction is executed by the processor to implement the following steps:

Determining the first accuracy of the first disease classification model according to the test data set;

Determining the plurality of generated samples and the second training sample data set as a third training sample data set;

Training the third training sample data set to obtain a second disease classification model;

Determine the second accuracy of the second disease classification model according to the test data set;

Determine whether the second accuracy is greater than the first accuracy;

If the second accuracy is greater than the first accuracy, determining that the multiple generated samples can be used for model training; or

If the second accuracy is less than or equal to the first accuracy, it is determined that the multiple generated samples cannot be used for model training.
The storage medium according to claim 14, wherein, after the disease name corresponding to the target sample is vector-converted through a pre-trained conversion network to obtain a name vector, the at least one computer-readable instruction is executed by the processor to implement the following steps:

Determining the dimension of the name vector as the dimension of the input array of the generating network;

Determining the number of all symptoms in the disease symptom relation database corresponding to the name vector as the dimensional size of the output array of the generating network, and determining the preset value as the value of the element of the output array of the generating network;

According to the pre-trained first disease classification model, training the generation network based on the accuracy of the first disease classification model and the gradient change of the discrimination network, and obtaining the trained generation model includes:

According to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, training the generation network according to the dimension of the input array of the generation network, the dimension of the output array, and the values of the elements of the output array, to obtain a trained generation model.
The storage medium according to claim 15, wherein, when training the generation network according to the pre-trained first disease classification model, based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model, the at least one computer-readable instruction is executed by the processor to implement the following steps:

Using a generation network to generate multiple fake samples with disease types consistent with the target disease types;

Determining the plurality of fake samples and the second training sample data set as a fourth training sample data set;

Training the fourth training sample data set to obtain a third disease classification model;

Determining the third accuracy of the third disease classification model;

According to the third accuracy and the gradient change of the discrimination network, the parameters of the generation network are updated to obtain a trained generation model.
The storage medium according to claim 17, wherein, when the parameters of the generation network are updated according to the third accuracy and the gradient change of the discrimination network to obtain a trained generation model, the at least one computer-readable instruction, when executed by the processor, is further used to implement the following steps:

Determine the accuracy change rate according to the third accuracy and the first accuracy;

Obtaining a second gradient change according to the accuracy change rate and the first gradient change of the discrimination network;

Through the back propagation algorithm, according to the second gradient change, the parameters of the generation network are updated to obtain a trained generation model.
The storage medium according to any one of claims 14 to 18, wherein the loss function of the discrimination network is a cross-entropy loss function.
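
For reference, a cross-entropy loss for a two-class discrimination network can be sketched as below. The sketch assumes the binary form with label 1 for real samples and label 0 for generated samples; the claim itself names only the loss family, and the function name and the clipping constant `eps` are illustrative choices.

```python
import math

def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        # Clip the probability to avoid log(0).
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# One real sample scored 0.9 and one generated sample scored 0.2 (toy values).
loss = binary_cross_entropy([1, 0], [0.9, 0.2])
```

The loss shrinks as the discrimination network assigns higher probabilities to real samples and lower probabilities to generated ones.
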
A training sample expansion device, wherein the training sample expansion device includes:

The obtaining module is used to obtain a real sample data set when it is necessary to train an auxiliary diagnosis model, wherein the real sample data set is composed of samples of multiple disease types, and each sample of a disease type includes at least one disease symptom;

The determining module is configured to determine the sample of the target disease type as the target sample when the number of samples of the target disease type in the samples of the multiple disease types is less than a preset number threshold;

The conversion module is used to perform vector conversion of the disease name corresponding to the target sample through a pre-trained conversion network to obtain a name vector;

The training module is used to train the generation network according to the pre-trained first disease classification model, and based on the accuracy of the first disease classification model and the gradient change of the discrimination network, to obtain a trained generation model;

The input module is configured to input the name vector into the trained generation model to obtain a generated sample data set, wherein the disease types of the multiple generated samples included in the generated sample data set are consistent with the target disease type;

The judging module is configured to use the first disease classification model to determine, according to its accuracy, whether multiple generated samples in the generated sample data set can be used for model training;

The determining module is further configured to determine the real sample data set and the generated sample data set as the first training sample data set of the auxiliary diagnosis model if multiple generated samples in the generated sample data set can be used for model training.
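
The two roles of the determining module can be sketched in a few lines. This is an illustration under assumptions, not the claimed device: samples are represented as dictionaries with a hypothetical `disease` key, the function names are invented, and the decision of the judging module is passed in as a plain boolean because its criterion is not restated here.

```python
from collections import Counter

def select_target_samples(samples, threshold):
    """Return samples whose disease type has fewer than `threshold` samples."""
    counts = Counter(s["disease"] for s in samples)
    target_types = {d for d, n in counts.items() if n < threshold}
    return [s for s in samples if s["disease"] in target_types]

def build_first_training_set(real_samples, generated_samples, usable):
    """Merge real and generated samples only if the generated ones are usable."""
    if usable:
        return real_samples + generated_samples
    return list(real_samples)

# Toy real sample data set (values assumed for illustration).
real = [
    {"disease": "flu", "symptoms": ["fever"]},
    {"disease": "flu", "symptoms": ["cough"]},
    {"disease": "flu", "symptoms": ["fever", "cough"]},
    {"disease": "rare_x", "symptoms": ["rash"]},
]
targets = select_target_samples(real, threshold=2)

# Generated samples would come from the trained generation model.
generated = [{"disease": "rare_x", "symptoms": ["rash", "fever"]}]
first_training_set = build_first_training_set(real, generated, usable=True)
```

Here only `rare_x` falls below the threshold, so its single sample becomes the target sample, and the first training sample data set contains the four real samples plus the one generated sample.
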