Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar items and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances, in other words that the embodiments described are to be practiced in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and any other variation thereof, may also include other things, such as processes, methods, systems, articles, or apparatus that comprise a list of steps or elements is not necessarily limited to only those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such processes, methods, articles, or apparatus.
It should be noted that the descriptions relating to "first", "second", etc. in the present invention are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In addition, the technical solutions of the various embodiments may be combined with each other, but only insofar as such combinations can be realized by a person skilled in the art; when the technical solutions are contradictory or cannot be realized, the combination should be considered not to exist and falls outside the protection scope of the present invention.
Please refer to fig. 1, which is a flowchart of a training method of an image classification model according to an embodiment of the present invention. The training method is used to train the image classification model, and the trained image classification model can classify unlabeled images. The training method of the image classification model specifically comprises the following steps.
Step S102, the first images are divided into several subsets. In the present embodiment, all the first images are randomly and uniformly divided into a plurality of subsets, and each first image belongs to exactly one subset. Uniform division means that the number of first images in each subset is equal, or approximately equal, for example differing by 1-5 between subsets. In this embodiment, the first images are labeled images, and each first image has a label. Each first image includes a first original label, which is a label vector. All the first images belong to a plurality of categories, the number of categories is a preset value, and the label of each first image corresponds to one category. For example, all first images belong to 5 categories A, B, C, D, E, and the label of each first image corresponds to one of these five categories. If the label a of a first image corresponds to category A, the first original label of that first image is (1, 0, 0, 0, 0); if the label d of a first image corresponds to category D, the first original label of that first image is (0, 0, 0, 1, 0). It will be appreciated that the number of values in the first original label is the same as the number of categories, i.e. the preset value, and each value in the first original label corresponds one-to-one to a category. When the label of a first image corresponds to a certain category, the value corresponding to that category in the first original label is 1 and the remaining values are 0. That is, the first original label is a one-hot vector.
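The one-hot labelling and uniform random division described above can be sketched as follows (a minimal illustration; the category list, image identifiers, and image count are placeholders, not values from the source):

```python
import random

def one_hot(label, categories):
    """Return the one-hot label vector for `label` over `categories`."""
    return [1 if c == label else 0 for c in categories]

def split_into_subsets(images, num_subsets, seed=0):
    """Randomly and uniformly divide `images` into `num_subsets` subsets;
    subset sizes differ by at most 1 after shuffling."""
    images = list(images)
    random.Random(seed).shuffle(images)
    return [images[i::num_subsets] for i in range(num_subsets)]

categories = ["A", "B", "C", "D", "E"]
print(one_hot("A", categories))  # label for category A -> [1, 0, 0, 0, 0]
print(one_hot("D", categories))  # label for category D -> [0, 0, 0, 1, 0]

subsets = split_into_subsets(range(103), num_subsets=5)
print([len(s) for s in subsets])  # sizes are equal or differ by 1
```

With 103 images and 5 subsets the sizes cannot all be equal, which illustrates the "approximately equal" case allowed by the embodiment.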
And step S104, constructing a plurality of collections, wherein each collection comprises a validation set and a training set in one-to-one correspondence. In each collection, the validation set is one of the subsets, and the training set includes the second image set and the remaining subsets. The second images in the second image set are unlabeled images. In this embodiment, the number of subsets is 5-10; accordingly, the number of collections is the same as the number of subsets, also 5-10. For example, suppose the number of subsets is 5, namely X1, X2, X3, X4, X5, and the second image set is Y. If the validation set of a collection is X1, its training set comprises X2, X3, X4, X5 and Y; if the validation set is X2, the training set comprises X1, X3, X4, X5 and Y; if the validation set is X3, the training set comprises X1, X2, X4, X5 and Y; if the validation set is X4, the training set comprises X1, X2, X3, X5 and Y; if the validation set is X5, the training set comprises X1, X2, X3, X4 and Y. That is, the number of collections is also 5, and each training set includes the second image set.
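The construction of collections above resembles building cross-validation folds over the labeled subsets, with the unlabeled set shared by every training set. A minimal sketch (the dictionary keys and placeholder image names are assumptions for illustration):

```python
def build_collections(subsets, unlabeled):
    """For each subset Xi, build a collection whose validation set is Xi and
    whose training set is the remaining subsets plus the unlabeled set Y."""
    collections = []
    for i, val in enumerate(subsets):
        train = [img for j, s in enumerate(subsets) if j != i for img in s]
        collections.append({
            "validation": val,
            "train_labeled": train,
            "train_unlabeled": list(unlabeled),  # every training set gets Y
        })
    return collections

subsets = [["x1"], ["x2"], ["x3"], ["x4"], ["x5"]]
cols = build_collections(subsets, unlabeled=["y1", "y2"])
print(len(cols))                 # 5 collections, same as the number of subsets
print(cols[0]["validation"])     # ['x1']
print(cols[0]["train_labeled"])  # ['x2', 'x3', 'x4', 'x5']
```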
After the several collections are constructed, enhancement processing is performed on both the first images and the second images in each training set. Specifically, each first image and each second image is sequentially subjected to padding, random cropping, and random horizontal flipping, so as to obtain a corresponding first enhanced image or second enhanced image. During random cropping, the padded first and second images are cropped to a preset size; that is, the first enhanced images and the second enhanced images all have the same size. Random horizontal flipping is probabilistic: a randomly cropped first or second image may or may not be horizontally flipped. In some possible embodiments, random horizontal flipping may be replaced by random vertical flipping, a weaker augmentation, or the like.
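The pad / random-crop / random-flip pipeline can be sketched on a single-channel image array as follows (the pad width, crop size, flip probability, and reflect-padding mode are illustrative assumptions, not values from the source):

```python
import numpy as np

rng = np.random.default_rng(0)

def enhance(img, pad=4, crop_size=32):
    """Pad, randomly crop back to `crop_size`, then horizontally flip with
    probability 0.5 -- a sketch of the enhancement processing in the text."""
    padded = np.pad(img, ((pad, pad), (pad, pad)), mode="reflect")
    top = rng.integers(0, padded.shape[0] - crop_size + 1)
    left = rng.integers(0, padded.shape[1] - crop_size + 1)
    crop = padded[top:top + crop_size, left:left + crop_size]
    if rng.random() < 0.5:        # probabilistic horizontal flip
        crop = crop[:, ::-1]
    return crop

img = rng.random((32, 32))
out = enhance(img)
print(out.shape)  # (32, 32) -- all enhanced images share the preset size
```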
And step S106, inputting each training set into a plurality of pairs of student and teacher networks, respectively, and acquiring corresponding output results. In this embodiment, each first image of a training set is input into the student network to obtain a corresponding first student result, and each second image of the same training set is input into both the student network and the teacher network to obtain a corresponding second student result and second teacher result. It can be understood that the student networks and teacher networks correspond one-to-one, the number of pairs of student and teacher networks is the same as the number of training sets, and each pair corresponds to one training set. Preferably, a first enhanced image in a training set is input into the corresponding student network to obtain a first student result, a second enhanced image in the same training set is input into the corresponding student network to obtain a second student result, and the second enhanced image is also input into the corresponding teacher network to obtain a second teacher result. The first student result, second student result and second teacher result are label vectors; the number of values in each label vector is the same as the number of categories, i.e. the preset value, and each value corresponds to one category. The category corresponding to the maximum value in a label vector is the category with the highest confidence, and the label corresponding to that category is the predicted label.
In some possible embodiments, the second images in the training set may be subjected to two enhancement processes to form two second enhanced images, and the two second enhanced images are input into the corresponding student network and teacher network respectively.
And step S108, updating the first parameters of the student network and the second parameters of the teacher network according to the output result. It will be appreciated that obtaining the output results and updating the first parameters of the student network and the second parameters of the teacher network based on the output results is an iterative training process. After all the first images and the second images in the training set are respectively input into the student network and the teacher network and corresponding output results are obtained, the first parameters of the student network and the second parameters of the teacher network are updated according to the output results each time. A specific process of updating the first parameter of the student network and the second parameter of the teacher network according to the output result will be described in detail below.
And step S110, selecting, according to the validation set, the optimal student network formed by the corresponding student network during the updating of the first parameters. In this embodiment, each time the first parameters of a student network are updated, the first images in the corresponding validation set are input into the student network to obtain corresponding verification results. Each verification result is a label vector; the number of values in the label vector is the same as the number of categories, i.e. the preset value, and each value corresponds to one category. The accuracy of the corresponding student network is then calculated according to the verification results and the first original labels. Specifically, after the first images in the validation set are adjusted to a preset size, each first image is input into the student network to obtain a corresponding verification result, and whether the verification result is correct is judged according to the first original label of that first image: the verification result is correct when the label corresponding to its maximum value is the same as the label corresponding to the first original label. The number of correct verification results over the same validation set is counted, and the accuracy of the student network is calculated from this count and the number of first images. The student network with the highest accuracy is selected as the optimal student network. It can be understood that after each update of the first parameters, the accuracy of the student network is calculated and compared with the accuracy obtained in the previous calculation, and the student network with the higher accuracy is retained.
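The accuracy computation over a validation set can be sketched as follows (the example predictions and labels are invented for illustration):

```python
def accuracy(predictions, one_hot_labels):
    """Fraction of validation images whose predicted category (the argmax of
    the verification result) matches the category of the first original label."""
    correct = 0
    for pred, label in zip(predictions, one_hot_labels):
        if pred.index(max(pred)) == label.index(max(label)):
            correct += 1
    return correct / len(predictions)

preds = [[0.1, 0.8, 0.05, 0.03, 0.02],   # argmax -> category 2: correct
         [0.6, 0.2, 0.1, 0.05, 0.05]]    # argmax -> category 1: wrong
labels = [[0, 1, 0, 0, 0],
          [0, 0, 1, 0, 0]]
print(accuracy(preds, labels))  # 0.5
```

The network achieving the highest such accuracy during the update process is kept as the optimal student network.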
In this embodiment, when the number of updates of the first parameter reaches a preset number, the update is ended. Correspondingly, the student network with higher accuracy is reserved as the student network with the highest accuracy in the updating process, and then the student network is selected as the optimal student network.
And step S112, taking the optimal student networks as image classification models. It will be appreciated that each collection is trained on a pair of student and teacher networks, respectively. Accordingly, the number of optimal student networks is the same as the number of collections.
In some possible embodiments, before step S102 is performed, all the first images are divided into a first set and a second set, and then the first images in the second set are divided into several subsets. Wherein a first image of the first set is used to evaluate an accuracy of the image classification model. The evaluating the accuracy of the image classification model specifically comprises: and after the first image is adjusted to a preset size, inputting the first image into an image classification model and obtaining a corresponding evaluation result, and calculating the accuracy of the image classification model according to the evaluation result and the first original label of the first image. In this embodiment, the first image is subjected to scaling processing so that the size of the first image is a preset size. Wherein the number of the first images in the first set is 10-20% of the number of all the first images, and the number of the first images in the second set is 80-90% of the number of all the first images. Preferably, the number of first images in the first set is 20% of all first images, and the number of first images in the second set is 80% of all first images.
In the above embodiment, the first images, i.e. the labeled images, are divided into a plurality of subsets, and the second images, i.e. the unlabeled images, are added to the subsets to form a plurality of training sets, which are respectively used to train a plurality of pairs of student and teacher networks. An optimal student network is then selected as the image classification model using the validation set corresponding to each training set. Through this semi-supervised deep learning method, an effective image classification model can be trained with a large number of unlabeled images even when the number of labeled images is small. This effectively solves the problem of low model accuracy caused by a small number of labeled images, improves the performance of the image classification model, and increases the accuracy of classification prediction. Meanwhile, the resulting image classification model has good robustness and is suitable for various image classification tasks.
In addition, uniformly dividing the first images into a plurality of subsets effectively ensures that the several trained optimal student networks have equal importance. Dividing into 5-10 subsets makes the ratio of the size of each validation set to that of its training set 1/10-1/5, which effectively ensures the training effect. Performing enhancement processing twice on the same second image and inputting the two resulting second enhanced images into the student network and the teacher network, respectively, further improves the generalization ability and robustness of the image classification model according to the second student result and the second teacher result.
It can be understood that the training method can be used to train not only image classification models but also classification models for speech, text and the like. When the training method is used to train a speech or text classification model, the first images are correspondingly replaced by labeled speech or labeled text, and the second images by unlabeled speech or unlabeled text; details are not repeated here.
Please refer to fig. 2, which is a sub-flowchart of a training method of an image classification model according to an embodiment of the present invention. Step S108 specifically includes the following steps.
Step S202, constructing a first loss according to the first student result and the first original label. In this embodiment, the cross entropy of each corresponding pair of first student result and first original label is calculated, and the average of all cross entropies corresponding to the same training set is used as the first loss. Specifically, the first loss is constructed using a first formula:

$$\mathcal{L}_1 = \frac{1}{N}\sum_{i=1}^{N}\mathrm{CE}\big(f_\theta(x_i),\, y_i\big)$$

where $\mathcal{L}_1$ represents the first loss, $x_i$ represents a first image, $N$ represents the number of first images in a training set, $y_i$ represents the first original label, $f_\theta(x_i)$ represents the first student result, and $\mathrm{CE}(\cdot,\cdot)$ represents the cross entropy of the first student result and the first original label. It is understood that each cross entropy is taken between the first student result and the first original label of the same first image, and each training set corresponds to one first loss.
And step S204, constructing a second loss according to the second student result and the second teacher result. In this embodiment, the mean square error of each corresponding pair of second student result and second teacher result is calculated, and the average of all mean square errors corresponding to the same training set is used as the second loss. Specifically, the second loss is constructed using a second formula:

$$\mathcal{L}_2 = \frac{1}{M}\sum_{j=1}^{M}\mathrm{MSE}\big(f_\theta(u_j),\, f_{\theta'}(u_j)\big)$$

where $\mathcal{L}_2$ represents the second loss, $u_j$ represents a second image, $M$ represents the number of second images in a training set, $f_\theta(u_j)$ represents the second student result, $f_{\theta'}(u_j)$ represents the second teacher result, and $\mathrm{MSE}(\cdot,\cdot)$ represents the mean square error of the second student result and the second teacher result. It will be appreciated that each mean square error is taken between the second student result and the second teacher result of the same second image, and each training set corresponds to one second loss.
And step S206, updating the first parameters of the student network and the second parameters of the teacher network according to the first loss and the second loss. In the present embodiment, the total loss is calculated from the first loss and the second loss. Specifically, the total loss is calculated using a third formula:

$$\mathcal{L} = \mathcal{L}_1 + \lambda\,\mathcal{L}_2$$

where $\mathcal{L}$ represents the total loss and $\lambda$ represents the first coefficient. After the total loss is calculated, the first parameters of the student network are updated according to the total loss by a gradient descent method, and the second parameters of the teacher network are updated according to the first parameters of the student network by a moving-average method. Specifically, the second parameters of the teacher network are updated using a fourth formula:

$$\theta'_t = \tfrac{1}{2}\big(\theta'_{t-1} + \theta_t\big)$$

where $\theta'_t$ represents the current second parameters of the teacher network, $\theta'_{t-1}$ represents the previous second parameters of the teacher network, and $\theta_t$ represents the current first parameters of the student network. Preferably, the first parameters of the student network are updated by a stochastic gradient descent method, and the second parameters of the teacher network are updated by an exponential moving-average method. Specifically, the second parameters of the teacher network are then updated using a fifth formula:

$$\theta'_t = \beta\,\theta'_{t-1} + (1-\beta)\,\theta_t$$

where $\beta$ represents the second coefficient. It will be appreciated that after the first parameters of the student network are updated with the total loss, the second parameters of the teacher network are updated with the updated first parameters; the second parameters of the teacher network are thus an exponential moving average of the first parameters of the student network.
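One parameter update in this student-teacher scheme can be sketched numerically as follows (a minimal sketch: the coefficient values, toy "parameters", and loss values are illustrative assumptions; in practice the second coefficient is typically close to 1):

```python
def total_loss(first_loss, second_loss, lam=1.0):
    """Total loss = first loss + first coefficient * second loss."""
    return first_loss + lam * second_loss

def ema_update(teacher_params, student_params, beta=0.99):
    """Exponential moving average: the teacher's second parameters slowly
    track the student's first parameters after each student update."""
    return [beta * t + (1.0 - beta) * s
            for t, s in zip(teacher_params, student_params)]

teacher = [0.0, 0.0]
student = [1.0, 2.0]   # e.g. after a (stochastic) gradient-descent step
teacher = ema_update(teacher, student, beta=0.5)
print(teacher)                         # [0.5, 1.0]
print(total_loss(0.5, 0.2, lam=0.5))   # 0.6
```

The teacher network is never updated by gradient descent itself; it only averages the student's parameter history, which is what makes its predictions a stable target for the unsupervised loss term.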
In this embodiment, the first loss constructed from the first student result and the first original label is a supervised loss term, which effectively ensures the fitting of the first images; the second loss constructed from the second student result and the second teacher result is an unsupervised loss term, which drives the second student result of the student network and the second teacher result of the teacher network closer together. In addition, updating the second parameters of the teacher network from the first parameters of the student network by the moving-average method makes the update simple and fast, while updating them by the exponential moving-average method makes the second parameters more accurate and credible.
Referring to fig. 3 and fig. 4 in combination, fig. 3 is a flowchart of an image classification method according to an embodiment of the present invention, and fig. 4 is a sub-flowchart of the image classification method according to the embodiment of the present invention. The image classification method specifically comprises the following steps.
Step S302, inputting the target image into an image classification model and acquiring a classification result. The image classification model is obtained by training the training method of the image classification model in the embodiment, and the image classification model can classify the target image. The image classification model comprises several sub-models. It can be understood that the target image is a label-free image, and the plurality of sub-models correspond to the plurality of optimal student networks formed in the training process one by one. Accordingly, the number of sub-models is the same as the number of optimal student networks. Inputting the image into the image classification model and obtaining the classification result specifically comprises the following steps.
Step S402, inputting the target image into each sub-model respectively and obtaining corresponding sub-results. In this embodiment, after the target image is adjusted to a preset size, the target image is respectively input into each sub-model and a corresponding sub-result is obtained. The sub-result is a label vector, the number of numerical values in the label vector is the same as the number of categories, namely the preset value, and each numerical value in the label vector corresponds to one category. It is understood that, in the training process, the image classification model adjusts the first image and the second image used for training to a preset size. When the target image is input into the image classification model, the target image also needs to be adjusted to a corresponding preset size.
And S404, obtaining a classification result according to the sub-results. In this embodiment, identical sub-results are counted, and the sub-result occurring most often is selected as the classification result. For example, the number of categories is 5, including A, B, C, D, E, and the number of sub-models is 5. The sub-results obtained after the same target image is input into each sub-model are (0.1, 0.8, 0.6, 0.3, 0.4), (0.7, 0.2, 0.6, 0.2, 0.1), (0.2, 0.9, 0.5, 0.4, 0.1), (0.4, 0.7, 0.5, 0.2, 0.1) and (0.6, 0.5, 0.4, 0.1, 0.1). The category corresponding to the maximum value in each sub-result is the category with the highest confidence, so the categories corresponding to the sub-results are B, A, B, B, A. Accordingly, category A receives 2 votes and category B receives 3 votes; since B receives more votes than A, category B is selected as the classification result.
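The voting procedure can be sketched directly from the worked example above (the category names and sub-result vectors are the ones given in the text):

```python
from collections import Counter

categories = ["A", "B", "C", "D", "E"]

def predict_by_vote(sub_results):
    """Take each sub-model's most confident category, then return the
    category receiving the most votes."""
    votes = [categories[r.index(max(r))] for r in sub_results]
    return Counter(votes).most_common(1)[0][0]

sub_results = [
    (0.1, 0.8, 0.6, 0.3, 0.4),   # -> B
    (0.7, 0.2, 0.6, 0.2, 0.1),   # -> A
    (0.2, 0.9, 0.5, 0.4, 0.1),   # -> B
    (0.4, 0.7, 0.5, 0.2, 0.1),   # -> B
    (0.6, 0.5, 0.4, 0.1, 0.1),   # -> A
]
print(predict_by_vote(sub_results))  # B (3 votes) beats A (2 votes)
```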
In the above embodiment, the classification result is obtained by inputting the target image into each sub-model to obtain corresponding sub-results and then counting the sub-results in a voting manner, i.e. counting the number of identical sub-results. Since each sub-model is trained on the same number of first and second images, the sub-models have equal importance, and so do their votes. The classification result can therefore be obtained by directly counting the votes, which effectively increases the accuracy of the image classification model.
In some possible embodiments, if all the first images were divided into subsets according to a certain proportion during training of the image classification model, the counting of identical sub-results must be weighted by the corresponding proportion. For example, suppose all the first images are divided into 5 subsets whose sizes are in the ratio 1:1.2:2:1:3, and the categories corresponding to the sub-results are B, A, B, B, A. Then category A receives 1.2 + 3 = 4.2 weighted votes and category B receives 1 + 2 + 1 = 4 weighted votes; since A's total exceeds B's, category A is selected as the classification result. That is, when counting identical sub-results, each vote starts from a base of 1, is multiplied by the corresponding ratio, and the weighted totals are then compared.
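The weighted variant can be sketched with the ratios from the example above (the vote list and weights are the ones given in the text):

```python
def predict_by_weighted_vote(votes, weights):
    """Weight each sub-model's vote by its subset's share of the first
    images, then pick the category with the largest weighted total."""
    totals = {}
    for category, w in zip(votes, weights):
        totals[category] = totals.get(category, 0.0) + w
    return max(totals, key=totals.get)

votes = ["B", "A", "B", "B", "A"]
weights = [1, 1.2, 2, 1, 3]   # subset-size ratio 1 : 1.2 : 2 : 1 : 3
print(predict_by_weighted_vote(votes, weights))  # A: 1.2 + 3 = 4.2 > 4 for B
```

With equal weights this reduces to the plain majority vote of the previous embodiment.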
Please refer to fig. 5, which is a schematic diagram of an internal structure of a terminal according to an embodiment of the present invention. The terminal 10 includes a computer-readable storage medium 11, a processor 12, and a bus 13. The computer-readable storage medium 11 includes at least one type of readable storage medium, including a flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, and the like. The computer-readable storage medium 11 may in some embodiments be an internal storage unit of the terminal 10, such as a hard disk of the terminal 10. In other embodiments, it may be an external storage device of the terminal 10, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the terminal 10. Further, the computer-readable storage medium 11 may include both an internal storage unit and an external storage device of the terminal 10. The computer-readable storage medium 11 may be used not only to store application software installed in the terminal 10 and various types of data, but also to temporarily store data that has been or will be output.
The bus 13 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
Further, the terminal 10 may also include a display assembly 14. The display component 14 may be a Light Emitting Diode (LED) display, a liquid crystal display, a touch-sensitive liquid crystal display, an Organic Light-Emitting Diode (OLED) touch panel, or the like. The display component 14 may also be referred to as a display device or display unit, as appropriate, for displaying information processed in the terminal 10 and for displaying a visual user interface, among other things.
Further, the terminal 10 may also include a communication component 15. The communication component 15 may optionally include a wired communication component and/or a wireless communication component, such as a WI-FI communication component, a bluetooth communication component, etc., typically used to establish a communication connection between the terminal 10 and other intelligent control devices.
The processor 12 may, in some embodiments, be a Central Processing Unit (CPU), controller, microcontroller, microprocessor or other data processing chip for executing program code stored in the computer-readable storage medium 11 or processing data. Specifically, the processor 12 executes the stored program to control the terminal 10 to implement the training method of the image classification model.
It is to be understood that fig. 5 only shows the terminal 10 with the components 11-15 and the training method for implementing the image classification model, and that those skilled in the art will appreciate that the structure shown in fig. 5 does not constitute a limitation of the terminal 10 and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, insofar as these modifications and variations of the invention fall within the scope of the claims of the invention and their equivalents, the invention is intended to include these modifications and variations.
The above-mentioned embodiments are only examples of the present invention and are not intended to limit its scope; the protection scope of the present invention is defined by the appended claims and their equivalents.