CN115375963A - Image recognition model training method and device based on multi-level labels - Google Patents

Image recognition model training method and device based on multi-level labels

Info

Publication number
CN115375963A
Authority
CN
China
Prior art keywords
label
sample image
stage
image recognition
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210888906.6A
Other languages
Chinese (zh)
Other versions
CN115375963B (en)
Inventor
王远航
王静宇
李蹊
李建华
李浩浩
张昆鹏
梅一多
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongguancun Smart City Co Ltd
Original Assignee
Zhongguancun Smart City Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongguancun Smart City Co Ltd filed Critical Zhongguancun Smart City Co Ltd
Priority to CN202210888906.6A priority Critical patent/CN115375963B/en
Publication of CN115375963A publication Critical patent/CN115375963A/en
Application granted granted Critical
Publication of CN115375963B publication Critical patent/CN115375963B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; blind source separation
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the disclosure provide an image recognition model training method and device based on multi-level labels. One embodiment of the method comprises: performing multi-level labeling on a sample image set to obtain a three-level label for each sample image in the set; constructing an initial image recognition network; and performing three stages of training at different learning rates, based on the sample image set and the three-level labels of its sample images, to obtain an image recognition model. This solves the problem of insufficient training samples while avoiding transfer mismatch.

Description

Image recognition model training method and device based on multi-level labels
Technical Field
The embodiments of the disclosure relate to the field of computer technology, and in particular to an image recognition model training method and device based on multi-level labels.
Background
Transfer learning is a machine learning method in which a model developed for a task A is taken as the starting point and reused in developing a model for a task B. It is widely applied in model training to cope with a shortage of training samples.
However, the inventors have found that training with transfer learning often runs into the following technical problems:
First, when developing the model for task B, the difference in type between task A and task B often causes transfer mismatch.
Second, with a staged training scheme it is difficult to set the learning rates of the different stages accurately, and at present they are generally set by experience.
Third, during the two-stage and three-stage training, it is hard to ensure that the previously learned features are both carried over well and learned further.
The information disclosed in this Background section is only for enhancing understanding of the background of the inventive concept, and may therefore contain information that does not form prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure propose a method, apparatus, device, and computer readable medium for training an image recognition model based on multi-level labels to solve one or more of the technical problems mentioned in the background section above.
In a first aspect, some embodiments of the present disclosure provide a method for training an image recognition model based on multi-level labels, including: performing multi-level labeling on a sample image set to obtain a three-level label for each sample image in the set, where the three-level label comprises a first-level label, a second-level label, and a third-level label; constructing an initial image recognition network comprising an input layer, an intermediate layer, and a first-level output layer; training the initial image recognition network at a first learning rate, based on the sample image set and the first-level labels of its sample images, to obtain a one-stage image recognition network; deleting the first-level output layer of the one-stage image recognition network and adding a first fully connected layer and a second-level output layer to obtain a processed one-stage image recognition network; training the processed one-stage image recognition network at a second learning rate, based on the sample image set and the first-level and second-level labels of its sample images, to obtain a two-stage image recognition network, where the second learning rate is smaller than the first learning rate; deleting the second-level output layer of the two-stage image recognition network and adding a second fully connected layer and a third-level output layer to obtain a processed two-stage image recognition network; and training the processed two-stage image recognition network at a third learning rate, based on the sample image set and the first-level, second-level, and third-level labels of its sample images, to obtain the image recognition model, where the third learning rate is smaller than the second learning rate.
In a second aspect, some embodiments of the present disclosure provide an apparatus for training an image recognition model based on multi-level labels, including: a labeling unit configured to perform multi-level labeling on a sample image set to obtain a three-level label for each sample image in the set, where the three-level label comprises a first-level label, a second-level label, and a third-level label; a construction unit configured to construct an initial image recognition network comprising an input layer, an intermediate layer, and a first-level output layer; a training unit configured to train the initial image recognition network at a first learning rate, based on the sample image set and the first-level labels of its sample images, to obtain a one-stage image recognition network; and a network processing unit configured to delete the first-level output layer of the one-stage image recognition network and add a first fully connected layer and a second-level output layer to obtain a processed one-stage image recognition network. The training unit is further configured to train the processed one-stage image recognition network at a second learning rate, based on the sample image set and the first-level and second-level labels of its sample images, to obtain a two-stage image recognition network, where the second learning rate is smaller than the first learning rate; the network processing unit is further configured to delete the second-level output layer of the two-stage image recognition network and add a second fully connected layer and a third-level output layer to obtain a processed two-stage image recognition network; and the training unit is further configured to train the processed two-stage image recognition network at a third learning rate, based on the sample image set and the first-level, second-level, and third-level labels of its sample images, to obtain the image recognition model, where the third learning rate is smaller than the second learning rate.
In a third aspect, some embodiments of the present disclosure provide an electronic device, comprising: one or more processors; a storage device, on which one or more programs are stored, which when executed by one or more processors cause the one or more processors to implement the method described in any implementation of the first aspect.
In a fourth aspect, some embodiments of the present disclosure provide a computer readable medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method described in any of the implementations of the first aspect.
The above embodiments of the present disclosure have the following advantages: by assigning three-level labels to the sample images and training in stages with those labels, the model can learn features at different levels, which solves the problem of insufficient training samples while avoiding transfer mismatch. Specifically, since labels at different levels are essentially classifications of the same images at different granularities, all stages belong to the same image classification task, and transfer mismatch between different tasks is avoided. In addition, by using different learning rates in different training stages, decreasing as the stages progress, the network can fully learn finer-grained semantic information, further improving the prediction accuracy of the model.
Drawings
The above and other features, advantages, and aspects of various embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that components are not necessarily drawn to scale.
FIG. 1 is a flow diagram of some embodiments of a multi-level label-based image recognition model training method according to the present disclosure;
FIG. 2 is a block diagram of some embodiments of a multi-level label based image recognition model training apparatus according to the present disclosure;
FIG. 3 is a schematic structural diagram of an electronic device suitable for use in implementing some embodiments of the present disclosure;
FIG. 4 is an exemplary schematic diagram of a three-level label.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions relevant to the invention are shown in the drawings. The embodiments in the present disclosure and the features of the embodiments may be combined with each other where there is no conflict.
It should be noted that terms such as "first" and "second" in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order of, or interdependence between, the functions these devices, modules, or units perform.
It is noted that the modifiers "a", "an", and "the" in this disclosure are illustrative rather than limiting, and those skilled in the art will understand them as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Referring to FIG. 1, a flow 100 of some embodiments of a multi-level label-based image recognition model training method according to the present disclosure is shown. The image recognition model training method based on multi-level labels comprises the following steps:
Step 101, performing multi-level labeling on the sample image set to obtain a three-level label of each sample image in the sample image set, wherein the three-level label comprises a first-level label, a second-level label, and a third-level label.
In some embodiments, the execution subject of the image recognition model training method based on multi-level labels may first label each sample image of the sample image set, by receiving manual input or by automatic clustering or the like, to obtain the three-level label of each sample image.
As an example, as shown in FIG. 4, in an application scenario of smart city management, the first-level labels may include two categories: city management and interpersonal management. The category corresponding to each first-level label contains the categories of at least one second-level label. Taking city management as an example, it contains the categories of three second-level labels: city appearance and environment, consumer rights, and illegal construction. Similarly, the category corresponding to each second-level label contains the categories of at least one third-level label. Taking city appearance and environment as an example, it contains the categories of three third-level labels: messy public toilets, road garbage, and small advertisements on walls.
Optionally, the number of neurons in the first-level output layer equals the number of categories of the first-level labels; for example, the first-level output layer here contains 2 neurons. Similarly, the number of neurons in the second-level output layer equals the number of categories of the second-level labels, and the number of neurons in the third-level output layer equals the number of categories of the third-level labels. The category of each first-level label contains the categories of at least one second-level label, and the category of each second-level label contains the categories of at least one third-level label.
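For concreteness, the FIG. 4 hierarchy can be represented as a nested mapping from which the output-layer sizes follow directly. The Python sketch below is illustrative only: the category names are translated from the example, and the branches the text does not enumerate are marked as assumptions.

    # Illustrative three-level label hierarchy following the FIG. 4 example.
    # Branches not spelled out in the text are placeholders (assumptions).
    LABEL_TREE = {
        "city management": {                                  # first-level label
            "city appearance and environment": [              # second-level label
                "messy public toilets",                       # third-level labels
                "road garbage",
                "small advertisements on walls",
            ],
            "consumer rights": ["counterfeit goods"],         # assumed leaves
            "illegal construction": ["unapproved building"],  # assumed leaves
        },
        "interpersonal management": {                         # first-level label
            "neighborhood disputes": ["noise complaints"],    # assumed branch
        },
    }

    # Output-layer sizes follow from the tree: one neuron per category.
    n_level1 = len(LABEL_TREE)                                       # 2
    n_level2 = sum(len(sub) for sub in LABEL_TREE.values())          # 4
    n_level3 = sum(len(leaves) for sub in LABEL_TREE.values()
                   for leaves in sub.values())                       # 6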
Step 102, an initial image recognition network is constructed, wherein the initial image recognition network comprises an input layer, a middle layer and a first-level output layer.
In some embodiments, the execution subject may take an existing network, or construct the initial image recognition network by adapting an existing network. For example, the VGG16 network may be determined to be the initial image recognition network. The initial image recognition network comprises an input layer, an intermediate layer, and a first-level output layer.
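As a minimal sketch of this step, assuming PyTorch/torchvision and VGG16 as the backbone (the text names VGG16 only as one possible choice), the final classifier layer is resized into the first-level output layer:

    import torch.nn as nn
    from torchvision.models import vgg16

    def build_initial_network(num_level1_classes: int = 2) -> nn.Module:
        # VGG16 supplies the input layer and the intermediate layers;
        # weights=None trains from scratch (pretrained weights are optional).
        net = vgg16(weights=None)
        # Resize the last classifier layer into the first-level output layer,
        # one neuron per first-level category.
        in_features = net.classifier[-1].in_features
        net.classifier[-1] = nn.Linear(in_features, num_level1_classes)
        return net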
Step 103, training the initial image recognition network at a first learning rate, based on the sample image set and the first-level labels corresponding to the sample images in the sample image set, to obtain a one-stage image recognition network.
In some embodiments, the execution subject may select a number of sample images from the sample image set to perform the one-stage training of the initial image recognition network. Specifically, a selected sample image is input into the initial image recognition network to obtain an actual recognition result. The difference between the actual recognition result and the first-level label is then determined by a predetermined loss function and backpropagated through the initial image recognition network to adjust the parameters of each layer. The new parameter (new_weight) of each layer can be calculated by the following formula:
new_weight = old_weight - learning_rate * gradient
where old_weight is the parameter before the update, learning_rate is the learning rate, and gradient is the gradient. The larger the learning rate, the faster the model converges; the smaller, the slower. The first learning rate may be denoted α1, and a relatively large value may be set to ensure fast convergence of the model.
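A minimal one-stage training loop under these definitions might look as follows; the loss function, the value of α1, and the epoch count are illustrative assumptions, while plain SGD matches the update rule above.

    import torch
    import torch.nn as nn

    def train_stage_one(net, loader, alpha1=0.01, epochs=5):
        criterion = nn.CrossEntropyLoss()  # stands in for the predetermined loss function
        optimizer = torch.optim.SGD(net.parameters(), lr=alpha1)
        for _ in range(epochs):
            for images, level1_labels in loader:
                optimizer.zero_grad()
                loss = criterion(net(images), level1_labels)  # difference vs. first-level labels
                loss.backward()                               # backpropagate the difference
                optimizer.step()                              # new_weight = old_weight - lr * gradient
        return net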
Step 104, deleting the first-level output layer of the one-stage image recognition network, and adding a first fully connected layer and a second-level output layer, to obtain the processed one-stage image recognition network.
In some embodiments, the execution subject may delete the first-level output layer of the one-stage image recognition network and add a first fully connected layer and a second-level output layer to obtain the processed one-stage image recognition network, so that the first fully connected layer can serve as a carrier for continuing to learn further features.
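One way to realize this surgery, continuing the torchvision VGG16 sketch above (the width of the first fully connected layer is an assumption; the text leaves the structure to actual needs):

    import torch.nn as nn

    def add_stage_two_head(net: nn.Module, num_level2_classes: int,
                           fc_width: int = 1024) -> nn.Module:
        in_features = net.classifier[-1].in_features
        # Replace the first-level output layer with the new first fully
        # connected layer followed by the second-level output layer.
        net.classifier[-1] = nn.Sequential(
            nn.Linear(in_features, fc_width),          # first fully connected layer
            nn.ReLU(inplace=True),
            nn.Linear(fc_width, num_level2_classes),   # second-level output layer
        )
        return net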
Step 105, training the processed one-stage image recognition network at a second learning rate, based on the sample image set and the first-level and second-level labels corresponding to the sample images in the sample image set, to obtain a two-stage image recognition network, wherein the second learning rate is smaller than the first learning rate.
In some embodiments, the two-stage training may then proceed. Here the second learning rate is smaller than the first learning rate, so features of a finer granularity can be learned.
For example, a sample image may be input into the processed one-stage image recognition network. The output of the intermediate layer is additionally fed into a Softmax function to obtain a first prediction result, while the second-level output layer outputs a second prediction result. A first difference between the first prediction result and the first-level label, and a second difference between the second prediction result and the second-level label, are then determined. After the first and second differences are superposed, the parameters of each layer are adjusted by backpropagation and stochastic gradient descent.
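In code, the superposed two-stage loss might look as follows. How the intermediate-layer output is reduced to first-level class scores is an assumption (an auxiliary head is implied); CrossEntropyLoss applies the softmax internally.

    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()  # log-softmax + NLL, matching the Softmax step

    def two_stage_loss(mid_logits, level2_logits, level1_labels, level2_labels):
        loss1 = criterion(mid_logits, level1_labels)     # first difference
        loss2 = criterion(level2_logits, level2_labels)  # second difference
        return loss1 + loss2                             # superpose, then backpropagate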
Step 106, deleting the second-level output layer of the two-stage image recognition network, and adding a second fully connected layer and a third-level output layer, to obtain the processed two-stage image recognition network.
In some embodiments, similar to step 104, the structure of the two-stage image recognition network is adjusted, adding a second fully connected layer as the carrier for finer-grained learning. The specific structures of the first and second fully connected layers can be set according to actual needs.
Step 107, training the processed two-stage image recognition network at a third learning rate, based on the sample image set and the first-level, second-level, and third-level labels corresponding to the sample images in the sample image set, to obtain an image recognition model, wherein the third learning rate is smaller than the second learning rate.
Similar to step 105, the output of the intermediate layer can additionally be fed into a Softmax function to obtain a first actual output result, and the output of the first fully connected layer can additionally be fed into a Softmax function to obtain a second actual output result; the output of the third-level output layer is the third actual output result. On this basis, the differences between the first actual output result and the first-level label, between the second actual output result and the second-level label, and between the third actual output result and the third-level label are determined. These differences are then superposed, and the parameters of each layer are adjusted by backpropagation and stochastic gradient descent.
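The three-stage counterpart simply adds a third branch, under the same assumptions as the two-stage sketch above:

    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()

    def three_stage_loss(mid_logits, fc1_logits, level3_logits,
                         level1_labels, level2_labels, level3_labels):
        return (criterion(mid_logits, level1_labels)        # vs. first-level label
                + criterion(fc1_logits, level2_labels)      # vs. second-level label
                + criterion(level3_logits, level3_labels))  # vs. third-level label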
In some embodiments, by assigning three-level labels to the sample images and training in stages with those labels, the model can learn features at different levels, solving the problem of insufficient training samples while avoiding transfer mismatch. Specifically, since labels at different levels are essentially classifications of the same images at different granularities, all stages belong to the same image classification task, avoiding transfer mismatch between different tasks. In addition, using learning rates that decrease from stage to stage lets the network fully learn finer-grained semantic information, further improving the prediction accuracy of the model.
To address the second technical problem in the background, namely that with a staged training scheme it is difficult to set the learning rates of the different stages accurately, some embodiments of the present disclosure, as an inventive point, set the learning rates in the following manner, which both ensures that finer-grained features are learned and avoids overly slow model convergence.
Let the first learning rate be α1, the second learning rate α2, and the third learning rate α3, with

α2 = α1 / (1 + m·t1),
α3 = α2 / (1 + n·t2),

where m is a hyperparameter controlling how strongly the learning rate is slowed, which can be set to a constant according to the actual situation, and t1 is the number of training rounds of the one-stage training. Because α2 references the number of one-stage training rounds, it decreases smoothly, neither too fast nor too slow. Similarly, n is a hyperparameter controlling the slowdown, which can be set to a constant different from m, and t2 is the number of training rounds of the two-stage training.
In these implementations, the learning rate within a stage is fixed, and the decrease from one stage to the next depends on the number of training rounds of the previous stage. In practice, the more rounds the previous stage trained, the more fully its features were learned, so a relatively smaller learning rate can be set for the current stage, letting the model learn finer-grained features. Conversely, if the number of rounds was small and learning insufficient, a relatively larger learning rate is needed to accelerate model convergence.
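The schedule is straightforward to compute; the values of α1, m, n, and the round counts below are illustrative, not from the text.

    def stage_two_lr(alpha1, m, t1):
        return alpha1 / (1 + m * t1)    # alpha2 = alpha1 / (1 + m * t1)

    def stage_three_lr(alpha2, n, t2):
        return alpha2 / (1 + n * t2)    # alpha3 = alpha2 / (1 + n * t2)

    alpha1 = 0.01
    alpha2 = stage_two_lr(alpha1, m=0.05, t1=20)    # more stage-one rounds -> smaller alpha2
    alpha3 = stage_three_lr(alpha2, n=0.05, t2=30)
    print(alpha1, alpha2, alpha3)                   # 0.01 0.005 0.002, strictly decreasing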
In some embodiments, before constructing the initial image recognition network, the method further comprises: for the three-level label of each sample image, determining whether the first-level label of the sample image contains the second-level label of the sample image, and whether the second-level label contains the third-level label; in response to determining that the first-level label contains the second-level label and the second-level label contains the third-level label, outputting prompt information that the three-level label of the sample image is correct; and in response to determining that the first-level label does not contain the second-level label, or that the second-level label does not contain the third-level label, outputting prompt information that the three-level label of the sample image is incorrect, and correcting the three-level label of each sample image to obtain a verified three-level label. Here "contains" means that the category of the higher-level label includes the category of the lower-level label.
In some embodiments, correcting the three-level label of each sample image to obtain a verified three-level label includes: clustering the sample image set three times, respectively using the number of categories of the first-level labels, the number of categories of the second-level labels, and the number of categories of the third-level labels as the number of clusters, to obtain three clustering results comprising a first, a second, and a third clustering result; and verifying based on the first, second, and third clustering results, so as to correct the three-level label of each sample image and obtain the verified three-level label.
In some embodiments, the verification based on the first, second, and third clustering results proceeds as follows: determine whether the first-level label of each sample image is consistent with the first-level label corresponding to the cluster containing that image in the first clustering result, and if not, take the cluster's first-level label as the image's first-level label; determine whether the second-level label is consistent with the second-level label corresponding to the cluster containing the image in the second clustering result, and if not, take the cluster's second-level label as the image's second-level label; and determine whether the third-level label is consistent with the third-level label corresponding to the cluster containing the image in the third clustering result, and if not, take the cluster's third-level label as the image's third-level label.
In this way, the labels can be corrected, improving the prediction accuracy of the model.
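A sketch of one clustering pass follows. It assumes scikit-learn's KMeans, pre-extracted image feature vectors, and a majority-vote mapping from clusters to labels; none of these choices is specified by the text.

    import numpy as np
    from sklearn.cluster import KMeans

    def correct_labels(features: np.ndarray, labels: np.ndarray, n_classes: int) -> np.ndarray:
        # Cluster with this level's category count as the number of clusters.
        clusters = KMeans(n_clusters=n_classes, n_init=10).fit_predict(features)
        corrected = labels.copy()   # labels are integer category indices
        for c in range(n_classes):
            members = clusters == c
            # The label associated with a cluster: the majority label of its
            # members. Samples whose label disagrees are re-labelled to it.
            corrected[members] = np.bincount(labels[members]).argmax()
        return corrected

    # Run once per level: first-, second-, and third-level labels each get a pass.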
To address the third technical problem in the background, namely how to ensure during the two-stage and three-stage training that the previously learned features are carried over well and learned further, some embodiments of the present disclosure, as an inventive point, perform the two-stage training through the following steps.
That is, training the processed one-stage image recognition network at a second learning rate, based on the sample image set and the first-level and second-level labels of its sample images, to obtain a two-stage image recognition network, where the second learning rate is smaller than the first learning rate, comprises the following steps:
step one, inputting a sample image into a post-processing stage image recognition network to obtain an actual classification result.
Step two: set a loss weight for each layer of the processed one-stage image recognition network.
By setting a loss weight for each layer, the model can be made to focus on different parts of the network at different stages: on the intermediate layer in the first stage, on the first fully connected layer in the second stage, and on the second fully connected layer in the third stage. That is, during the two-stage training, the parameters of the intermediate layer are only fine-tuned while the parameters of the first fully connected layer are mainly adjusted, so the loss weight of the intermediate layer can be set small and that of the first fully connected layer large. The intermediate-layer parameters are still adjusted somewhat, which can further optimize them, but their small loss weight means only fine-tuning, so the previously learned features are not destroyed. A balance is thus found between knowledge transfer and further learning. In practice, the loss weight of each layer at each stage can be obtained by looking up a loss weight table.
During the three-stage training, the parameters of the intermediate layer and the first fully connected layer are fine-tuned while the parameters of the second fully connected layer are mainly adjusted. In this way, the training results of the previous stages are preserved.
Step three: update the parameters of each layer in the processed one-stage image recognition network according to the actual classification result, the loss weight of each layer, and the second learning rate, until a training end condition is reached, obtaining the two-stage image recognition network.
Similar to the two-stage training described above, the three-stage training sets the loss weights of the intermediate layer and the first fully connected layer small and the loss weight of the second fully connected layer large. Learning thus focuses on the second fully connected layer, so the features learned before are retained while being fine-tuned to adapt to the new task.
It will be appreciated that during the staged training, the output results of the intermediate layer and the first fully connected layer can be obtained via the Softmax function, so that training can use the multi-level labels simultaneously. Because the multi-level labels of the same sample have containment relationships, training with them lets the model learn those relationships, which increases the prediction accuracy of the model. The loss-weight pattern is sketched below.
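The numeric weights in this sketch stand in for the loss weight table mentioned above and are assumptions, as is assigning one loss term per branch.

    # Per-stage loss weights: small weights keep the earlier branches to
    # fine-tuning; a large weight focuses learning on the newly added branch.
    LOSS_WEIGHTS = {
        2: {"middle": 0.2, "fc1": 0.8},              # two-stage training
        3: {"middle": 0.1, "fc1": 0.2, "fc2": 0.7},  # three-stage training
    }

    def weighted_loss(stage, loss_middle, loss_fc1, loss_fc2=None):
        w = LOSS_WEIGHTS[stage]
        total = w["middle"] * loss_middle + w["fc1"] * loss_fc1
        if loss_fc2 is not None:
            total = total + w["fc2"] * loss_fc2      # only in three-stage training
        return total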
With further reference to FIG. 2, as an implementation of the method shown in the figures above, the present disclosure provides some embodiments of an apparatus for training an image recognition model based on multi-level labels. These apparatus embodiments correspond to the method embodiments shown in FIG. 1, and the apparatus may be applied in various electronic devices.
As shown in FIG. 2, the multi-level label-based image recognition model training apparatus 200 of some embodiments includes: a labeling unit 201 configured to perform multi-level labeling on the sample image set to obtain a three-level label for each sample image in the set, where the three-level label comprises a first-level label, a second-level label, and a third-level label; a construction unit 202 configured to construct an initial image recognition network comprising an input layer, an intermediate layer, and a first-level output layer; a training unit 203 configured to train the initial image recognition network at a first learning rate, based on the sample image set and the first-level labels of its sample images, to obtain a one-stage image recognition network; and a network processing unit 204 configured to delete the first-level output layer of the one-stage image recognition network and add a first fully connected layer and a second-level output layer to obtain a processed one-stage image recognition network. The training unit 203 is further configured to train the processed one-stage image recognition network at a second learning rate, based on the sample image set and the first-level and second-level labels of its sample images, to obtain a two-stage image recognition network, where the second learning rate is smaller than the first learning rate; the network processing unit 204 is further configured to delete the second-level output layer of the two-stage image recognition network and add a second fully connected layer and a third-level output layer to obtain a processed two-stage image recognition network; and the training unit 203 is further configured to train the processed two-stage image recognition network at a third learning rate, based on the sample image set and the first-level, second-level, and third-level labels of its sample images, to obtain the image recognition model, where the third learning rate is smaller than the second learning rate.
It will be appreciated that the units described in the apparatus 200 correspond to the steps of the method described with reference to FIG. 1. Thus, the operations, features, and advantages described above for the method also apply to the apparatus 200 and the units it contains, and are not repeated here.
Referring now to FIG. 3, a block diagram of an electronic device 300 suitable for use in implementing some embodiments of the present disclosure is shown. The electronic device shown in fig. 3 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 3, the electronic device 300 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 301 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 302 or a program loaded from a storage means 308 into a Random Access Memory (RAM) 303. In the RAM 303, various programs and data necessary for the operation of the electronic apparatus 300 are also stored. The processing device 301, the ROM 302, and the RAM 303 are connected to each other via a bus 304. An input/output (I/O) interface 305 is also connected to bus 304.
Generally, the following devices may be connected to the I/O interface 305: input devices 306 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, or the like; an output device 307 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 308 including, for example, magnetic tape, hard disk, etc.; and a communication device 309. The communication means 309 may allow the electronic device 300 to communicate wirelessly or by wire with other devices to exchange data. While fig. 3 illustrates an electronic device 300 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may be alternatively implemented or provided. Each block shown in fig. 3 may represent one device or may represent multiple devices, as desired.
In particular, according to some embodiments of the present disclosure, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, some embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In some such embodiments, the computer program may be downloaded and installed from a network through the communication device 309, or installed from the storage device 308, or installed from the ROM 302. The computer program, when executed by the processing apparatus 301, performs the above-described functions defined in the methods of some embodiments of the present disclosure.
It should be noted that the computer readable medium described in some embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In some embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In some embodiments of the present disclosure, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be included in the electronic device, or may exist separately without being assembled into the electronic device. Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages or combinations thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
The foregoing description covers only preferred embodiments of the disclosure and illustrates the technical principles employed. Those skilled in the art will appreciate that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combinations of the above features, and also covers other technical solutions formed by arbitrary combinations of the above features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) features with similar functions disclosed in the embodiments of the present disclosure.

Claims (8)

1. An image recognition model training method based on multi-level labels, comprising:
performing multi-level labeling on a sample image set to obtain a three-level label for each sample image in the sample image set, wherein the three-level label comprises a first-level label, a second-level label, and a third-level label;
constructing an initial image recognition network, wherein the initial image recognition network comprises an input layer, an intermediate layer, and a first-level output layer;
training the initial image recognition network at a first learning rate, based on the sample image set and the first-level labels corresponding to the sample images in the sample image set, to obtain a one-stage image recognition network;
deleting the first-level output layer of the one-stage image recognition network, and adding a first fully connected layer and a second-level output layer, to obtain a processed one-stage image recognition network;
training the processed one-stage image recognition network at a second learning rate, based on the sample image set and the first-level and second-level labels corresponding to the sample images in the sample image set, to obtain a two-stage image recognition network, wherein the second learning rate is smaller than the first learning rate;
deleting the second-level output layer of the two-stage image recognition network, and adding a second fully connected layer and a third-level output layer, to obtain a processed two-stage image recognition network; and
training the processed two-stage image recognition network at a third learning rate, based on the sample image set and the first-level, second-level, and third-level labels corresponding to the sample images in the sample image set, to obtain the image recognition model, wherein the third learning rate is smaller than the second learning rate.
2. The method of claim 1, wherein the number of neurons in the first-level output layer equals the number of categories corresponding to the first-level label, the number of neurons in the second-level output layer equals the number of categories corresponding to the second-level label, and the number of neurons in the third-level output layer equals the number of categories corresponding to the third-level label.
3. The method of claim 2, wherein, before the constructing an initial image recognition network, the method further comprises:
for the three-level label of each sample image, determining whether the first-level label of the sample image contains the second-level label of the sample image and whether the second-level label of the sample image contains the third-level label of the sample image;
in response to determining that the first-level label of the sample image contains the second-level label of the sample image and the second-level label of the sample image contains the third-level label of the sample image, outputting prompt information that the three-level label of the sample image is correct; and
in response to determining that the first-level label of the sample image does not contain the second-level label of the sample image, or that the second-level label of the sample image does not contain the third-level label of the sample image, outputting prompt information that the three-level label of the sample image is incorrect, and correcting the three-level label of the sample image to obtain a verified three-level label.
4. The method of claim 3, wherein the correcting the three-level label of the sample image to obtain a verified three-level label comprises:
clustering the sample image set three times, respectively using the number of categories corresponding to the first-level label, the number of categories corresponding to the second-level label, and the number of categories corresponding to the third-level label as the number of clusters, to obtain three clustering results, wherein the three clustering results comprise a first clustering result, a second clustering result, and a third clustering result; and
verifying based on the first clustering result, the second clustering result, and the third clustering result, so as to correct the three-level label of the sample image and obtain the verified three-level label.
5. The method of claim 4, wherein the verifying based on the first clustering result, the second clustering result, and the third clustering result to correct the three-level label of the sample image and obtain a verified three-level label comprises:
for the first-level label of each sample image, determining whether the first-level label is consistent with the first-level label corresponding to the cluster containing the sample image in the first clustering result, and if not, taking the first-level label corresponding to that cluster as the first-level label of the sample image;
for the second-level label of the sample image, determining whether the second-level label is consistent with the second-level label corresponding to the cluster containing the sample image in the second clustering result, and if not, taking the second-level label corresponding to that cluster as the second-level label of the sample image; and
for the third-level label of the sample image, determining whether the third-level label is consistent with the third-level label corresponding to the cluster containing the sample image in the third clustering result, and if not, taking the third-level label corresponding to that cluster as the third-level label of the sample image.
6. An image recognition model training apparatus based on multi-level labels, comprising:
a labeling unit configured to perform multi-level labeling on a sample image set to obtain a three-level label for each sample image in the sample image set, wherein the three-level label comprises a first-level label, a second-level label, and a third-level label;
a construction unit configured to construct an initial image recognition network comprising an input layer, an intermediate layer, and a first-level output layer;
a training unit configured to train the initial image recognition network at a first learning rate, based on the sample image set and the first-level labels corresponding to the sample images in the sample image set, to obtain a one-stage image recognition network; and
a network processing unit configured to delete the first-level output layer of the one-stage image recognition network and add a first fully connected layer and a second-level output layer, to obtain a processed one-stage image recognition network;
wherein the training unit is further configured to train the processed one-stage image recognition network at a second learning rate, based on the sample image set and the first-level and second-level labels corresponding to the sample images in the sample image set, to obtain a two-stage image recognition network, wherein the second learning rate is smaller than the first learning rate;
the network processing unit is further configured to delete the second-level output layer of the two-stage image recognition network and add a second fully connected layer and a third-level output layer, to obtain a processed two-stage image recognition network; and
the training unit is further configured to train the processed two-stage image recognition network at a third learning rate, based on the sample image set and the first-level, second-level, and third-level labels corresponding to the sample images in the sample image set, to obtain the image recognition model, wherein the third learning rate is smaller than the second learning rate.
7. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-5.
8. A computer-readable medium, on which a computer program is stored, wherein the program, when executed by a processor, implements the method of any one of claims 1-5.
CN202210888906.6A 2022-07-27 2022-07-27 Image recognition model training method and device based on multi-level labels Active CN115375963B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210888906.6A CN115375963B (en) 2022-07-27 2022-07-27 Image recognition model training method and device based on multi-level labels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210888906.6A CN115375963B (en) 2022-07-27 2022-07-27 Image recognition model training method and device based on multi-level labels

Publications (2)

Publication Number Publication Date
CN115375963A true CN115375963A (en) 2022-11-22
CN115375963B CN115375963B (en) 2023-04-18

Family

ID=84064662

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210888906.6A Active CN115375963B (en) 2022-07-27 2022-07-27 Image recognition model training method and device based on multi-level labels

Country Status (1)

Country Link
CN (1) CN115375963B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583506A (en) * 2018-12-06 2019-04-05 哈尔滨工业大学 A kind of unsupervised image-recognizing method based on parameter transfer learning
CN111931592A (en) * 2020-07-16 2020-11-13 苏州科达科技股份有限公司 Object recognition method, device and storage medium
CN113592033A (en) * 2021-08-20 2021-11-02 中科星睿科技(北京)有限公司 Oil tank image recognition model training method, oil tank image recognition method and oil tank image recognition device

Also Published As

Publication number Publication date
CN115375963B (en) 2023-04-18

Similar Documents

Publication Publication Date Title
CN110807515A (en) Model generation method and device
CN111523640B (en) Training method and device for neural network model
CN110688528B (en) Method, apparatus, electronic device, and medium for generating classification information of video
CN111368545B (en) Named entity recognition method and device based on multitask learning
US11416743B2 (en) Swarm fair deep reinforcement learning
CN111340220A (en) Method and apparatus for training a predictive model
CN111133458A (en) Enhancing neural networks
CN112259079A (en) Method, device, equipment and computer readable medium for speech recognition
CN115062617A (en) Task processing method, device, equipment and medium based on prompt learning
CN112527436A (en) Popup display method, popup display device, electronic equipment and computer readable medium
CN115375963B (en) Image recognition model training method and device based on multi-level labels
CN115062119B (en) Government affair event handling recommendation method and device
US11842290B2 (en) Using functions to annotate a syntax tree with real data used to generate an answer to a question
CN116245595A (en) Method, apparatus, electronic device and computer readable medium for transporting supply end article
CN114926234A (en) Article information pushing method and device, electronic equipment and computer readable medium
CN111582456B (en) Method, apparatus, device and medium for generating network model information
CN114898184A (en) Model training method, data processing method and device and electronic equipment
CN113920397A (en) Method and device for training image classification model and method and device for image classification
CN113361677A (en) Quantification method and device of neural network model
CN116107666B (en) Program service flow information generation method, device, electronic equipment and computer medium
US20230394351A1 (en) Intelligent Data Ingestion
CN117974188A (en) Data set acquisition method, device, electronic equipment and computer readable medium
CN116049210A (en) Information processing method, device, terminal and storage medium
CN117557822A (en) Image classification method, apparatus, electronic device, and computer-readable medium
CN115269830A (en) Abnormal text detection model training method, abnormal text detection method and abnormal text detection device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant