CN114612743A - Deep learning model training method, target object identification method and device - Google Patents
- Publication number
- CN114612743A (application CN202210234795.7A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- target object
- enhanced
- determining
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F18/214 — Pattern recognition; Analysing; Design or setup of recognition systems or techniques; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/241 — Pattern recognition; Analysing; Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/253 — Pattern recognition; Analysing; Fusion techniques of extracted features
- G06N3/045 — Computing arrangements based on biological models; Neural networks; Architecture, e.g. interconnection topology; Combinations of networks
- G06N3/08 — Computing arrangements based on biological models; Neural networks; Learning methods
Abstract
The present disclosure provides a training method and apparatus for a deep learning model, a target object recognition method and apparatus, an electronic device, a storage medium, and a computer program product, which relate to the field of artificial intelligence, and in particular to the fields of deep learning, image recognition, and computer vision. The specific implementation scheme is as follows: obtaining a fusion feature map according to an initial vector feature map of a sample image, wherein the sample image comprises a target object and a label of the target object; obtaining a first classification difference value of the target object according to the fusion feature map and the label; determining an enhanced image corresponding to the sample image according to the first classification difference value and the initial vector feature map; and training the deep learning model using the enhanced image.
Description
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly, to the fields of deep learning, image recognition, and computer vision technology. More specifically, the present disclosure provides a training method and apparatus for a deep learning model, a target object recognition method and apparatus, an electronic device, a storage medium, and a computer program product.
Background
In the field of face forgery detection, a conventional approach may adopt a binary classification network structure to perform face forgery detection. However, during training such a network model either needs to introduce a large number of network training parameters, which increases the amount of calculation, or ignores the actual adaptability between the training data and the model, resulting in low detection accuracy.
Disclosure of Invention
The disclosure provides a training method and device for a deep learning model, a target object identification method and device, an electronic device, a storage medium and a computer program product.
According to an aspect of the present disclosure, there is provided a training method of a deep learning model, including:
obtaining a fusion feature map according to an initial vector feature map of a sample image, wherein the sample image comprises a target object and a label of the target object;
obtaining a first classification difference value of the target object according to the fusion feature map and the label;
determining an enhanced image corresponding to the sample image according to the first classification difference value and the initial vector feature map; and
training the deep learning model using the enhanced image.
According to another aspect of the present disclosure, there is provided a target object identification method, including:
inputting an image to be recognized into a deep learning model to obtain a recognition result of a target object in the image to be recognized, wherein the deep learning model is trained using the method described above.
According to another aspect of the present disclosure, there is provided a training apparatus for a deep learning model, including:
the fusion module is used for obtaining a fusion feature map according to an initial vector feature map of a sample image, wherein the sample image comprises a target object and a label of the target object;
the calculation module is used for obtaining a first classification difference value of the target object according to the fusion feature map and the label;
the enhancement module is used for determining an enhanced image corresponding to the sample image according to the first classification difference value and the initial vector feature map; and
the training module is used for training the deep learning model using the enhanced image.
According to another aspect of the present disclosure, there is provided a target object recognition apparatus, including:
the recognition module is used for inputting an image to be recognized into a deep learning model to obtain a recognition result of a target object in the image to be recognized, wherein the deep learning model is trained using the training apparatus for a deep learning model described above.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method described above.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the method as described above.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture of a training method, a target object recognition method and apparatus that may be applied to a deep learning model according to embodiments of the present disclosure;
FIG. 2 is a flow diagram of a method of training a deep learning model according to an embodiment of the present disclosure;
FIG. 3 is a flow diagram of a method of obtaining a fused feature map according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram of a method of determining an enhanced image according to an embodiment of the present disclosure;
FIG. 5A is a flow diagram of a method of determining an enhanced image according to another embodiment of the present disclosure;
FIG. 5B is a schematic diagram of a method of determining an enhanced image according to another embodiment of the present disclosure;
FIG. 6 is a flow diagram of a method of training a deep learning model according to an embodiment of the present disclosure;
FIG. 7A is a flowchart of a method of determining a second classification difference value of a target object according to an embodiment of the present disclosure;
FIG. 7B is a flowchart of a method of determining a second classification difference value of a target object according to another embodiment of the present disclosure;
FIGS. 8A-8C are schematic diagrams of a training method of a deep learning model according to embodiments of the present disclosure;
FIG. 9 is a flow chart diagram of a target object identification method according to an embodiment of the present disclosure;
FIG. 10 is a block diagram of a training apparatus for deep learning models, according to an embodiment of the present disclosure;
FIG. 11 is a block diagram of a target object recognition apparatus according to an embodiment of the present disclosure;
FIG. 12 is a block diagram of an electronic device for implementing the training method of a deep learning model and the target object recognition method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is an exemplary system architecture of a training method, a target object recognition method and apparatus that may be applied to a deep learning model according to embodiments of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios. For example, in another embodiment, an exemplary system architecture to which the training method of the deep learning model, the target object recognition method, and the apparatus may be applied may include a terminal device, but the terminal device may implement the training method of the deep learning model, the target object recognition method, and the apparatus provided in the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. Network 104 is the medium used to provide communication links between terminal devices 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications may be installed on the terminal devices 101, 102, 103. Such as a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client, or social platform software, etc. (by way of example only).
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be any of various types of servers that provide various services. For example, the server 105 may be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that addresses the drawbacks of high management difficulty and weak service scalability of conventional physical hosts and VPS (Virtual Private Server) services. The server 105 may also be a server of a distributed system or a server that incorporates a blockchain.
It should be noted that the training method of the deep learning model provided by the embodiment of the present disclosure may be generally performed by the terminal device 101, 102, or 103. Correspondingly, the training device for the deep learning model provided by the embodiment of the disclosure can also be arranged in the terminal device 101, 102, or 103.
Alternatively, the training method of the deep learning model provided by the embodiment of the present disclosure may also be generally performed by the server 105. Accordingly, the training device for the deep learning model provided by the embodiment of the present disclosure may be generally disposed in the server 105. The training method of the deep learning model provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the training device for the deep learning model provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be noted that the target object identification method provided by the embodiment of the present disclosure may be generally executed by the server 105. Accordingly, the target object recognition apparatus provided by the embodiments of the present disclosure may be generally disposed in the server 105. The target object identification method provided by the embodiments of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the target object recognition apparatus provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
Alternatively, the target object identification method provided by the embodiment of the present disclosure may also be generally executed by the terminal device 101, 102, or 103. Accordingly, the target object recognition apparatus provided by the embodiment of the present disclosure may also be disposed in the terminal device 101, 102, or 103.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
It should be noted that the sequence numbers of the respective operations in the following methods are merely used as representations of the operations for description, and should not be construed as representing the execution order of the respective operations. The method need not be performed in the exact order shown, unless explicitly stated.
Fig. 2 is a flow chart of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in FIG. 2, the training method 200 of the deep learning model includes operations S210 to S240. The deep learning model comprises a feature extraction module, a pooling module and a full-connection module. The model may be combined with a classification network for identifying target objects in the sample image.
In operation S210, a fused feature map is obtained according to the initial vector feature map of the sample image, where the sample image includes the target object and the label of the target object.
The sample image may be any one or more frames of images in a video stream acquired by a camera, or may be acquired in other manners, which is not limited in this disclosure.
It can be understood that the sample image includes different objects and corresponding features thereof, and the degree of authenticity of the features of the target object in the sample image can be predicted and analyzed by using the deep learning model, that is, the recognition result for the target object is obtained. For example, the target object may refer to a facial feature in the sample image, and the recognition result for the target object may refer to, for example, predicting and analyzing authenticity of the facial feature in the sample image using a deep learning model to obtain a recognition result about the facial feature, that is, the facial feature is a real face or a fake face. In the embodiment of the present disclosure, the target object in the sample image may be selected according to the actual situation.
In the embodiment of the present disclosure, the labels of the target objects correspond to the target objects one to one, and the label added to the target object is used to represent the authenticity of the target object in the sample image, and for convenience of representation, a label value may be used to represent the authenticity of the target object in the sample image, for example, if the target object is true, the label value may take 1, and if the target object is forged, the label value may take 0. For example, in an example where the target object is a face, the tag value may take 0 when the target object is a false face; when the target object is a real face, the tag value may take 1.
In the embodiments of the present disclosure, the label of the target object in the sample image may be used to calculate the classification difference value of the target object. And the sample image added with the label can be used for training the deep learning model so that the deep learning model obtains the capability of distinguishing true from false.
The initial vector feature map may be obtained by inputting the sample image into a feature extraction module, and inputting the initial vector feature map into a pooling module may obtain a fused feature map.
In operation S220, a first classification difference value of the target object is obtained according to the fusion feature map and the label.
For example, the fused feature map may be input into a fully connected module, and the output of the fully connected module may be input into a classification network for prediction, so as to obtain an initial classification result value for the target object. According to the initial classification result value and the label value corresponding to the label of the target object, a first classification difference value of the target object can be obtained. In one example, the classification network may be, for example, an EfficientNet-B4 classification network, or other classification networks may be used, which is not limited by this disclosure.
In the embodiment of the present disclosure, the following formula (1) may be adopted to calculate the first classification difference value of the target object in the sample image:
In formula (1), L_r^i denotes the first classification difference value of the target object in the i-th sample image, y_i denotes the label value corresponding to the label of the target object in the i-th sample image, and C_r^i denotes the initial classification result value of the target object in the i-th sample image.
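As an illustration only (not part of the original patent text), the following sketch shows this per-sample computation under the assumption that the classification difference takes the standard binary cross-entropy form between the label value and the initial classification result value; the function name is ours.

```python
import torch

def first_classification_difference(c_r: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-sample first classification difference value.

    c_r: initial classification result values (assumed probabilities of 'real'), shape [N]
    y:   label values (1 = real, 0 = forged), shape [N]

    Assumes a binary cross-entropy form; the patent only states that the value
    is computed from the initial classification result value and the label value.
    """
    eps = 1e-7
    c_r = c_r.clamp(eps, 1.0 - eps)
    return -(y * torch.log(c_r) + (1.0 - y) * torch.log(1.0 - c_r))

# usage: one value per sample in the batch
losses = first_classification_difference(torch.tensor([0.9, 0.2]), torch.tensor([1.0, 0.0]))
```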
In operation S230, an enhanced image corresponding to the sample image is determined according to the first classified difference value and the initial vector feature map.
According to the first classified difference value and the initial vector feature map, an enhanced image corresponding to the sample image can be determined, and the enhanced image can be used for training a deep learning model.
In operation S240, a deep learning model is trained using the enhanced image.
In embodiments of the present disclosure, training samples used to train a deep learning model may contain multiple batches of sample images, each batch including multiple sample images. For a plurality of sample images included in a batch, a plurality of enhanced images corresponding to the plurality of sample images may be obtained according to the methods of operations S210 to S230, and the deep learning model may be trained by using the enhanced images until the model converges, so that the deep learning model may distinguish the authenticity feature in the image.
In the scheme of the embodiment of the disclosure, the enhanced image is determined by using the first classification difference value and the initial vector feature of the target object, and the deep learning model is trained by using the enhanced image, in the process, a large number of network training parameters are not required to be introduced, so that the calculated amount in the model training process is reduced, and the deep learning model is trained by using the enhanced image, so that the model learning is more robust in counterfeit feature, and the accuracy of model identification is improved.
Fig. 3 is a flowchart of a method of acquiring a fused feature map according to an embodiment of the present disclosure, and the method of acquiring the fused feature map will be exemplarily described below with reference to fig. 3.
As shown in fig. 3, the method of acquiring the fused feature map includes operations S311 to S313.
In operation S311, a pooling operation is performed on the initial vector feature map to obtain a first pooled feature map.
The pooling operation may be, for example, an average pooling operation, a maximum pooling operation, or other pooling operations, and is not particularly limited.
Taking the average pooling operation as an example, performing average pooling on an initial vector feature map F of size H × W × C yields the first pooled feature map, which may be represented as formula (2):
E1_k = (1 / (H × W)) · Σ_{i=0}^{H-1} Σ_{j=0}^{W-1} F_{i,j,k}    (2)
In formula (2), E1_k denotes the k-th channel of the first pooled feature map and F_{i,j,k} denotes the element of the initial vector feature map at spatial position (i, j) and channel k, where 0 ≤ i ≤ H-1, 0 ≤ j ≤ W-1, and 0 ≤ k ≤ C-1.
As can be seen from the above formula, performing average pooling on the initial vector feature map F of size H × W × C is equivalent to decomposing F into H × W one-dimensional vectors (denoted G_{i,j}), G_{i,j} ∈ R^C, and averaging them. That is, for each 0 ≤ i ≤ H-1, 0 ≤ j ≤ W-1 and 0 ≤ k ≤ C-1:
(G_{i,j})_k = F_{i,j,k}    (3)
In formula (3), (G_{i,j})_k denotes the k-th component of the vector G_{i,j}; intuitively, F can be viewed as a cube of length H, width W and height C, with G_{i,j} being the fiber of C channel values at spatial position (i, j).
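As an illustration only, the following short sketch (our own, not part of the patent text) checks numerically that average pooling of an H × W × C map is the same as averaging the H × W one-dimensional vectors G_{i,j} of formula (3):

```python
import torch

H, W, C = 4, 4, 8
F_map = torch.randn(H, W, C)          # initial vector feature map F

# first pooled feature map E1_k: average over the spatial positions
E1 = F_map.mean(dim=(0, 1))           # shape [C]

# decomposition into H*W one-dimensional vectors G_{i,j} in R^C
G = F_map.reshape(H * W, C)           # (G_{i,j})_k = F_{i,j,k}
assert torch.allclose(E1, G.mean(dim=0))
```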
In operation S312, the first pooled feature map is processed with the first weight to obtain a weighted pooled feature map.
In the embodiment of the present disclosure, the first weight (denoted as U) is a weight of the fully-connected module obtained after training with the sample image of the previous batch, and the first weight indicates a distribution probability of the feature of the target object. The larger the first weight is, the higher the degree of importance of the feature of the target object is indicated. In the disclosed embodiment, the length of the first weight U is k.
According to the embodiment of the present disclosure, processing the first pooled feature map with the first weight means weighting the vectors G_{i,j} with the first weight U and obtaining the weighted pooled feature map based on the weighted vectors.
Weighting the vectors G_{i,j} with the first weight U yields a weighted score, as given by formula (4):
In formula (4), p_{i,j} denotes the weighted score for each 0 ≤ i ≤ H-1 and 0 ≤ j ≤ W-1, and <·,·> denotes the vector inner product.
In the embodiment of the present disclosure, p_{i,j} can be regarded as the probability of the distribution of the forged features of the target object, with Σ_{i,j} p_{i,j} = 1.
In an embodiment of the present disclosure, the weighted pooled feature map may be represented as:
In formula (5), β denotes a hyperparameter.
In the embodiment of the present disclosure, β may be set according to practical situations, and is not particularly limited.
In operation S313, a fused feature map is obtained according to the first pooled feature map and the weighted pooled feature map.
Specifically, the first pooled feature map and the weighted pooled feature map may be fused to obtain a fused feature map. The fused feature map may be represented as:
Based on the above G_{i,j} and F_{i,j,k}, the fused feature map may equivalently be represented as:
in the embodiment of the disclosure, the first pooling feature map is processed by using the first weight, and the first pooling feature map and the weighted pooling feature map are fused to obtain a fused feature map.
Fig. 4 is a flow chart of a method of determining an enhanced image according to an embodiment of the present disclosure. An exemplary implementation of the method of determining an enhanced image will be described below with reference to fig. 4.
As shown in fig. 4, the method of determining an enhanced image includes operations S431 to S434.
In operation S431, an attention feature map is determined according to the initial vector feature map and the first weight.
In the embodiment of the present disclosure, an Attention feature Map (AM) is used to represent the degree of interest of the model in the local region.
According to an embodiment of the present disclosure, the attention feature map may be determined from the initial vector feature map and the first weight, as given by formula (8).
In formula (8), AM_{i,j} denotes the attention feature map.
In the embodiment of the present disclosure, the attention feature map AM_{i,j} is a matrix of size H × W, the same size as the sample image. Since the attention feature map represents the degree of interest of the model in local regions, the larger an element value in the matrix, the more sensitive the corresponding region and the higher the model's interest in it; accordingly, the most sensitive region of the attention feature map can be used as the region of interest (ROI) for determining the enhanced image.
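As an illustration, a minimal sketch of operation S431, under the assumption that formula (8) takes the inner product of the first weight with each spatial vector G_{i,j} of the initial vector feature map (upsampling to the sample-image resolution is omitted here):

```python
import torch

def attention_feature_map(F_map: torch.Tensor, U: torch.Tensor) -> torch.Tensor:
    """Attention feature map AM of size H x W (operation S431).

    Assumes formula (8) is the inner product <U, G_{i,j}> at every spatial
    position, i.e. the first weight applied channel-wise to the initial
    vector feature map; the patent text does not reproduce the formula.
    """
    return torch.einsum('hwc,c->hw', F_map, U)

AM = attention_feature_map(torch.randn(7, 7, 512), torch.randn(512))  # shape [7, 7]
```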
In operation S432, a predetermined number of elements having the largest values in the attention feature map is determined.
According to an embodiment of the present disclosure, the predetermined number may be determined according to a selection threshold, in other words, a most sensitive region of the attention feature map may be selected as the region of interest according to the selection threshold.
According to an embodiment of the present disclosure, the selection threshold may be determined according to an average first classification difference value of N first classification difference values of N sample images in one lot.
Specifically, the selection threshold may be calculated by the following formula:
in formula (9), η represents the selection threshold, N represents the number of sample images in one batch,the first classification difference value is expressed, and in formula (10), α is a hyper parameter, for example, α is 0.1.
In this embodiment of the disclosure, the N first classification difference values of the N sample images in one batch may be obtained by using the first classification difference value calculation formula, and details are not repeated here.
It can be known from the above formula that when the first classification difference value is large, the model has not learned a valid counterfeit feature yet, and accordingly, the selection threshold is small, and vice versa. The most sensitive region of the attention feature map is selected as the region of interest according to the selection threshold, so that the size of the region of interest (i.e. the size of the enhancement matrix) can be adaptively controlled, and the degree of enhancement of the sample image can be controlled.
In the embodiments of the present disclosure, determining the predetermined number according to the selection threshold may refer to, for example, determining a proportion of elements that need to be enhanced in an attention feature map (e.g., a matrix with a size of H × W) according to the selection threshold. For example, H × W elements in the attention feature map may be sorted by numerical size, and the top 20% (for example only) sorted elements may be selected according to a selection threshold to determine the enhancement matrix.
In operation S433, an enhancement matrix is obtained according to a predetermined number of elements.
Since the predetermined number of elements may be smaller than the size of the sample image, in order to facilitate subsequent calculation, zeros may be padded at the remaining positions when obtaining the enhancement matrix from the predetermined number of elements, so that the size of the enhancement matrix coincides with the size of the sample image.
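As an illustration of operations S432-S433, the following sketch selects the most sensitive positions of the attention feature map and zero-pads the remaining positions; expressing the selection threshold as a keep ratio and using a binary mask for the enhancement matrix are our assumptions, since formulas (9) and (10) are not reproduced in the text.

```python
import torch

def enhancement_matrix(AM: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Build the enhancement matrix T from the attention feature map AM.

    Assumptions: the selection threshold eta is expressed here as the fraction
    of largest attention values to keep, and T is a binary mask of the same
    size as AM with ones at the selected positions and zeros elsewhere
    (the zero padding mentioned for the remaining positions).
    """
    k = max(1, int(keep_ratio * AM.numel()))
    flat = AM.flatten()
    topk_idx = flat.topk(k).indices
    T = torch.zeros_like(flat)
    T[topk_idx] = 1.0
    return T.reshape(AM.shape)

# usage: keep the top 20% most sensitive positions as the region of interest
T = enhancement_matrix(AM=torch.rand(224, 224), keep_ratio=0.2)
```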
In operation S434, an enhanced image is determined according to the enhancement matrix and the sample image.
An enhanced image may be determined from the enhancement matrix and the sample image, and the enhanced image may be used to train the deep learning model.
A method of determining an enhanced image is explained below with reference to fig. 5A and 5B.
Fig. 5A is a flow diagram of a method of determining an enhanced image according to another embodiment of the present disclosure.
As shown in fig. 5A, the method of determining an enhanced image includes operations S5341 to S5342.
In operation S5341, a sample image is smoothed to obtain a smoothed sample image.
In the embodiment of the present disclosure, the sample image may be smoothed by any one or more image smoothing methods, such as Gaussian smoothing; the smoothing method is not limited herein.
In operation S5342, an enhanced image is determined from the smoothed sample image, the sample image, and the enhancement matrix.
In the embodiment of the disclosure, before determining the enhanced image, a complementary matrix corresponding to the regions other than the region of interest may be determined from the enhancement matrix; the region characterized by the complementary matrix excludes the potential forged region. The enhanced image may then be determined from the complementary matrix, the smoothed sample image, the sample image, and the enhancement matrix.
In the disclosed embodiment, the following formula (11) may be employed to determine the enhanced image:
x̃_i = T ⊙ Blur(x_i) + T′ ⊙ x_i    (11)
In formula (11), x̃_i denotes the enhanced image, T denotes the enhancement matrix, T′ denotes the complementary matrix, Blur(x_i) denotes the smoothed sample image, x_i denotes the sample image, and ⊙ denotes the Hadamard product.
Based on the formula, the enhanced image can be further accurately determined, so that the accuracy of model training is improved.
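A minimal sketch of formula (11), assuming the complementary matrix is T′ = 1 − T, a binary enhancement matrix, and Gaussian blur as the smoothing method (the patent leaves the smoothing method open):

```python
import torch
from torchvision.transforms import GaussianBlur

def enhanced_image(x: torch.Tensor, T: torch.Tensor) -> torch.Tensor:
    """Enhanced image D = T ⊙ Blur(x) + T′ ⊙ x  (formula (11), FIG. 5B).

    x: sample image tensor of shape [C, H, W]
    T: enhancement matrix of shape [H, W] (binary mask over the region of interest)

    Assumptions: the complementary matrix is T′ = 1 - T, and Gaussian blur
    stands in for the smoothing step, whose exact method the patent leaves open.
    """
    blur = GaussianBlur(kernel_size=9, sigma=3.0)
    smoothed = blur(x)                       # Blur(x_i)
    return T * smoothed + (1.0 - T) * x      # Hadamard products, then sum

D = enhanced_image(torch.rand(3, 224, 224), T=torch.zeros(224, 224))
```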
Fig. 5B is a schematic diagram of a method of determining an enhanced image according to another embodiment of the present disclosure.
As shown in fig. 5B, in an example in which the target object is a face, an enhancement matrix T characterizing the most sensitive forged region in the sample image R is determined based on the above-described method, from which a complementary matrix T' can be determined.
According to the enhanced-image calculation formula, the Hadamard product between the enhancement matrix T and the smoothed sample image M and the Hadamard product between the sample image R and the complementary matrix T' are computed respectively, and the two Hadamard products are added to obtain the enhanced image D. Most of the semantic information of the sample image R is preserved in the enhanced image D, and the enhanced image D enhances the features of the potential forged region (shown by the dotted line) in the sample image R, so that when the model is subsequently trained with the enhanced image D, it can learn more stable and accurate features.
Fig. 6 is a flow diagram of a method of training a deep learning model according to an embodiment of the present disclosure. An exemplary implementation of the method of training the deep learning model will be described below with reference to fig. 6.
As shown in fig. 6, the method of training the deep learning model includes operations S641 to S643. The deep learning model comprises a feature extraction module, a pooling module and a full-connection module.
In operation S641, an enhanced fusion feature map is obtained according to the enhanced vector feature map of the enhanced image.
The enhanced vector feature map may be obtained by inputting the enhanced image into a feature extraction module, and inputting the enhanced vector feature map into a pooling module may obtain an enhanced fusion feature map.
In the embodiment of the present disclosure, the manner of obtaining the enhanced vector feature map and the enhanced fusion feature map is similar to the manner of obtaining the initial vector feature map and the fused feature map, and is not described here again.
In operation S642, a second classification difference value of the target object is obtained according to the enhanced fusion feature map.
For example, the enhanced fusion feature map may be input into the fully-connected module, and the output of the fully-connected module is input into the classification network, so as to obtain an enhanced classification result value for the target object, and a second classification difference value of the target object may be determined according to the enhanced classification result value.
In operation S643, parameters of the feature extraction module, the pooling module, and the full-connection module are adjusted according to the second classification difference value.
In the embodiment of the disclosure, the second classification difference value of the target object is determined based on the enhanced image, and the parameters of each module of the deep learning model are adjusted according to the second classification difference value, so that when the enhanced image is used for training the model, the model can learn more stable and accurate characteristics, and the model identification accuracy is improved.
An example implementation of operation S642 described above will be described below with reference to fig. 7A and 7B.
Fig. 7A is a flowchart of a method of determining a second classification difference value of a target object according to an embodiment of the present disclosure.
As shown in fig. 7A, the method of determining the second classification difference value of the target object includes operations S7421 to S7422.
In operation S7421, an enhanced classification result value of the target object is determined according to the enhanced fusion feature map.
For example, the enhanced fusion feature map may be input into a fully connected module, and the output of the fully connected module is input into a classification network, so as to obtain an enhanced classification result value of the target object.
In operation S7422, a second classification difference value of the target object is determined according to the initial classification result value and the enhanced classification result value.
As already described, the initial classification result value for the target object can be obtained by inputting the fused feature map into the fully connected module and inputting the output of the fully connected module into the classification network.
According to an embodiment of the present disclosure, a second classification difference value of the target object may be determined according to a relative entropy between the initial classification result value and the enhanced classification result value.
Specifically, the relative entropy between the initial classification result value and the enhanced classification result value can be calculated using the following formula (12):
In formula (12), L_PC denotes the relative entropy between the initial classification result value and the enhanced classification result value, N denotes the number of sample images in one batch, C_r^i denotes the initial classification result value of the i-th sample image, and C_d^i denotes the enhanced classification result value of the i-th sample image.
In the embodiment of the present disclosure, the relative entropy between the initial classification result value and the enhanced classification result value may be used as the second classification difference value of the target object, so as to adjust the parameters of each module of the deep learning model.
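As an illustration, a sketch of this relative-entropy computation under the assumption that the classification result values are probabilities of the 'real' class, so the relative entropy is taken between the corresponding two-class distributions; the exact form of formula (12) is not reproduced in the text.

```python
import torch

def relative_entropy_loss(c_r: torch.Tensor, c_d: torch.Tensor) -> torch.Tensor:
    """Second classification difference as the relative entropy between the
    initial classification result values c_r and the enhanced classification
    result values c_d, averaged over the N samples in a batch.

    Assumes the result values are probabilities of the 'real' class, so the
    relative entropy is taken between the two Bernoulli distributions.
    """
    eps = 1e-7
    c_r = c_r.clamp(eps, 1 - eps)
    c_d = c_d.clamp(eps, 1 - eps)
    kl = c_r * torch.log(c_r / c_d) + (1 - c_r) * torch.log((1 - c_r) / (1 - c_d))
    return kl.mean()

L_pc = relative_entropy_loss(torch.tensor([0.9, 0.3]), torch.tensor([0.8, 0.4]))
```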
Fig. 7B is a flowchart of a method of determining a second classification difference value of a target object according to another embodiment of the present disclosure.
As shown in fig. 7B, the method of determining the second classification difference value of the target object includes operations S7423 to S7424.
In operation S7423, an enhanced classification result value of the target object is determined according to the enhanced fusion feature map.
In this operation, determining the enhanced classification result value of the target object is the same as or similar to the method described above, and is not described herein again.
In operation S7424, a second classification difference value of the target object is determined according to the label and the enhanced classification result value.
And obtaining a second classification difference value of the target object according to the enhanced classification result value and the label value corresponding to the label of the target object.
In the embodiment of the present disclosure, the following formula (13) may be adopted to calculate the second classification difference value of the target object in the sample image:
in the formula (13), the first and second groups,a second classification difference value, y, representing the target object in the jth sample imageiA label value corresponding to a label representing a target object in the ith sample image,and representing the enhanced classification result value of the target object in the ith sample image.
Based on the formula, the second classification difference value of the target object can be accurately determined, so that the accuracy of model training is improved.
Fig. 8A, 8B, and 8C are schematic diagrams of a training method of a deep learning model according to an embodiment of the present disclosure, and a scheme of the present disclosure will be explained below with reference to fig. 8A to 8C.
As shown in FIGS. 8A-8C, in an embodiment of the present disclosure, the deep learning model 800 includes a feature extraction module 810 (810′), a pooling module 820 (820′), a fully connected module 830 (830′), and a classifier 840 (840′). The training method of the deep learning model according to the embodiment of the present disclosure trains the feature extraction module 810, the pooling module 820, the fully connected module 830 and the classifier 840 until these modules converge.
As shown in FIG. 8A, the feature extraction module 810 performs a feature extraction operation on the sample image R, which contains the target object and the label of the target object, to obtain an initial vector feature map Fr of the sample image R. The pooling module 820 performs a pooling operation on the initial vector feature map Fr to obtain a first pooled feature map, the first pooled feature map is processed with the first weight U to obtain a weighted pooled feature map, and the first pooled feature map and the weighted pooled feature map are fused to obtain a fused feature map Er. The fused feature map Er is input into the fully connected module 830, and the output of the fully connected module 830 is input into the classifier 840 to obtain an initial classification result value Cr of the target object. A first classification difference value Lr is calculated 850 from the initial classification result value Cr and the label value corresponding to the label of the target object. The enhanced image D can then be determined 860 from the first classification difference value Lr, the first weight U, and the initial vector feature map Fr obtained above. The enhanced image D enhances the potential forged features in the pooled feature map, so that the model focuses more on these features, improving the accuracy of model training.
As shown in FIG. 8B, after the enhanced image D is determined, the feature extraction module 810′ may be used to perform a feature extraction operation on the enhanced image D, resulting in an enhanced vector feature map Fd. The enhanced vector feature map Fd and the first weight U are input into the pooling module 820′ to obtain an enhanced fused feature map Ed, which is obtained in a manner similar to that of the fused feature map Er and is not described in detail here. The enhanced fused feature map Ed is input into the fully connected module 830′, and the output of the fully connected module 830′ is input into the classifier 840′ to obtain an enhanced classification result value Cd of the target object. From the enhanced classification result value Cd and the label of the target object, a second classification difference value Ld can be calculated 870 according to formula (13), and the calculated second classification difference value Ld can be used to adjust the parameters of the feature extraction module 810, the pooling module 820, the fully connected module 830 and the classifier 840.
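To tie the pieces of FIGS. 8A and 8B together, the following condensed sketch shows one training step. The stand-in backbone, the softmax-based fusion, the binary enhancement mask, and the Gaussian blur are our assumptions (as in the earlier sketches), not the patent's exact formulas; the second classification difference here follows the formula (13) variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

class ForgeryTrainer(nn.Module):
    """Condensed sketch of one training step from FIGS. 8A-8B (assumptions as noted above)."""
    def __init__(self, channels: int = 512):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, kernel_size=3, stride=32, padding=1)  # stand-in feature extractor
        self.fc = nn.Linear(channels, 1)                 # fully-connected module; its weight plays the role of U
        self.blur = GaussianBlur(kernel_size=9, sigma=3.0)

    def fuse(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: [N, C, H, W] -> fused feature map [N, C]
        G = feat.flatten(2).transpose(1, 2)              # [N, H*W, C] vectors G_{i,j}
        U = self.fc.weight.squeeze(0).detach()           # first weight from the fully-connected module
        E1 = G.mean(dim=1)                               # average pooling
        p = torch.softmax(G @ U, dim=1)                  # weighted scores, sum to 1
        E2 = (p.unsqueeze(-1) * G).sum(dim=1)            # weighted pooled feature map
        return E1 + E2

    def training_step(self, x: torch.Tensor, y: torch.Tensor, keep_ratio: float = 0.2):
        # FIG. 8A: sample image -> fused feature map -> first classification difference Lr
        feat_r = self.backbone(x)
        c_r = torch.sigmoid(self.fc(self.fuse(feat_r))).squeeze(1)
        l_r = F.binary_cross_entropy(c_r, y)

        # attention feature map and enhancement matrix (sketch assumptions)
        U = self.fc.weight.squeeze(0).detach()
        AM = torch.einsum('nchw,c->nhw', feat_r.detach(), U)
        AM = F.interpolate(AM.unsqueeze(1), size=x.shape[-2:], mode='bilinear', align_corners=False)
        k = max(1, int(keep_ratio * AM[0].numel()))
        thresh = AM.flatten(1).topk(k, dim=1).values[:, -1].view(-1, 1, 1, 1)
        T = (AM >= thresh).float()

        # enhanced image D = T ⊙ Blur(x) + (1 - T) ⊙ x
        D = T * self.blur(x) + (1 - T) * x

        # FIG. 8B: enhanced image -> second classification difference Ld, used to update all modules
        feat_d = self.backbone(D)
        c_d = torch.sigmoid(self.fc(self.fuse(feat_d))).squeeze(1)
        l_d = F.binary_cross_entropy(c_d, y)
        return l_r, l_d

trainer = ForgeryTrainer()
l_r, l_d = trainer.training_step(torch.rand(2, 3, 224, 224), torch.tensor([1.0, 0.0]))
(l_r + l_d).backward()
```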
Fig. 8C is a diagram illustrating an example of a method of determining a second classification difference value of a target object according to another embodiment.
As shown in fig. 8C, after obtaining the enhanced classification result value Cd of the target object, the relative entropy between the initial classification result value Cr and the enhanced classification result value Cd may be calculated 870' according to the enhanced classification result value Cd and the initial classification result value Cr and according to the formula (12). In the embodiment of the present disclosure, the above-mentioned relative entropy may be used as a second classification difference value Ld' for adjusting parameters of the feature extraction module 810, the pooling module 820, the full-connection module 830 and the classifier 840.
In the scheme of the embodiment of the disclosure, the enhanced image is determined by using the first classification difference value and the initial vector characteristic of the target object, and the deep learning model is trained by using the enhanced image, in the process, a large number of network training parameters are not required to be introduced, so that the calculated amount in the model training process is reduced, and the deep learning model is trained by using the enhanced image, so that the model learning is more robust in counterfeit characteristics, and the accuracy of model identification is improved.
According to the embodiment of the disclosure, the deep learning model can be trained by using a plurality of batches of sample images until the model converges. The process of training the model using each sample image is the same as or similar to the process described above and will not be described in detail here. The trained deep learning model may be used for recognition of a target object, and a target object recognition method will be described below with reference to fig. 9.
Fig. 9 is a flowchart of a target object identification method according to an embodiment of the present disclosure.
As shown in fig. 9, the target object recognition method 900 includes operations S910 to S930.
In operation S910, an image to be recognized is acquired.
The image to be recognized may be any one or more frames of images in a video stream acquired by a camera, or may be acquired in other manners, which is not limited in this disclosure.
According to an embodiment of the present disclosure, the image to be recognized includes one or more target objects, which may include, for example, a face, or other objects, and is not limited herein.
In operation S920, a deep learning model is acquired.
According to the embodiment of the present disclosure, the deep learning model referred to herein is trained based on the training method of the deep learning model described in any one of the above embodiments.
In operation S930, the image to be recognized is input into the deep learning model to obtain a recognition result of the target object in the image to be recognized.
The image to be recognized is input into the deep learning model to obtain the recognition result of the target object in the image to be recognized. In an example where the target object is a face, the recognition result referred to here may include a real face and a fake face.
In the scheme of the embodiment of the disclosure, the image to be recognized is recognized by using the deep learning model obtained by training in the above manner, so that the model can learn more robust counterfeit features, and the accuracy of model recognition is improved.
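As an illustration of operations S910-S930, a minimal inference sketch assuming the trained model outputs a probability that the face is real and that a 0.5 decision threshold is used (neither is fixed by the patent):

```python
import torch

@torch.no_grad()
def recognize(model: torch.nn.Module, image: torch.Tensor, threshold: float = 0.5) -> str:
    """Operations S910-S930: feed the image to be recognized into the trained
    deep learning model and report the recognition result for the target object.

    Assumes the model outputs a single logit for the 'real face' class and that
    a 0.5 probability threshold is used; the patent does not fix these details.
    """
    model.eval()
    score = torch.sigmoid(model(image.unsqueeze(0))).item()
    return "real face" if score >= threshold else "fake face"

# usage with a stand-in model (for illustration only)
dummy = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1))
print(recognize(dummy, torch.rand(3, 224, 224)))
```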
Fig. 10 is a block diagram of a training apparatus for a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 10, the training apparatus 1000 for deep learning model includes a fusion module 1010, a calculation module 1020, an enhancement module 1030, and a training module 1040.
The fusion module 1010 is configured to obtain a fusion feature map according to the initial vector feature map of the sample image, where the sample image includes the target object and the label of the target object.
The calculating module 1020 is configured to obtain a first classification difference value of the target object according to the fusion feature map and the label.
The enhancement module 1030 is configured to determine an enhanced image corresponding to the sample image according to the first classification difference value and the initial vector feature map.
The training module 1040 is configured to train the deep learning model using the enhanced image.
According to an embodiment of the present disclosure, the fusion module 1010 includes a pooling sub-module, a weighting sub-module, and a first fusion sub-module.
The pooling submodule is used for performing a pooling operation on the initial vector feature map to obtain a first pooled feature map.
The weighting submodule is used for processing the first pooled feature map with the first weight to obtain a weighted pooled feature map.
The first fusion submodule is used for obtaining a fused feature map according to the first pooled feature map and the weighted pooled feature map.
According to an embodiment of the present disclosure, the above-described first weight indicates a distribution probability of the feature of the target object.
According to an embodiment of the present disclosure, the boost module 1030 includes an attention sub-module, a selection sub-module, a generation sub-module, and a boost sub-module.
The attention submodule is used for determining an attention feature map according to the initial vector feature map and the first weight.
The selection submodule is used for determining the predetermined number of elements having the largest values in the attention feature map.
The generation submodule is used for obtaining an enhancement matrix according to the predetermined number of elements.
The enhancement submodule is used for determining an enhanced image according to the enhancement matrix and the sample image.
According to an embodiment of the disclosure, the selection submodule comprises a selection unit for determining the predetermined number in dependence on a selection threshold.
According to an embodiment of the present disclosure, the sample image includes N sample images, N being an integer greater than 1. The apparatus 1000 further comprises a first determining module and a second determining module.
The first determining module is configured to determine an average first classification difference value of the N first classification difference values of the N sample images.
The second determining module is configured to determine the selection threshold according to the average first classification difference value.
According to an embodiment of the present disclosure, the enhancer module includes a smoothing unit and an enhancement unit.
The smoothing unit is used for smoothing the sample image to obtain a smoothed sample image, and the enhancement unit is used for determining an enhanced image according to the smoothed sample image, the sample image and the enhancement matrix.
According to an embodiment of the present disclosure, the deep learning model includes a feature extraction module, a pooling module and a full-connection module, and the training module 1040 includes a second fusion submodule, a calculation submodule and an adjustment submodule.
The second fusion submodule is used for obtaining an enhanced fusion feature map according to the enhanced vector feature map of the enhanced image.
The calculation submodule is used for obtaining a second classification difference value of the target object according to the enhanced fusion feature map and the label.
The adjusting submodule is used for adjusting parameters of the feature extraction module, the pooling module and the full-connection module according to the second classification difference value.
According to an embodiment of the present disclosure, the calculation submodule includes a first calculation unit and a second calculation unit.
The first calculation unit is used for determining an enhanced classification result value of the target object according to the enhanced fusion feature map.
The second calculating unit is used for determining a second classification difference value of the target object according to the initial classification result value and the enhanced classification result value, and the initial classification result value is obtained based on the fusion feature map.
According to an embodiment of the present disclosure, the second calculating unit includes a calculating subunit configured to determine a second classification difference value of the target object according to a relative entropy between the initial classification result value and the enhanced classification result value.
According to an embodiment of the present disclosure, the calculation submodule includes a third calculation unit and a fourth calculation unit.
The third calculating unit is used for determining an enhanced classification result value of the target object according to the enhanced fusion feature map.
The fourth calculating unit is used for determining a second classification difference value of the target object according to the label and the enhanced classification result value.
Fig. 11 is a block diagram of a target object recognition apparatus according to an embodiment of the present disclosure.
As shown in fig. 11, the target object recognition apparatus 1100 includes a first obtaining module 1110, a second obtaining module 1120, and a recognition module 1130.
The first obtaining module 1110 is configured to obtain an image to be recognized.
The second obtaining module 1120 is configured to obtain a deep learning model, where the deep learning model is obtained by training based on a deep learning model training apparatus in any one of the above embodiments.
The recognition module 1130 is configured to input the image to be recognized into the deep learning model, so as to obtain a recognition result of the target object in the image to be recognized.
According to the embodiment of the present disclosure, the target object includes a face, and the recognition result includes a real face and a fake face.
It should be noted that the implementations, the technical problems solved, the functions achieved, and the technical effects obtained by the modules, units, and subunits in the apparatus embodiments are the same as or similar to those of the corresponding steps in the method embodiments, and are not repeated here.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the device 1200 includes a computing unit 1201 that can perform various appropriate actions and processes in accordance with a computer program stored in a read-only memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a random access memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the device 1200 may also be stored. The computing unit 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
A number of components in the device 1200 are connected to the I/O interface 1205, including: an input unit 1206, such as a keyboard, a mouse, or the like; an output unit 1207, such as various types of displays, speakers, and the like; a storage unit 1208, such as a magnetic disk, an optical disk, or the like; and a communication unit 1209, such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1209 allows the device 1200 to exchange information/data with other devices via a computer network such as the Internet and/or various telecommunication networks.
The computing unit 1201 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1201 performs the respective methods and processes described above, such as the training method of the deep learning model and the target object recognition method. For example, in some embodiments, the training method of the deep learning model and the target object recognition method may be implemented as computer software programs tangibly embodied on a machine-readable medium, such as the storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM 1203 and executed by the computing unit 1201, one or more steps of the training method of the deep learning model and the target object recognition method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the deep learning model and the target object recognition method.
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.
Claims (29)
1. A training method of a deep learning model comprises the following steps:
obtaining a fusion feature map according to an initial vector feature map of a sample image, wherein the sample image comprises a target object and a label of the target object;
obtaining a first classification difference value of the target object according to the fusion feature map and the label;
determining an enhanced image corresponding to the sample image according to the first classification difference value and the initial vector feature map; and
training the deep learning model using the enhanced image.
2. The method of claim 1, wherein the obtaining a fusion feature map according to the initial vector feature map of the sample image comprises:
performing pooling operation on the initial vector feature map to obtain a first pooled feature map;
processing the first pooling feature map by using a first weight to obtain a weighted pooling feature map; and
obtaining the fusion feature map according to the first pooling feature map and the weighted pooling feature map.
3. The method of claim 2, wherein the first weight indicates a distribution probability of a feature of a target object.
4. The method of claim 2, wherein the determining an enhanced image corresponding to the sample image according to the first classification difference value and the initial vector feature map comprises:
determining an attention feature map according to the initial vector feature map and the first weight;
determining a predetermined number of elements with larger numerical values in the attention feature map;
obtaining an enhancement matrix according to the predetermined number of elements; and
determining the enhanced image according to the enhancement matrix and the sample image.
5. The method of claim 4, wherein the determining a predetermined number of elements with larger numerical values in the attention feature map comprises:
determining the predetermined number according to a selection threshold.
6. The method of claim 5, wherein the sample images comprise N sample images, N being an integer greater than 1; the method further comprises the following steps:
determining an average first classification difference value of the N first classification difference values of the N sample images; and
determining the selection threshold according to the average first classification difference value.
7. The method of claim 4, wherein the determining the enhanced image from the enhancement matrix and the sample image comprises:
smoothing the sample image to obtain a smoothed sample image; and
determining the enhanced image from the smoothed sample image, the sample image, and the enhancement matrix.
8. The method of claim 1, wherein the deep learning model comprises a feature extraction module, a pooling module, and a fully connected module; the training the deep learning model using the enhanced image comprises:
obtaining an enhanced fusion feature map according to the enhanced vector feature map of the enhanced image;
obtaining a second classification difference value of the target object according to the enhanced fusion feature map and the label; and
adjusting parameters of the feature extraction module, the pooling module and the fully connected module according to the second classification difference value.
9. The method according to claim 8, wherein the obtaining of the second classification difference value of the target object according to the enhanced fusion feature map comprises:
determining an enhanced classification result value of the target object according to the enhanced fusion feature map;
and determining a second classification difference value of the target object according to an initial classification result value and the enhanced classification result value, wherein the initial classification result value is obtained based on the fusion feature map.
10. The method of claim 9, wherein the determining a second classification difference value for the target object based on the initial classification result value and the enhanced classification result value comprises:
determining a second classification difference value of the target object according to the relative entropy between the initial classification result value and the enhanced classification result value.
11. The method according to claim 8, wherein the obtaining of the second classification difference value of the target object according to the enhanced fusion feature map comprises:
determining an enhanced classification result value of the target object according to the enhanced fusion feature map; and
determining a second classification difference value of the target object according to the label and the enhanced classification result value.
12. A target object identification method, comprising:
inputting an image to be recognized into a deep learning model to obtain a recognition result of a target object in the image to be recognized,
wherein the deep learning model is trained using the method of any one of claims 1 to 11.
13. The method of claim 12, wherein the target object includes a face, and the recognition result indicates a real face or a fake face.
14. A training apparatus for deep learning models, comprising:
the fusion module is used for obtaining a fusion characteristic diagram according to an initial vector characteristic diagram of a sample image, wherein the sample image comprises a target object and a label of the target object;
the calculation module is used for obtaining a first classification difference value of the target object according to the fusion feature map and the label;
the enhancement module is used for determining an enhanced image corresponding to the sample image according to the first classified difference value and the initial vector feature map; and
the training module is used for training the deep learning model by utilizing the enhanced image.
15. The apparatus of claim 14, wherein the fusion module comprises:
the pooling submodule is used for performing pooling operation on the initial vector feature map to obtain a first pooled feature map;
the weighting submodule is used for processing the first pooling feature map by using a first weight to obtain a weighted pooling feature map; and
the first fusion submodule is used for obtaining the fusion feature map according to the first pooling feature map and the weighted pooling feature map.
16. The apparatus of claim 15, wherein the first weight indicates a distribution probability of a feature of a target object.
17. The apparatus of claim 15, wherein the enhancement module comprises:
the attention submodule is used for determining an attention feature map according to the initial vector feature map and the first weight;
a selection submodule for determining a predetermined number of elements of the attention feature map having a larger numerical value;
a generation submodule for obtaining an enhancement matrix according to the predetermined number of elements; and
an enhancement submodule for determining the enhanced image according to the enhancement matrix and the sample image.
18. The apparatus of claim 17, wherein the selection submodule comprises:
a selection unit for determining the predetermined number according to a selection threshold.
19. The apparatus of claim 18, wherein the sample images comprise N sample images, N being an integer greater than 1; the device further comprises:
a first determining module for determining an average first classification difference value of N first classification difference values of the N sample images; and
the second determining module is used for determining the selection threshold according to the average first classification difference value.
20. The apparatus of claim 17, wherein the enhancement submodule comprises:
the smoothing unit is used for smoothing the sample image to obtain a smoothed sample image; and
an enhancement unit configured to determine the enhanced image according to the smoothed sample image, the sample image, and the enhancement matrix.
21. The apparatus of claim 14, wherein the deep learning model comprises a feature extraction module, a pooling module, and a fully connected module; the training module comprises:
the second fusion submodule is used for obtaining an enhanced fusion feature map according to the enhanced vector feature map of the enhanced image;
the calculation submodule is used for obtaining a second classification difference value of the target object according to the enhanced fusion feature map and the label; and
the adjusting submodule is used for adjusting parameters of the feature extraction module, the pooling module and the fully connected module according to the second classification difference value.
22. The apparatus of claim 21, wherein the calculation submodule comprises:
the first calculation unit is used for determining an enhanced classification result value of the target object according to the enhanced fusion feature map;
and the second calculating unit is used for determining a second classification difference value of the target object according to an initial classification result value and the enhanced classification result value, wherein the initial classification result value is obtained based on the fusion feature map.
23. The apparatus of claim 22, wherein the second calculating unit comprises:
the calculating subunit is used for determining a second classification difference value of the target object according to the relative entropy between the initial classification result value and the enhanced classification result value.
24. The apparatus of claim 21, wherein the calculation submodule comprises:
the third calculation unit is used for determining an enhanced classification result value of the target object according to the enhanced fusion feature map; and
the fourth calculating unit is used for determining a second classification difference value of the target object according to the label and the enhanced classification result value.
25. A target object recognition apparatus comprising:
the recognition module is used for inputting the image to be recognized into the deep learning model to obtain the recognition result of the target object in the image to be recognized,
wherein the deep learning model is trained by using the training apparatus for a deep learning model of any one of claims 14 to 24.
26. The apparatus of claim 25, wherein the target object comprises a face, and the recognition result indicates a real face or a fake face.
27. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 13.
28. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 13.
29. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 13.
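Taken together, claims 1 to 8 describe a single training flow. The sketch below renders that flow in PyTorch purely for illustration: the attribute names (feature_extractor, first_weight, classifier), the global average pooling, the sigmoid weighting, the additive fusion, the cross-entropy losses, the keep_ratio stand-in for the claim-6 selection threshold, and the mask-based blending of claim 7 are all assumptions, not the patented implementation.

```python
import torch
import torch.nn.functional as F

def enhanced_training_iteration(model, sample_images, labels, optimizer,
                                keep_ratio: float = 0.3):
    """One hypothetical iteration following the flow of claims 1 to 8."""
    # Claim 1: initial vector feature map of the sample images.
    feature_map = model.feature_extractor(sample_images)            # B x C x H x W

    # Claims 2-3: first pooling feature map, weighted pooling feature map, fusion.
    first_weight = torch.sigmoid(model.first_weight(feature_map))   # B x 1 x H x W
    pooled = F.adaptive_avg_pool2d(feature_map, 1).flatten(1)
    weighted_pooled = F.adaptive_avg_pool2d(feature_map * first_weight, 1).flatten(1)
    fusion_feature_map = pooled + weighted_pooled                   # fusion rule assumed

    # Claim 1: first classification difference value from the fusion map and the label
    # (claims 5-6 derive the selection threshold from the average of these values;
    # that link is kept abstract here via keep_ratio).
    logits = model.classifier(fusion_feature_map)
    first_difference = F.cross_entropy(logits, labels, reduction="none")

    # Claims 4-6: attention feature map and a predetermined number of its largest elements.
    attention = (feature_map * first_weight).mean(dim=1)            # B x H x W
    k = max(1, int(keep_ratio * attention[0].numel()))
    top_idx = attention.flatten(1).topk(k, dim=1).indices
    enhancement_matrix = torch.zeros_like(attention.flatten(1))
    enhancement_matrix.scatter_(1, top_idx, 1.0)
    enhancement_matrix = enhancement_matrix.view_as(attention)

    # Claim 7: smooth the sample images and blend them with the originals.
    mask = F.interpolate(enhancement_matrix.unsqueeze(1),
                         size=sample_images.shape[-2:], mode="nearest")
    smoothed = F.avg_pool2d(sample_images, kernel_size=5, stride=1, padding=2)
    enhanced_images = mask * smoothed + (1 - mask) * sample_images

    # Claim 8: train the model on the enhanced images.
    enhanced_features = model.feature_extractor(enhanced_images)
    enhanced_logits = model.classifier(
        F.adaptive_avg_pool2d(enhanced_features, 1).flatten(1))
    second_difference = F.cross_entropy(enhanced_logits, labels)
    optimizer.zero_grad()
    second_difference.backward()
    optimizer.step()
    return second_difference.item()
```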
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210234795.7A CN114612743A (en) | 2022-03-10 | 2022-03-10 | Deep learning model training method, target object identification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210234795.7A CN114612743A (en) | 2022-03-10 | 2022-03-10 | Deep learning model training method, target object identification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114612743A true CN114612743A (en) | 2022-06-10 |
Family
ID=81862803
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210234795.7A Pending CN114612743A (en) | 2022-03-10 | 2022-03-10 | Deep learning model training method, target object identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114612743A (en) |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020102988A1 (en) * | 2018-11-20 | 2020-05-28 | 西安电子科技大学 | Feature fusion and dense connection based infrared plane target detection method |
US20210365741A1 (en) * | 2019-05-08 | 2021-11-25 | Tencent Technology (Shenzhen) Company Limited | Image classification method, computer-readable storage medium, and computer device |
CN112801164A (en) * | 2021-01-22 | 2021-05-14 | 北京百度网讯科技有限公司 | Training method, device and equipment of target detection model and storage medium |
CN113139543A (en) * | 2021-04-28 | 2021-07-20 | 北京百度网讯科技有限公司 | Training method of target object detection model, target object detection method and device |
CN113657181A (en) * | 2021-07-23 | 2021-11-16 | 西北工业大学 | SAR image rotating target detection method based on smooth label coding and feature enhancement |
CN113537249A (en) * | 2021-08-17 | 2021-10-22 | 浙江大华技术股份有限公司 | Image determination method and device, storage medium and electronic device |
CN113902625A (en) * | 2021-08-19 | 2022-01-07 | 深圳市朗驰欣创科技股份有限公司 | Infrared image enhancement method based on deep learning |
CN113705425A (en) * | 2021-08-25 | 2021-11-26 | 北京百度网讯科技有限公司 | Training method of living body detection model, and method, device and equipment for living body detection |
Non-Patent Citations (1)
Title |
---|
HE Zhixiang; HU Junwei: "Research on UAV Target Recognition Algorithm Based on Deep Learning" (基于深度学习的无人机目标识别算法研究), Journal of Binzhou University (滨州学院学报), no. 02, 15 April 2019 (2019-04-15), pages 19 - 25 *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115471717A (en) * | 2022-09-20 | 2022-12-13 | 北京百度网讯科技有限公司 | Model semi-supervised training and classification method and device, equipment, medium and product |
CN115471717B (en) * | 2022-09-20 | 2023-06-20 | 北京百度网讯科技有限公司 | Semi-supervised training and classifying method device, equipment, medium and product of model |
CN115482395A (en) * | 2022-09-30 | 2022-12-16 | 北京百度网讯科技有限公司 | Model training method, image classification method, device, electronic equipment and medium |
CN115482395B (en) * | 2022-09-30 | 2024-02-20 | 北京百度网讯科技有限公司 | Model training method, image classification device, electronic equipment and medium |
CN116012666A (en) * | 2022-12-20 | 2023-04-25 | 百度时代网络技术(北京)有限公司 | Image generation, model training and information reconstruction methods and devices and electronic equipment |
CN116012666B (en) * | 2022-12-20 | 2023-10-27 | 百度时代网络技术(北京)有限公司 | Image generation, model training and information reconstruction methods and devices and electronic equipment |
CN116416440A (en) * | 2023-01-13 | 2023-07-11 | 北京百度网讯科技有限公司 | Target recognition method, model training method, device, medium and electronic equipment |
CN116416440B (en) * | 2023-01-13 | 2024-02-06 | 北京百度网讯科技有限公司 | Target recognition method, model training method, device, medium and electronic equipment |
CN116071628A (en) * | 2023-02-06 | 2023-05-05 | 北京百度网讯科技有限公司 | Image processing method, device, electronic equipment and storage medium |
CN116071628B (en) * | 2023-02-06 | 2024-04-05 | 北京百度网讯科技有限公司 | Image processing method, device, electronic equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN113255694B (en) | Training image feature extraction model and method and device for extracting image features | |
CN114612743A (en) | Deep learning model training method, target object identification method and device | |
CN113971751A (en) | Training feature extraction model, and method and device for detecting similar images | |
CN110909165A (en) | Data processing method, device, medium and electronic equipment | |
CN110321845B (en) | Method and device for extracting emotion packets from video and electronic equipment | |
CN113222942A (en) | Training method of multi-label classification model and method for predicting labels | |
CN114494784A (en) | Deep learning model training method, image processing method and object recognition method | |
CN112949767A (en) | Sample image increment, image detection model training and image detection method | |
CN110348516B (en) | Data processing method, data processing device, storage medium and electronic equipment | |
CN113627439A (en) | Text structuring method, processing device, electronic device and storage medium | |
CN115063875A (en) | Model training method, image processing method, device and electronic equipment | |
CN113239807B (en) | Method and device for training bill identification model and bill identification | |
CN114882321A (en) | Deep learning model training method, target object detection method and device | |
CN112435137A (en) | Cheating information detection method and system based on community mining | |
CN107291774B (en) | Error sample identification method and device | |
CN114444619A (en) | Sample generation method, training method, data processing method and electronic device | |
CN114429633A (en) | Text recognition method, model training method, device, electronic equipment and medium | |
CN113255501A (en) | Method, apparatus, medium, and program product for generating form recognition model | |
CN115861400A (en) | Target object detection method, training method and device and electronic equipment | |
CN114494747A (en) | Model training method, image processing method, device, electronic device and medium | |
CN114898266A (en) | Training method, image processing method, device, electronic device and storage medium | |
CN114360053A (en) | Action recognition method, terminal and storage medium | |
CN114360027A (en) | Training method and device for feature extraction network and electronic equipment | |
CN113902899A (en) | Training method, target detection method, device, electronic device and storage medium | |
CN113947701A (en) | Training method, object recognition method, device, electronic device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||