CN111626400B - Training and application method and device for multi-layer neural network model and storage medium - Google Patents


Info

Publication number
CN111626400B
Authority
CN
China
Prior art keywords
sub
network
target
detection
feature map
Prior art date
Legal status
Active
Application number
CN201910149265.0A
Other languages
Chinese (zh)
Other versions
CN111626400A (en)
Inventor
李岩
黄耀海
赵东悦
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Priority date
Filing date
Publication date
Application filed by Canon Inc
Priority to CN201910149265.0A
Publication of CN111626400A
Application granted
Publication of CN111626400B
Status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Abstract

The present disclosure provides a training method, an application method, an apparatus, and a storage medium for a multi-layer neural network model. The method and apparatus perform target recognition using output feature maps from multiple layers of the network model, thereby improving the accuracy of target recognition.

Description

Training and application method and device for multi-layer neural network model and storage medium
Technical Field
The present disclosure relates to the field of multi-layer neural network modeling, and more particularly to multi-task applications of multi-layer neural network models.
Background
In recent years, multi-layer neural network models based on deep learning have been widely used in computer vision tasks such as face detection and vehicle detection. In these object (face, vehicle) detection tasks, determining the position of the object in the image is important for subsequent object recognition tasks such as attribute recognition and feature point recognition for the object. In conventional methods, a separate network model must be trained for each task; for example, one network model is trained for face detection and another for face recognition, and the output of the face detection model must be adapted before it can serve as input to the face recognition model, which works against both model simplicity and task accuracy. In this regard, the industry has proposed network models that can perform multiple tasks simultaneously, so that a smaller number of learnable parameters can be used and multiple tasks can be accomplished with less computation and higher accuracy.
Fig. 1 (a) shows a deep-learning-based network model capable of multi-task processing, taking human body detection as an example. As shown in the flowchart of Fig. 1 (b), when an original image is input into the network model, it propagates forward through the model in the form of feature maps. The output feature map of the Nth layer is extracted, and sub-region feature maps within it are determined according to predetermined candidate region boxes (one rectangular box per human body (target)), where each sub-region feature map covers the human body region delimited by its candidate region box. The sub-region feature map is then input into both a target detection network that performs human body detection and a target recognition network that performs human body recognition. On the one hand, the number and positions of human bodies in the image are detected by the target detection network; on the other hand, human body recognition is performed by the target recognition network, for example recognizing the head of a human body, or the gender or age range of a person.
When the original image propagates forward through the network model, the feature maps shrink after each pooling layer if the model contains pooling layers. If, following the method shown in Figs. 1 (a) and 1 (b), the sub-region feature map determined from the output feature map of the Nth layer is small, it contains little useful information, so the accuracy of target detection and recognition suffers and the final target detection and target recognition results are unsatisfactory.
Disclosure of Invention
The present disclosure is directed to improving the accuracy of target recognition results for neural network models capable of multitasking.
According to an aspect of the present disclosure, there is provided an application method of a multi-layer neural network model, the application method including: extracting output feature maps of at least two layers of the multi-layer neural network model based on an image input to the network model; obtaining, from each extracted output feature map, a sub-region feature map corresponding to the same target according to a predetermined candidate region box; and performing target recognition using the obtained plurality of sub-region feature maps.
According to another aspect of the present disclosure, there is provided a training method of a multi-layer neural network model, the network model including a first sub-network and a second sub-network, each including at least one layer, the first sub-network being located at a higher (earlier) level in the network model than the second sub-network; the network model further includes a third sub-network for performing target detection and a fourth sub-network for performing target recognition. The training method includes: inputting an output feature map of the second sub-network into the third sub-network, and training the first, second, and third sub-networks according to a target detection result of the third sub-network and a predetermined detection ground truth; and inputting an output feature map of the first sub-network into the fourth sub-network, and training the fourth sub-network according to a target recognition result of the fourth sub-network and a predetermined recognition ground truth.
According to another aspect of the present disclosure, there is provided an application method of a multi-layer neural network model, the network model including a first sub-network and a second sub-network, each including at least one layer, the first sub-network being located at a higher (earlier) level in the network model than the second sub-network; the network model further includes a third sub-network for performing target detection and a fourth sub-network for performing target recognition. The application method includes: inputting an output feature map of the second sub-network into the third sub-network so that the third sub-network performs a target detection task based on the received feature map; and inputting an output feature map of the first sub-network into the fourth sub-network so that the fourth sub-network performs a target recognition task based on the received feature map.
According to another aspect of the present disclosure, there is provided an application method of a multi-layer neural network model, the application method including: determining a plurality of candidate region boxes for a target in an image to be recognized; extracting corresponding sub-region feature maps based on at least two of the determined candidate region boxes that correspond to the same target; performing target recognition using the extracted sub-region feature maps; and fusing the recognition results of the target recognition, the fused recognition result serving as the final recognition result.
According to another aspect of the present disclosure, there is provided an application apparatus of a multi-layer neural network model, the application apparatus including: a feature map extraction unit configured to extract output feature maps of at least two layers of the multi-layer neural network model based on an image input to the network model; a sub-region feature map acquisition unit configured to obtain, from each extracted output feature map, a sub-region feature map corresponding to the same target according to a predetermined candidate region box; and a target recognition unit configured to perform target recognition using the obtained plurality of sub-region feature maps.
According to another aspect of the present disclosure, there is provided a training apparatus of a multi-layer neural network model, the network model including a first sub-network and a second sub-network, each including at least one layer, the first sub-network being located at a higher (earlier) level in the network model than the second sub-network; the network model further includes a third sub-network for performing target detection and a fourth sub-network for performing target recognition. The training apparatus includes: a first training unit configured to input an output feature map of the second sub-network into the third sub-network and to train the first, second, and third sub-networks according to a target detection result of the third sub-network and a predetermined detection ground truth; and a second training unit configured to input an output feature map of the first sub-network into the fourth sub-network and to train the fourth sub-network according to a target recognition result of the fourth sub-network and a predetermined recognition ground truth.
According to another aspect of the present disclosure, there is provided an application apparatus of a multi-layer neural network model, the network model including a first sub-network and a second sub-network, each including at least one layer, the first sub-network being located at a higher (earlier) level in the network model than the second sub-network; the network model further includes a third sub-network for performing target detection and a fourth sub-network for performing target recognition. The application apparatus includes: a target detection unit configured to input an output feature map of the second sub-network into the third sub-network so that the third sub-network performs a target detection task based on the received feature map; and a target recognition unit configured to input an output feature map of the first sub-network into the fourth sub-network so that the fourth sub-network performs a target recognition task based on the received feature map.
According to another aspect of the present disclosure, there is provided an application apparatus of a multi-layer neural network model, the application apparatus including: a candidate region box determination unit configured to determine a plurality of candidate region boxes for a target in an image to be recognized; a sub-region feature map extraction unit configured to extract corresponding sub-region feature maps based on at least two of the determined candidate region boxes that correspond to the same target; a target recognition unit configured to perform target recognition using the extracted sub-region feature maps; and a fusion unit configured to fuse the recognition results of the target recognition, the fused recognition result serving as the final recognition result.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform a method of applying a multi-layer neural network model.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform a method of training a multi-layer neural network model.
Other features of the present disclosure will become apparent from the following description of exemplary embodiments with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description of the embodiments, serve to explain the principles of the disclosure.
Fig. 1 (a) shows a known multi-task multi-layer neural network model and an application scenario in which it is used.
Fig. 1 (b) is a flowchart of the multi-task processing performed by the network model shown in Fig. 1 (a).
Fig. 2 shows an example of the scaling-down of an original image and of the feature maps in the network model.
Fig. 3 is a schematic of the hardware on which the network models of the present disclosure run.
Fig. 4 is a flowchart illustrating steps of a method for applying a multi-layer neural network model according to a first exemplary embodiment of the present disclosure.
Fig. 5 and 6 are application examples of the first exemplary embodiment of the present disclosure.
Fig. 7 shows the variation of the geometric and semantic information of the feature map in different layers in the network model.
Fig. 8 is a flowchart illustrating steps of a training method for a multi-layer neural network model according to a second exemplary embodiment of the present disclosure.
Fig. 9 and 10 are diagrams of structural examples of a multi-layer neural network model according to a second exemplary embodiment of the present disclosure.
Fig. 11 (a) and 11 (b) are flowcharts of the steps of the method of applying the multilayer neural network model according to the second exemplary embodiment of the present disclosure.
Fig. 12 shows a known face recognition technique.
Fig. 13 is a flowchart illustrating steps of a method for applying a multi-layer neural network model according to a third exemplary embodiment of the present disclosure.
Fig. 14 shows the fusion of face feature point coordinate values.
Fig. 15 shows the fusion of face attribute prediction results.
Detailed Description
In the application method of the multi-task neural network model shown in Figs. 1 (a) and 1 (b), on the one hand, to reduce computational complexity, the input image is scaled down in advance, so that the actual input image is smaller and contains fewer pixels; on the other hand, when the network model includes one or more pooling layers, the feature maps shrink further after each pooling operation. Taking the image shown in Fig. 2 as an example, assume the original input image is 1920×1080 and the face (target) in it is 40×40. To reduce computation while still preserving a usable representation of the face, the image is resized to 960×540, in which the face is 20×20. The downscaled image is input into the multi-layer neural network model and propagates forward through it in the form of feature maps. Each time a feature map passes through a pooling layer it shrinks further (two pooling layers are shown in Fig. 2). If, for example, the feature map finally output after pooling is 60×34, the face in it is smaller than 2×2. If this final feature map is extracted, the face region it contains is smaller than 2×2, and because the face is so undersized in the feature map, the detection and recognition results are poor; face recognition suffers in particular, since so much information has been lost from the feature map that the feature point detection and face attribute detection results degrade badly.
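To make the scale bookkeeping above concrete, here is a minimal Python sketch; the assumption that four stride-2 reductions produce the quoted 60×34 map (the figure itself draws only two pooling layers) is ours, chosen to reproduce the sizes given above.

```python
def downscaled_size(size, resize_factor, pool_strides):
    """Track how a (width, height) region shrinks through resizing and pooling."""
    w, h = size
    w, h = w / resize_factor, h / resize_factor
    for s in pool_strides:
        w, h = w / s, h / s
    return w, h

# A 40x40 face in a 1920x1080 image, resized by a factor of 2 and then
# reduced 16x by pooling (so that 960x540 becomes roughly 60x34):
print(downscaled_size((40, 40), 2, [2, 2, 2, 2]))     # (1.25, 1.25): below 2x2
print(downscaled_size((1920, 1080), 2, [2, 2, 2, 2]))  # (60.0, 33.75): ~60x34
```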
To improve the accuracy of the target detection and target recognition results, the exemplary embodiments of the present disclosure propose an optimized application scheme for a multi-layer neural network model. It differs from the scheme of Figs. 1 (a) and 1 (b) in that, instead of extracting the output feature map of a single layer for target detection and target recognition, output feature maps of multiple layers are extracted, the sub-region feature maps for the same target from those output feature maps are fused together, and the fused sub-region feature map is used for target detection and target recognition. Because the fused sub-region feature map contains more useful information, the target detection and target recognition results are better.
Various exemplary embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings. It should be understood that the present disclosure is not limited to the exemplary embodiments described below. Moreover, a solution to the problems addressed by the present disclosure need not include every combination of the features described in the exemplary embodiments.
Fig. 3 illustrates a hardware environment for running the multi-layer neural network model, including: a processor unit 10, an internal memory unit 11, a network interface unit 12, an input unit 13, an external memory 14, and a bus unit 15.
The processor unit 10 may be a CPU or a GPU. The internal memory unit 11 includes random access memory (RAM) and read-only memory (ROM). The RAM may serve as the main memory and work area of the processor unit 10. The ROM may store the control program of the processor unit 10, as well as files and other data used when the control program runs. The network interface unit 12 may be connected to a network to implement network communication. The input unit 13 controls input from a keyboard, a mouse, and the like. The external memory 14 stores the startup program, various applications, and the like. The bus unit 15 connects the above units.
Hereinafter, embodiments of the present disclosure will be described in detail with reference to the accompanying drawings.
< first exemplary embodiment >
Fig. 4 depicts a flowchart of the steps of an application method of a neural network model capable of multi-task processing according to the first exemplary embodiment of the present disclosure. In the first embodiment, the processing flow of the multi-layer neural network model shown in Fig. 4 is implemented by using the RAM as working memory and causing the CPU 10 to execute a program stored in the ROM and/or the external memory 14.
Step S101: the original image is input to the multi-layer neural network model.
The multi-layer neural network model described in the present disclosure is a network model capable of multi-task processing, for example a model that can perform both target detection and target recognition (including feature point detection of a target, attribute detection of a target, and the like) on an image. Taking faces as an example, face detection includes, but is not limited to, detecting the number of faces in an image and their positions, while face recognition includes, but is not limited to, locating feature points (eyes, nose, etc.) on a face and determining intrinsic attributes (gender, age, etc.) of a face. Taking vehicles as a further example, vehicle detection includes, but is not limited to, detecting the number of vehicles in an image and their positions, while vehicle recognition includes, but is not limited to, detecting vehicle feature points (license plates, etc.) and determining the vehicle type (small, medium, large, etc.). The first embodiment does not restrict the structure of the network model or the method by which it was trained in advance; any network model trained by any method that can perform target detection and target recognition on an image is applicable to the first embodiment. Preferably, the multi-layer neural network model in the first embodiment may be a convolutional neural network model, with a structure such as VGG16, ResNet, or SENet.
The multi-layer neural network model herein may include a plurality of sub-networks, and in particular, may include the following sub-networks:
1. An RPN (Region Proposal Network) sub-network for determining candidate region boxes and corresponding confidences, where each candidate region box represents a region in which a target is located. The RPN sub-network here may be a fully convolutional network.
2. A first sub-network, from which the multi-layer output feature maps are extracted to perform step S103 below. The first sub-network here may be a convolutional neural network.
3. A second sub-network for performing target detection. When face detection is performed, the second sub-network may be an FD (face detection) sub-network.
4. A third sub-network for performing target recognition. When feature point detection on faces is performed, the third sub-network may be an FFPD (facial feature point detection) sub-network.
The "first", "second" and "third" subnetworks herein are used to distinguish three subnetworks that do not overlap each other, without limiting the structure and performance of the subnetworks.
The multi-layer neural network model is a network model capable of performing multiple tasks, so that a sub-network for performing other tasks can be designed according to actual needs, and will not be described herein.
Step S102: at least one candidate region box is determined.
In step S102, after the input image completes forward propagation through the first sub-network, the output feature map is input into the RPN sub-network, and candidate region boxes and their corresponding confidences are determined using known RPN techniques. Of course, the first embodiment does not exclude other methods of determining candidate region boxes; if the candidate region boxes are determined in some other way rather than with an RPN, the multi-layer neural network model may include another sub-network for determining candidate region boxes in place of the RPN sub-network.
The information of each candidate region box characterizes one target. For example, a candidate region box may be represented as (x, y, h, w), where (x, y) is the coordinate of the lower-left corner (or another vertex) of the rectangular box and (h, w) is its height and width, so that the position and size of the candidate region box correspond to the region in which one target is located. When an image contains multiple targets, at least one candidate region box can be determined for each target. The smallest rectangular box containing the human body shown in Fig. 1 (a) is such a candidate region box.
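As an illustration of the (x, y, h, w) convention just described, the following sketch defines a candidate region box and maps it into the coordinate system of a downscaled feature map; the class name and the stride-based mapping are illustrative assumptions, not notation from the patent.

```python
from dataclasses import dataclass

@dataclass
class CandidateBox:
    x: float  # x of the reference corner (e.g., lower-left) in image coordinates
    y: float  # y of the reference corner
    h: float  # height of the box
    w: float  # width of the box

    def scaled(self, stride: int) -> "CandidateBox":
        """Map the box into a feature map produced at the given cumulative stride."""
        return CandidateBox(self.x / stride, self.y / stride,
                            self.h / stride, self.w / stride)

box = CandidateBox(x=80, y=80, h=32, w=32)  # one target in image coordinates
print(box.scaled(4))                        # the same target in a stride-4 feature map
```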
Step S103: and extracting output feature graphs of at least two layers in the network model, and obtaining a sub-region feature graph corresponding to the same target in each extracted output feature graph according to the determined candidate region frame.
The candidate region boxes determined in step S102, together with the original image, are fed back into the first sub-network, and the original image propagates forward through the first sub-network in the form of feature maps. Take the case shown in Fig. 5 as an example and assume that the first sub-network contains three pooling layers; of course, the first sub-network may also contain other layers, such as convolution layers, activation layers, and quantization layers, which are not shown in Fig. 5. To illustrate how the feature map size is gradually reduced by pooling, only the three pooling layers are drawn. In step S103, take the output feature maps of pooling layer 1, pooling layer 2, and pooling layer 3 (output feature maps 1 to 3) as an example, and assume their channel counts are 16, 24, and 48 (indicated by ellipses in Fig. 5) and their sizes are 60×60, 30×30, and 15×15, respectively. According to the determined candidate region box, sub-region feature maps 1 to 3 corresponding to the same target are determined in output feature maps 1 to 3, with sizes 8×8, 4×4, and 2×2, respectively. In this way, three sub-region feature maps of different levels and different sizes describing the same target are obtained.
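A hedged sketch of this multi-layer extraction, using torchvision's roi_align to crop the same candidate region box from each pooling layer's output; the cumulative strides (2, 4, 8) and the box coordinates are assumptions, while the channel counts and crop sizes follow the Fig. 5 example.

```python
import torch
from torchvision.ops import roi_align

# Output feature maps of the three pooling layers for one image (batch of 1),
# with the channel counts and spatial sizes quoted for Fig. 5.
fmap1 = torch.randn(1, 16, 60, 60)  # assumed cumulative stride 2
fmap2 = torch.randn(1, 24, 30, 30)  # assumed cumulative stride 4
fmap3 = torch.randn(1, 48, 15, 15)  # assumed cumulative stride 8

# One candidate region box in input-image coordinates, as (x1, y1, x2, y2).
box = torch.tensor([[20.0, 20.0, 52.0, 52.0]])

# spatial_scale rescales the box into each map's coordinates; output_size
# matches the sub-region sizes of the Fig. 5 example (8x8, 4x4, 2x2).
sub1 = roi_align(fmap1, [box], output_size=(8, 8), spatial_scale=1 / 2)
sub2 = roi_align(fmap2, [box], output_size=(4, 4), spatial_scale=1 / 4)
sub3 = roi_align(fmap3, [box], output_size=(2, 2), spatial_scale=1 / 8)
print(sub1.shape, sub2.shape, sub3.shape)
# torch.Size([1, 16, 8, 8]) torch.Size([1, 24, 4, 4]) torch.Size([1, 48, 2, 2])
```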
Fig. 5 takes the output feature maps of pooling layers as an example, but step S103 is not limited to extracting the output feature map of the layer preceding a pooling layer. For example, any intermediate layer (other than the last layer) of the network model may be used; one alternative is to extract one output feature map from each of the front third, middle third, and rear third of the layers of the first sub-network. The advantage is that, since the geometric information of output feature maps gradually decreases from the upper layers to the lower layers of the network model while the semantic information gradually increases, the output feature maps extracted from the front, middle, and rear thirds of the first sub-network respectively provide a map rich in geometric information, a map with balanced geometric and semantic information, and a map rich in semantic information, so that the sub-region feature maps for the same target are better represented in both the target detection task and the target recognition task.
Step S104: and fusing (connecting) the obtained sub-region feature graphs to obtain fused sub-region feature graphs.
Still taking the case of Fig. 5 as an example: since the three sub-region feature maps differ in size, they must first be scaled to the same size, for example by interpolation. This embodiment does not restrict the scaling method, as long as all sub-region feature maps have the same size after scaling; one option is to make the common size no smaller than the largest size before scaling. For example, in Fig. 5, sub-region feature map 2 and sub-region feature map 3 are enlarged to the size of sub-region feature map 1; alternatively, all three maps are enlarged to 16×16. The three size-scaled sub-region feature maps are then fused. One fusion method is concatenation (concat): in Fig. 5 the channel counts of sub-region feature maps 1 to 3 are 16, 24, and 48, so once the three maps have the same size, concatenation yields a sub-region feature map with 88 channels. Another fusion method is element-wise addition, available when the channel counts of the maps to be fused are equal: if sub-region feature maps 1 to 3 each had 16 channels, the equally sized maps could be added element by element to obtain a fused sub-region feature map with 16 channels. Preferably, sub-region feature maps from different levels are weighted differently during fusion. For example, in face recognition, feature points of a face (eyes, nose, etc.) are recognized mainly from geometric information, so the sub-region feature map rich in geometric information gets the larger weight, i.e., the weight of a sub-region feature map output by an upper layer may exceed that of one output by a lower layer. Conversely, in face detection, semantic information helps detect the foreground and hence the position of the face, so the sub-region feature map rich in semantic information gets the larger weight, i.e., the weight of a sub-region feature map output by an upper layer may be smaller than that of one output by a lower layer. The sub-region feature maps are fused according to the weights set for them, yielding the fused sub-region feature map.
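Continuing the sketch above, a minimal fusion helper covering both options described here (concatenation, and weighted element-wise addition when channel counts match); the interpolation mode and function names are assumptions.

```python
import torch
import torch.nn.functional as F

def fuse_subregion_maps(sub_maps, size=(8, 8), weights=None, mode="concat"):
    """Scale sub-region feature maps to a common size, then fuse them.

    mode="concat": channel-wise concatenation (channel counts may differ).
    mode="add":    element-wise sum (requires equal channel counts).
    Optional per-level weights implement the weighted fusion described above.
    """
    resized = [F.interpolate(m, size=size, mode="bilinear", align_corners=False)
               for m in sub_maps]
    if weights is not None:
        resized = [w * m for w, m in zip(weights, resized)]
    if mode == "concat":
        return torch.cat(resized, dim=1)
    return torch.stack(resized, dim=0).sum(dim=0)

# 16 + 24 + 48 channels -> one 88-channel 8x8 map, as in the Fig. 5 example.
fused = fuse_subregion_maps([sub1, sub2, sub3], size=(8, 8))
print(fused.shape)  # torch.Size([1, 88, 8, 8])
```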
Step S105: and performing a target recognition task by using the fused sub-region feature map.
Compared with the scheme of Figs. 1 (a) and 1 (b), which extracts the sub-region feature map of a target from the output feature map of a single layer, the method of the first embodiment, which fuses the sub-region feature maps of multiple layers' output feature maps, achieves higher accuracy in target detection and target recognition.
The methods by which the fused sub-region feature map is used for target detection and target recognition may be ones known in the art; this embodiment does not restrict them. Still taking Fig. 5 as an example, the fused sub-region feature map may be input into a target recognition sub-network (e.g., an FFPD sub-network for facial feature point detection) to detect the positions of feature points on the face. In that configuration, what is input into the target detection sub-network is not the fused sub-region feature map but the sub-region feature map corresponding to the target in the output feature map of the last layer of the first sub-network. Besides the arrangement of Fig. 5, in which recognition uses the fused sub-region feature map and detection uses an unfused one, the first embodiment also allows both target detection and target recognition to use the fused sub-region feature map: as shown in Fig. 6, the fused sub-region feature map is input simultaneously into the FFPD sub-network for facial feature point detection and the FD sub-network for face detection.
The known scheme of Figs. 1 (a) and 1 (b) and the method of the first embodiment were evaluated using, as experimental data, downscaled images containing several frontal or half-profile faces. The ground truth of each face detection box and of 15 feature points on each face was determined in advance. After training on the same data set, the performance comparison between the known method and the method of the first embodiment is shown in Table 1.
TABLE 1
As can be seen from Table 1, compared with the known scheme, the optimization scheme of the first embodiment shows a clear improvement in both mean RMSE and the CED metric, i.e., the accuracy of feature point recognition on targets in a network model capable of multi-task processing is improved, and detection of smaller targets is also more robust. Although the time consumed by forward propagation increases, it increases by only 3 ms on average, with little impact on the overall performance of the network model.
< second exemplary embodiment >
In a deep learning network model, output feature maps at different levels correspond to different granularities of the target. As shown in Fig. 7, when the original image is input into the multi-layer neural network model, the feature maps become progressively smaller through pooling as they propagate forward through the network. The geometric information of the feature maps gradually decreases from the upper layers to the lower layers of the network model, while the semantic information gradually increases. For example, the output feature maps in the first M layers of the network model contain rich geometric information; features extracted from such maps discriminate well between classes and are better suited to classification and feature point detection tasks. In face recognition, for instance, feature points of a face (eyes, nose, etc.) are recognized mainly from geometric information. Conversely, the output feature maps in the last K layers of the network model contain rich semantic information, which benefits foreground detection; in face detection, for instance, semantic information helps locate the position of the face.
In conventional target detection and target recognition schemes, output feature maps of the same level are used as the basis for different tasks, which can lead to differences in the accuracy achieved across tasks. If, instead, each task could be matched to feature maps of an appropriate level, every task would produce better results.
The second embodiment provides a training method and an application method of a multi-layer neural network model so that different tasks can be processed using feature maps of different levels. Fig. 8 depicts a flowchart of the steps of a training method for a neural network model capable of multi-task processing according to the second exemplary embodiment of the present disclosure. In the second embodiment, the training flow of the multi-layer neural network model shown in Fig. 8 is implemented by using the RAM as working memory and causing the CPU 10 to execute a program stored in the ROM and/or the external memory 14.
Step S201: the image to be trained is input into the multi-layer neural network model.
Similar to the first embodiment, the multi-layer neural network model in the second embodiment is also a multi-task network model capable of performing target detection and target recognition (feature point detection or attribute detection) on an image.
The multi-layer neural network model of the second embodiment may include a plurality of sub-networks, specifically, as shown in fig. 9, may include the following sub-networks:
1. A first sub-network and a second sub-network. Each comprises at least one layer, and the first sub-network sits at a higher level than the second sub-network, i.e., the first sub-network occupies the upper (earlier) layers relative to the second sub-network. The first and second sub-networks may be convolutional neural networks.
2. A third sub-network for performing target detection. When face detection is performed, the third sub-network may be an FD sub-network.
3. A fourth sub-network for performing target recognition. When feature point detection on faces is performed, the fourth sub-network may be an FFPD sub-network; when face attribute recognition is performed, the fourth sub-network may be a sub-network for gender recognition.
4. An RPN sub-network.
Step S202: and inputting the output characteristic diagram of the second sub-network into a third sub-network, and training the first sub-network, the second sub-network and the third sub-network according to the target detection result of the third sub-network and a predetermined detection true value.
Taking face detection as an example, the training images containing faces and the face detection ground truth are predetermined, the ground truth comprising the width and height (w, h) and the lower-left corner coordinates (x, y) of the smallest rectangular box around each face. When a training image is input into the network model, it propagates forward through the first and second sub-networks until the third sub-network outputs predictions of face size and face position. The predictions and the ground truth (face position and size) are fed into a loss function, the computed error is back-propagated, and the weights in the first, second, and third sub-networks are updated, completing one training pass of the three sub-networks. The loss function here may be a sum of an L2/L1 loss and a softmax loss.
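A minimal PyTorch sketch of this step-S202 update; every module here is a placeholder stand-in (the patent does not specify layer shapes), but the structure (forward through the first and second sub-networks, detection losses at the third, one backward pass updating all three) follows the paragraph above.

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for the sub-networks (layer sizes are assumptions).
first_net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
second_net = nn.Sequential(nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU())

class DetectionHead(nn.Module):  # third sub-network (FD): box + class outputs
    def __init__(self):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.box = nn.Linear(16, 4)   # predicted (x, y, w, h)
        self.cls = nn.Linear(16, 2)   # face / background logits

    def forward(self, f):
        f = self.pool(f).flatten(1)
        return self.box(f), self.cls(f)

third_net = DetectionHead()
params = [*first_net.parameters(), *second_net.parameters(),
          *third_net.parameters()]
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)
box_loss, cls_loss = nn.SmoothL1Loss(), nn.CrossEntropyLoss()

def detection_step(image, gt_box, gt_label):
    pred_box, pred_logits = third_net(second_net(first_net(image)))
    loss = box_loss(pred_box, gt_box) + cls_loss(pred_logits, gt_label)
    optimizer.zero_grad()
    loss.backward()    # back propagation of the combined error
    optimizer.step()   # updates the first, second, and third sub-networks
    return loss.item()

detection_step(torch.randn(1, 3, 64, 64),
               torch.tensor([[10.0, 10.0, 20.0, 20.0]]), torch.tensor([0]))
```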
Step S203: and inputting the output characteristic diagram of the first sub-network into a fourth sub-network, and training the fourth sub-network according to the target recognition result of the fourth sub-network and a predetermined recognition true value.
In step S203, because the output feature maps of the front layers of the network model contain rich geometric information and are therefore better suited to classification and feature point detection tasks, the output feature map of the first sub-network (the front layers) is used as the input of the fourth sub-network for target recognition, and the fourth sub-network is trained. The output of the fourth sub-network is the predicted target recognition result. The predicted recognition result and the ground truth (feature points on the face) are fed into a loss function, the computed error is back-propagated, and the weights in the fourth sub-network are updated, completing one training pass of the fourth sub-network. During back-propagation for the fourth sub-network, the weights in the first, second, and third sub-networks are fixed, i.e., those three sub-networks are not updated. The ground-truth feature points on the face may be predetermined, or the positions of the real facial feature points may be obtained by overlaying the third sub-network's face detection prediction with the face detection ground truth.
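Continuing the sketch above (reusing its imports and modules), step S203 can be illustrated by freezing the already-trained sub-networks and updating only a hypothetical recognition head fed from the first sub-network's output:

```python
# Step S203: train only the recognition head; the first, second, and third
# sub-networks stay fixed during this phase.
for net in (first_net, second_net, third_net):
    for p in net.parameters():
        p.requires_grad_(False)       # frozen: no updates in this phase

# Hypothetical fourth sub-network: regresses 15 (x, y) feature points from
# the first sub-network's (upper-layer) output feature map.
fourth_net = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                           nn.Linear(8, 30))
head_opt = torch.optim.SGD(fourth_net.parameters(), lr=1e-3)
landmark_loss = nn.MSELoss()          # L2 loss against the ground truth

def recognition_step(image, gt_landmarks):
    with torch.no_grad():             # the backbone only provides features
        feats = first_net(image)      # upper layer: rich geometric information
    pred = fourth_net(feats)
    loss = landmark_loss(pred, gt_landmarks)
    head_opt.zero_grad()
    loss.backward()                   # gradients reach only fourth_net
    head_opt.step()
    return loss.item()

recognition_step(torch.randn(1, 3, 64, 64), torch.randn(1, 30))
```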
In the training method of the second embodiment, the order between training the first, second, and third sub-networks and training the fourth sub-network is not fixed. For example, the fourth sub-network may be trained after the training of the first, second, and third sub-networks is complete, or it may be trained after each training iteration of the first, second, and third sub-networks.
Step S204: and when the training ending condition is met, training the multi-layer neural network model is completed.
Here, the training end conditions include, but are not limited to: the training times reach the set number, the training time reaches the set time, the output results of the third sub-network and the fourth sub-network reach the precision requirement, and the like. And when the training ending condition is met, ending the training of the multi-network model.
In the network structure shown in Fig. 9, the tasks that the trained multi-layer neural network model can perform include a target detection task (third sub-network) and a target recognition task (fourth sub-network). The target recognition task can be further subdivided into a feature point detection task and an attribute detection task. The further forward a level's output feature map, the more geometric information it contains, which favors the feature point detection task; the output feature map of a relatively later level contains a moderate amount of geometric information together with some semantic information, which favors the attribute detection task. Accordingly, the network model of Fig. 9 can be refined by dividing the first sub-network into a first part and a second part, the first part being at a higher (earlier) level in the network model than the second part, and correspondingly dividing the fourth sub-network into a feature point detection sub-network and an attribute detection sub-network, as shown in Fig. 10. When the fourth sub-network is trained, the output feature map of the first part of the first sub-network is input into the feature point detection sub-network to train it: the output of the feature point detection sub-network is the predicted feature point detection result, the predicted feature point information and the ground truth (feature points on the face) are fed into a loss function, the computed error is back-propagated, and the weights in the feature point detection sub-network are updated, completing its training. Similarly, the output feature map of the second part of the first sub-network is input into the attribute detection sub-network to train it: the output of the attribute detection sub-network is the predicted attribute detection result, the predicted attribute information and the ground truth (e.g., skin color and texture of the face) are fed into a loss function, the computed error is back-propagated, and the weights in the attribute detection sub-network are updated, completing its training.
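The refined Fig. 10 topology can be summarized in a short structural sketch; all layer widths are placeholders, and only the routing (feature points from the first part, attributes from the second part, detection from the deepest features) reflects the description above.

```python
import torch
import torch.nn as nn

class MultiTaskModel(nn.Module):
    """Structural sketch of the Fig. 10 topology (placeholder layer sizes)."""
    def __init__(self):
        super().__init__()
        self.first_part = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.second_part = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.second_net = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1),
                                        nn.ReLU())
        self.detect_head = nn.Conv2d(64, 5, 1)     # third sub-network (FD)
        self.landmark_head = nn.Conv2d(16, 30, 1)  # feature point detection sub-network
        self.attribute_head = nn.Conv2d(32, 2, 1)  # attribute detection sub-network

    def forward(self, x):
        f1 = self.first_part(x)    # earliest level: richest geometric information
        f2 = self.second_part(f1)  # later level: geometric plus some semantic information
        f3 = self.second_net(f2)   # deepest level: richest semantic information
        return (self.detect_head(f3),     # detection reads the deepest map
                self.landmark_head(f1),   # feature points read the earliest map
                self.attribute_head(f2))  # attributes read the middle map

det, lmk, attr = MultiTaskModel()(torch.randn(1, 3, 64, 64))
```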
Note that the "first", "second", and "third" sub-networks described in the second embodiment are unrelated to the "first", "second", and "third" sub-networks of the first embodiment. Here the terms merely distinguish sub-networks at different levels.
After the sub-networks of the multi-layer neural network model have been trained in steps S201 to S204, multiple tasks may be performed using the trained network model. The application method of the second embodiment is described below with reference to Figs. 11 (a) and 11 (b). Here, the application flow shown in Figs. 11 (a) and 11 (b) is implemented by using the RAM as working memory and causing the CPU 10 to execute a program stored in the ROM and/or the external memory 14.
Step S301: the original image is input to the multi-layer neural network model.
The network model here may be one trained according to the flow shown in Fig. 8; note, however, that this embodiment does not exclude applying a network model trained in some other way to the application flows shown in Figs. 11 (a) and 11 (b).
Step S302: and inputting the output characteristic diagram output from the second sub-network after passing through the first sub-network and the second sub-network to a third sub-network so that the third sub-network performs a target detection task based on the received characteristic diagram.
Step S303: and inputting the output characteristic diagram output from the first sub-network after passing through the first sub-network to a fourth sub-network so that the fourth sub-network performs a target recognition task based on the received characteristic diagram.
Steps S302 and S303 above may be performed in parallel; they are executed in turn as the feature maps propagate forward through the network model. Steps such as determination of candidate region boxes by the RPN sub-network and extraction of the sub-region feature maps containing the target based on the candidate region boxes are similar to those in the first embodiment and are not repeated here.
Of course, if the network model has the structure shown in Fig. 10, namely the first sub-network includes a first part and a second part, and the fourth sub-network includes a feature point detection sub-network and an attribute detection sub-network, then step S303 specifically includes:
step S303a: and inputting the output characteristic diagram output from the first part after passing through the first part in the first sub-network into the characteristic point detection sub-network so that the characteristic point detection sub-network detects the characteristic point of the target based on the received characteristic diagram.
Step S303b: and inputting the output feature map output from the second part after passing through the first part and the second part in the first sub-network into the attribute detection sub-network so that the attribute detection sub-network performs attribute detection of the target based on the received feature map.
When application processing is performed with the multi-task network model trained in the second embodiment, the geometric and semantic information required by each task is taken into account, and feature maps of different levels are applied to different tasks, so that the outputs of both the target detection task and the target recognition task are better.
< third exemplary embodiment >
In the application schemes of the first and second embodiments, when multiple candidate region boxes are determined using the RPN sub-network, NMS (non-maximum suppression) may be used before extracting the sub-region feature map containing the target based on the candidate region boxes. Under NMS, even if the RPN sub-network determines multiple candidate region boxes for the same target, only one candidate region box is kept for extracting the sub-region feature map containing the target, and the recognition results that the other candidate region boxes would yield for that target are discarded. As shown in Fig. 12, the RPN sub-network determines multiple candidate region boxes for the face in the original image. After NMS, the candidate region box with the highest accuracy is selected from among them, the sub-region feature map containing the face is determined in the feature map using that box, and the face is recognized based on predictions from that sub-region feature map alone. In target recognition tasks such as facial feature point detection and face attribute detection, the accuracy of the output therefore suffers because the reliability of a single candidate region box's recognition result is limited. In view of this, the third embodiment proposes an optimization method that performs target recognition using the recognition results of multiple candidate region boxes of the same target. The optimization method of the third embodiment is described below with reference to Fig. 13. Here, the application flow of the multi-layer neural network model shown in Fig. 13 is implemented by using the RAM as working memory and causing the CPU 10 to execute a program stored in the ROM and/or the external memory 14.
Step S401: and inputting an original image into the multi-layer neural network model, and determining a plurality of candidate region frames for each target by utilizing the RPN sub-network.
Here, the candidate region boxes for each target may also be determined by other algorithms; the third embodiment is not limited in this respect.
Step S402: and grouping the determined multiple candidate region frames, wherein the multiple candidate region frames in the same group correspond to the same target.
The candidate region boxes determined in step S401 may correspond to several targets, each box containing a possible target position. In step S402, candidate region boxes that may correspond to the same target can be put into one group by clustering, for example on the positions and sizes of the boxes.
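A minimal sketch of such grouping using greedy IoU clustering; the threshold and the greedy strategy are assumptions, since the paragraph above leaves the clustering method open.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def group_boxes(boxes, iou_threshold=0.5):
    """Greedy grouping: a box overlapping a group's first member joins that
    group; otherwise it starts a new group (one group per target)."""
    groups = []
    for box in boxes:
        for g in groups:
            if iou(box, g[0]) >= iou_threshold:
                g.append(box)
                break
        else:
            groups.append([box])
    return groups

boxes = [(10, 10, 50, 50), (12, 11, 52, 49), (200, 80, 240, 120), (8, 9, 48, 52)]
print([len(g) for g in group_boxes(boxes)])  # [3, 1]: two targets
```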
Step S403: and carrying out target identification by utilizing the sub-region feature graphs extracted from the plurality of candidate region frames.
In step S403, in the manner described in the first embodiment, target recognition tasks such as facial feature point detection and face attribute detection can be performed in turn on the sub-region feature map extracted from each candidate region box, yielding one recognition result per sub-region feature map.
Step S404: according to the grouping, fusing the recognition results of target recognition on the sub-region feature images extracted from the candidate region frames in the same group, and taking the fused recognition results as final recognition results.
The processing of steps S403 and S404 is described below, taking facial feature point detection and face attribute detection as examples.
Assume that six candidate region boxes in total are determined for two targets in step S401, and that in step S402 the six boxes are divided into two groups according to their corresponding targets: the first group contains candidate region boxes 1 to 3 corresponding to target 1, and the second group contains candidate region boxes 4 to 6 corresponding to target 2.
In the facial feature point detection task, six sub-region feature maps are first extracted based on the six candidate region boxes and input separately into the feature point detection sub-network, producing six sets of facial feature point coordinates. Then, based on the grouping information from step S402, feature point coordinate sets 1 to 3 are attributed to target 1 and sets 4 to 6 to target 2. Finally, coordinate sets 1 to 3 are fused and coordinate sets 4 to 6 are fused, yielding the facial feature point information for target 1 and target 2.
The third embodiment includes, but is not limited to, the following two methods for fusing facial feature point coordinates:
Method one: coordinate averaging. The coordinate values of each facial feature point are averaged. Assume that, in the image shown in Fig. 14, the three sub-region feature maps extracted based on candidate region boxes 1 to 3 of target 1, after passing through the feature point detection sub-network (for the left eye), produce the three heat maps shown on the right of the figure, where whiter positions in a heat map indicate a higher probability of being the left eye. The coordinate values at positions of equal left-eye probability across the heat maps are averaged to obtain a fused heat map.
Method two: element-wise addition. When the recognition results of target recognition are feature map pixels and the feature maps have the same number of channels, fusion can be performed by adding the feature maps' pixel values element by element.
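Both fusion methods reduce to simple array operations; a minimal numpy sketch, assuming 15 feature points per face and 56×56 heat maps (both assumed figures):

```python
import numpy as np

# Method one, coordinate averaging: landmarks[i] holds the 15 predicted
# (x, y) feature points from the i-th candidate box of the same group.
landmarks = [np.random.rand(15, 2) * 100 for _ in range(3)]  # 3 boxes, target 1
fused_landmarks = np.mean(landmarks, axis=0)  # (15, 2): one fused set of points

# Method two, element-wise addition: heat maps with identical channel counts
# and sizes are combined position by position (averaged here for stability).
heatmaps = [np.random.rand(15, 56, 56) for _ in range(3)]
fused_heatmap = np.mean(heatmaps, axis=0)     # same shape as each input map
```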
In the face attribute detection task, similarly to the facial feature point detection task, six sub-region feature maps are extracted based on the six candidate region boxes and input separately into the face attribute detection sub-network, producing six face attribute predictions. Then, based on the grouping information from step S402, attribute predictions 1 to 3 are attributed to target 1 and predictions 4 to 6 to target 2. Finally, attribute predictions 1 to 3 are fused by averaging, and attribute predictions 4 to 6 are fused by averaging, yielding the face attribute information for target 1 and target 2.
The third embodiment includes, but is not limited to, the following method for fusing face attribute predictions: voting. Assume that, in the image shown in Fig. 15, for candidate region boxes 1 to 3 of target 1, the face attribute prediction based on candidate region box 1 is male with probability 0.85; the prediction based on candidate region box 2 is male with probability 0.90; and the prediction based on candidate region box 3 is male with probability 0.80. Averaging the three predictions gives the attribute detection result for target 1: male with probability 0.85.
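In code, this fusion is a plain average of the per-box probabilities, reproducing the numbers above:

```python
# Attribute predictions for target 1 from its three candidate region boxes.
male_probs = [0.85, 0.90, 0.80]
fused_prob = sum(male_probs) / len(male_probs)
print(f"target 1: male, probability {fused_prob:.2f}")  # male, probability 0.85
```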
With the application method of the third embodiment, jointly considering the recognition results of multiple candidate region boxes for the same target amounts to perturbing and predicting the target several times. Compared with selecting a single candidate region box's recognition result for the target, this yields a better-optimized target recognition result, whether the task is feature point detection or attribute recognition.
< fourth exemplary embodiment >
A fourth exemplary embodiment of the present disclosure is an application apparatus of a multi-layer neural network model under the same inventive concept as the first exemplary embodiment. The application apparatus of the fourth embodiment includes a feature map extraction unit, a sub-region feature map acquisition unit, and a target recognition unit. The feature map extraction unit extracts output feature maps of at least two layers of the network model based on an image input to the multi-layer neural network model; the sub-region feature map acquisition unit obtains, from each extracted output feature map, the sub-region feature map corresponding to the same target according to a predetermined candidate region box; and the target recognition unit performs target recognition using the obtained plurality of sub-region feature maps.
Preferably, the target recognition unit fuses the obtained sub-region feature maps and performs target recognition using the fused sub-region feature map.
Preferably, the target recognition unit adjusts the sizes of the obtained sub-region feature maps so that they are identical after adjustment, then fuses the size-adjusted sub-region feature maps into one sub-region feature map by concatenation; or, when the channel counts of the sub-region feature maps from different layers are equal, fuses the size-adjusted sub-region feature maps into one sub-region feature map by element-wise addition.
Preferably, the application apparatus further comprises a target detection unit configured to perform target detection using the fused sub-region feature map.
< fifth exemplary embodiment >
A fifth exemplary embodiment of the present disclosure is a training apparatus and an application apparatus of a multi-layer neural network model under the same inventive concept as the second exemplary embodiment. The network model comprises a first sub-network and a second sub-network, the first sub-network and the second sub-network each comprise at least one layer, and the layer level of the first sub-network in the network model is higher than that of the second sub-network; the network model further includes a third sub-network for target detection and a fourth sub-network for target recognition.
The training apparatus of the fifth embodiment includes a first training unit and a second training unit. The first training unit inputs an output feature map of the second sub-network to the third sub-network, and trains the first sub-network, the second sub-network, and the third sub-network according to the target detection result of the third sub-network and a predetermined detection true value; the second training unit inputs an output feature map of the first sub-network to the fourth sub-network, and trains the fourth sub-network according to the target recognition result of the fourth sub-network and a predetermined recognition true value.
Preferably, the first sub-network comprises a first part and a second part, wherein the first part is higher in level in the network model than the second part, and the fourth sub-network includes a feature point detection sub-network and an attribute detection sub-network. The second training unit inputs the output feature map of the first part of the first sub-network into the feature point detection sub-network and trains the feature point detection sub-network according to the feature point detection result of the feature point detection sub-network and a predetermined feature point true value; it also inputs the output feature map of the second part of the first sub-network into the attribute detection sub-network and trains the attribute detection sub-network according to the attribute detection result of the attribute detection sub-network and a predetermined attribute true value.
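A hedged sketch of the two training units might look as follows, assuming PyTorch and illustrative names (subnet1 to subnet4, det_loss, recog_loss) that do not come from the disclosure; for brevity, the detection stage here updates only the sub-networks on the detection forward path, and the recognition stage freezes the backbone, since only the fourth sub-network is trained in that step.

```python
# Hedged sketch of the two training units; all names are illustrative.
import torch

def train_step(image, det_gt, recog_gt,
               subnet1, subnet2, subnet3, subnet4,
               det_loss, recog_loss, opt_det, opt_recog):
    # First training unit: the second sub-network feeds the third
    # (detection) sub-network; the detection loss drives the update.
    feat2 = subnet2(image)
    loss_d = det_loss(subnet3(feat2), det_gt)
    opt_det.zero_grad()
    loss_d.backward()
    opt_det.step()

    # Second training unit: the first sub-network feeds the fourth
    # (recognition) sub-network; only the fourth sub-network is updated,
    # so the backbone forward pass runs without gradient tracking.
    with torch.no_grad():
        feat1 = subnet1(subnet2(image))
    loss_r = recog_loss(subnet4(feat1), recog_gt)
    opt_recog.zero_grad()
    loss_r.backward()
    opt_recog.step()
    return loss_d.item(), loss_r.item()
```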
The application apparatus of the fifth embodiment includes a target detection unit and a target recognition unit. The target detection unit inputs the output feature map of the second sub-network to the third sub-network, so that the third sub-network performs a target detection task based on the received feature map; the target recognition unit inputs the output feature map of the first sub-network to the fourth sub-network, so that the fourth sub-network performs a target recognition task based on the received feature map.
Preferably, the first sub-network comprises a first part and a second part, wherein the first part is higher in level in the network model than the second part, and the fourth sub-network includes a feature point detection sub-network and an attribute detection sub-network. The target recognition unit inputs the output feature map of the first part of the first sub-network to the feature point detection sub-network, so that the feature point detection sub-network detects the feature points of the target based on the received feature map; and inputs the output feature map of the second part of the first sub-network to the attribute detection sub-network, so that the attribute detection sub-network performs attribute detection of the target based on the received feature map.
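The application-stage data flow might then be sketched as follows, again with illustrative module names; the second part of the first sub-network runs before its first part, since the first part sits higher in the network model.

```python
# Hedged sketch of the application-stage data flow; names are illustrative.
import torch

@torch.no_grad()
def run_model(image, subnet2, subnet1_second, subnet1_first,
              det_head, kp_head, attr_head):
    feat2 = subnet2(image)
    detections = det_head(feat2)        # third sub-network: target detection

    feat_low = subnet1_second(feat2)    # second (lower) part of sub-network 1
    attributes = attr_head(feat_low)    # attribute detection sub-network

    feat_high = subnet1_first(feat_low) # first (higher) part of sub-network 1
    keypoints = kp_head(feat_high)      # feature point detection sub-network
    return detections, keypoints, attributes
```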
< sixth exemplary embodiment >
A sixth exemplary embodiment of the present disclosure is an application apparatus of a multi-layer neural network model under the same inventive concept as the third exemplary embodiment. The application apparatus of the sixth embodiment includes a candidate region frame determining unit, a sub-region feature map extracting unit, a target recognition unit, and a fusion unit. The candidate region frame determining unit determines a plurality of candidate region frames for targets in an image to be recognized; the sub-region feature map extracting unit extracts corresponding sub-region feature maps based on at least two candidate region frames for the same target among the determined candidate region frames; the target recognition unit performs target recognition by using the extracted sub-region feature maps; and the fusion unit fuses the recognition results of the target recognition and takes the fused recognition result as the final recognition result.
Preferably, the application apparatus further includes a grouping unit configured to group the determined candidate region frames, wherein a plurality of candidate region frames in the same group correspond to the same target. The fusion unit fuses, according to the grouping, the recognition results of target recognition based on the sub-region feature maps extracted from the candidate region frames in the same group, and takes the fused recognition result as the final recognition result.
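A hedged sketch of such a grouping unit is given below; the IoU threshold of 0.5 is an assumption for illustration, as the disclosure does not fix a grouping criterion.

```python
# Hedged sketch of a grouping unit; the 0.5 IoU threshold is assumed.
def iou(a, b):
    # a, b: boxes as (x1, y1, x2, y2).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def group_boxes(boxes, thr=0.5):
    groups = []
    for box in boxes:
        for g in groups:
            if iou(box, g[0]) >= thr:  # overlaps the group seed: same target
                g.append(box)
                break
        else:
            groups.append([box])       # no overlap: start a new target group
    return groups
```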
Preferably, when the target recognition is feature point detection and the recognition result is the coordinate value of a feature point, the fusion unit fuses by averaging the coordinate values of the same feature point. When the target recognition is feature point detection and the recognition result is the pixel values of a feature map, the fusion unit fuses by element-wise addition of the feature maps, provided that the feature maps have the same number of channels.
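A minimal sketch of the coordinate-averaging fusion, assuming one (x, y) prediction of the same feature point per candidate region frame:

```python
# Minimal sketch of coordinate-averaging fusion for one feature point.
def fuse_feature_point(coords):
    # coords: (x, y) predictions of the same feature point, one per
    # candidate region frame of the same target.
    xs, ys = zip(*coords)
    return (sum(xs) / len(xs), sum(ys) / len(ys))

print(fuse_feature_point([(100, 52), (104, 50), (102, 54)]))  # (102.0, 52.0)
```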
Preferably, when the target recognition is attribute detection and the recognition result is the attribute prediction results of the extracted plurality of sub-region feature maps, the fusion unit fuses by averaging the plurality of attribute prediction results.
Other embodiments
Embodiments of the present disclosure can also be realized by a computer of a system or apparatus that reads out and executes computer-executable instructions (e.g., one or more programs) recorded on a storage medium (which may also be referred to more fully as a "non-transitory computer-readable storage medium") to perform the functions of one or more of the above-described embodiments, and/or that includes one or more circuits (e.g., an application specific integrated circuit (ASIC)) for performing the functions of one or more of the above-described embodiments, and by a method performed by the computer of the system or apparatus by, for example, reading out and executing the computer-executable instructions from the storage medium to perform the functions of one or more of the above-described embodiments, and/or controlling the one or more circuits to perform the functions of one or more of the above-described embodiments. The computer may comprise one or more processors (e.g., a central processing unit (CPU) or a micro processing unit (MPU)), and may include a separate computer or a network of separate processors to read out and execute the computer-executable instructions. The computer-executable instructions may be provided to the computer, for example, from a network or the storage medium. The storage medium may include, for example, one or more of a hard disk, a random access memory (RAM), a read-only memory (ROM), a storage of distributed computing systems, an optical disk (such as a compact disc (CD), a digital versatile disc (DVD), or a Blu-ray Disc (BD) (registered trademark)), a flash memory device, a memory card, and the like.
The embodiments of the present disclosure can also be implemented by a method in which software (a program) that performs the functions of the above embodiments is supplied to a system or apparatus through a network or various storage media, and a computer (or a CPU, an MPU, or the like) of the system or apparatus reads out and executes the program.
While the present disclosure has been described with reference to exemplary embodiments, it is to be understood that the present disclosure is not limited to the disclosed exemplary embodiments. The scope of the following claims is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structures and functions.

Claims (29)

1. An application method of a multi-layer neural network model, wherein the multi-layer neural network model comprises a first sub-network and a second sub-network, the first sub-network and the second sub-network each comprise at least one layer, and the layer level of the first sub-network in the network model is higher than that of the second sub-network; the network model further comprises a third sub-network for performing target detection and a fourth sub-network for performing target recognition; wherein a feature map for target detection is input from the second sub-network to the third sub-network, and a feature map for target recognition is input from the first sub-network to the fourth sub-network; the application method comprising:
extracting output feature maps of at least two layers in the first sub-network based on an image input to the multi-layer neural network model;
obtaining a sub-region feature map corresponding to the same target in each extracted output feature map according to a predetermined candidate region frame;
and performing target recognition in the fourth sub-network by using the obtained plurality of sub-region feature maps.
2. The application method according to claim 1, wherein performing target recognition by using the obtained plurality of sub-region feature maps specifically comprises:
fusing the obtained plurality of sub-region feature maps, and performing target recognition by using the fused sub-region feature map.
3. The application method according to claim 2, wherein fusing the obtained plurality of sub-region feature maps specifically comprises:
adjusting the sizes of the obtained sub-region feature maps so that the adjusted sub-region feature maps have the same size; and
fusing the size-adjusted sub-region feature maps into one sub-region feature map by concatenation; or
when the sub-region feature maps of different layers have the same number of channels, fusing the size-adjusted sub-region feature maps into one sub-region feature map by element-wise addition of the feature maps.
4. The application method according to claim 2, wherein the multi-layer neural network model is a network model capable of performing a target detection task and a target recognition task;
the application method further comprises the following steps:
and performing target detection by using the fused sub-region feature map.
5. The application method according to claim 1, wherein the at least two layers from which the output feature maps are extracted are middle layers of the network model.
6. A training method of a multi-layer neural network model, characterized in that the network model comprises a first sub-network and a second sub-network, the first sub-network and the second sub-network each comprise at least one layer, and the layer level of the first sub-network in the network model is higher than that of the second sub-network; the network model further comprises a third sub-network for performing target detection and a fourth sub-network for performing target recognition;
the training method comprises the following steps:
inputting an image to the multi-layer neural network model, inputting an output feature map of the second sub-network to the third sub-network, and training the first sub-network, the second sub-network and the third sub-network according to a target detection result of the third sub-network and a predetermined detection true value;
and inputting an output feature map of the first sub-network into the fourth sub-network, and training the fourth sub-network according to a target recognition result of the fourth sub-network and a predetermined recognition true value.
7. The training method of claim 6, wherein the first sub-network comprises a first part and a second part, wherein the first part is higher in level in the network model than the second part; the fourth sub-network comprises a feature point detection sub-network and an attribute detection sub-network;
inputting an output feature map of a first part in a first sub-network into a feature point detection sub-network, and training the feature point detection sub-network according to a feature point detection result of the feature point detection sub-network and a predetermined feature point true value;
and inputting the output feature map of the second part in the first sub-network into an attribute detection sub-network, and training the attribute detection sub-network according to an attribute detection result of the attribute detection sub-network and a predetermined attribute true value.
8. An application method of a multi-layer neural network model, characterized in that the network model comprises a first sub-network and a second sub-network, the first sub-network and the second sub-network each comprise at least one layer, and the layer level of the first sub-network in the network model is higher than that of the second sub-network; the network model further comprises a third sub-network for performing target detection and a fourth sub-network for performing target recognition;
The application method comprises the following steps:
inputting an image to the multi-layer neural network model, and inputting an output feature map of the second sub-network to the third sub-network so that the third sub-network performs a target detection task based on the received feature map;
and inputting an output feature map of the first sub-network into the fourth sub-network so that the fourth sub-network performs a target recognition task based on the received feature map.
9. The application method of claim 8, wherein the first sub-network comprises a first part and a second part, wherein the first part is higher in level in the network model than the second part; the fourth sub-network comprises a feature point detection sub-network and an attribute detection sub-network;
inputting the output feature map of the first part in the first sub-network to a feature point detection sub-network so that the feature point detection sub-network detects the feature point of the target based on the received feature map;
and inputting the output feature map of the second part in the first sub-network to the attribute detection sub-network so that the attribute detection sub-network performs attribute detection of the target based on the received feature map.
10. An application method of a multi-layer neural network model, wherein the multi-layer neural network model comprises a first sub-network and a second sub-network, the first sub-network and the second sub-network each comprise at least one layer, and the layer level of the first sub-network in the network model is higher than that of the second sub-network; the network model further comprises a third sub-network for performing target detection and a fourth sub-network for performing target recognition; wherein a feature map for target detection is input from the second sub-network to the third sub-network, and a feature map for target recognition is input from the first sub-network to the fourth sub-network; the application method comprising:
determining a plurality of candidate region frames for targets in an image to be recognized;
extracting corresponding sub-region feature maps in the first sub-network based on at least two candidate region frames for the same target among the determined plurality of candidate region frames;
performing target recognition in the fourth sub-network by using the extracted sub-region feature maps;
and fusing the recognition results of the target recognition, and taking the fused recognition result as a final recognition result.
11. The application method according to claim 10, wherein the application method further comprises:
grouping the determined candidate region frames, wherein a plurality of candidate region frames in the same group correspond to the same target;
extracting a corresponding sub-region feature map based on each determined candidate region frame, and performing target recognition by using the extracted sub-region feature maps;
and fusing, according to the grouping, the recognition results of the target recognition based on the sub-region feature maps extracted from the candidate region frames in the same group, and taking the fused recognition result as the final recognition result.
12. The application method according to claim 10, wherein, when the target recognition is feature point detection and the recognition result of the target recognition is coordinate values of feature points, fusion is performed by averaging the coordinate values of the same feature point; or,
when the target recognition is feature point detection and the recognition result of the target recognition is pixel values of a feature map, fusion is performed by element-wise addition of the feature maps, provided that the feature maps have the same number of channels.
13. The application method according to claim 10, wherein, when the target recognition is attribute detection and the recognition result of the target recognition is attribute prediction results of the extracted plurality of sub-region feature maps, fusion is performed by averaging the plurality of attribute prediction results.
14. An application device of a multi-layer neural network model, wherein the multi-layer neural network model comprises a first sub-network and a second sub-network, the first sub-network and the second sub-network each comprise at least one layer, and the layer level of the first sub-network in the network model is higher than that of the second sub-network; the network model further comprises a third sub-network for performing target detection and a fourth sub-network for performing target recognition; wherein a feature map for target detection is input from the second sub-network to the third sub-network, and a feature map for target recognition is input from the first sub-network to the fourth sub-network; the application device comprising:
a feature map extraction unit configured to extract output feature maps of at least two layers in the first sub-network based on an image input to the multi-layer neural network model;
a sub-region feature map acquisition unit configured to acquire, according to predetermined candidate region frames, sub-region feature maps corresponding to the same target in each extracted output feature map;
and a target recognition unit configured to perform target recognition in the fourth sub-network by using the obtained plurality of sub-region feature maps.
15. The application device according to claim 14, wherein the target recognition unit fuses the obtained plurality of sub-region feature maps, and performs target recognition by using the fused sub-region feature map.
16. The application device of claim 15, wherein,
the target recognition unit adjusts the sizes of the obtained sub-region feature maps so that the adjusted sub-region feature maps have the same size, and fuses the size-adjusted sub-region feature maps into one sub-region feature map by concatenation; or, when the sub-region feature maps of different layers have the same number of channels, fuses the size-adjusted sub-region feature maps into one sub-region feature map by element-wise addition of the feature maps.
17. The application device of claim 15, wherein the multi-layer neural network model is a network model capable of performing target detection tasks and target recognition tasks;
The application device further includes:
and a target detection unit configured to perform target detection using the fused sub-region feature map.
18. A training device for a multi-layer neural network model, characterized in that an image is input to the multi-layer neural network model, the network model comprises a first sub-network and a second sub-network, the first sub-network and the second sub-network respectively comprise at least one layer, and the layer level of the first sub-network in the network model is higher than that of the second sub-network; the network model also comprises a third sub-network for performing target detection and a fourth sub-network for performing target identification;
the training device comprises:
a first training unit configured to input an output feature map of a second sub-network to the third sub-network, the first sub-network, the second sub-network, and the third sub-network being trained according to a target detection result of the third sub-network and a predetermined detection true value;
and a second training unit configured to input an output feature map of the first sub-network to the fourth sub-network, the fourth sub-network being trained according to a target recognition result of the fourth sub-network and a predetermined recognition true value.
19. The training device of claim 18, wherein the first sub-network comprises a first part and a second part, wherein the first part is higher in level in the network model than the second part; the fourth sub-network comprises a feature point detection sub-network and an attribute detection sub-network;
the second training unit inputs the output feature map of the first part in the first sub-network into the feature point detection sub-network, trains the feature point detection sub-network according to the feature point detection result of the feature point detection sub-network and a predetermined feature point true value, and inputs the output feature map of the second part in the first sub-network into the attribute detection sub-network, trains the attribute detection sub-network according to the attribute detection result of the attribute detection sub-network and the predetermined attribute true value.
20. An application device of a multi-layer neural network model, characterized in that an image is input to the multi-layer neural network model, the network model comprises a first sub-network and a second sub-network, the first sub-network and the second sub-network each comprise at least one layer, and the layer level of the first sub-network in the network model is higher than that of the second sub-network; the network model further comprises a third sub-network for performing target detection and a fourth sub-network for performing target recognition;
The application device comprises:
a target detection unit configured to input an output feature map of the second sub-network to the third sub-network, so that the third sub-network performs a target detection task based on the received feature map;
and a target recognition unit configured to input the output feature map of the first sub-network to the fourth sub-network, so that the fourth sub-network performs a target recognition task based on the received feature map.
21. The application device of claim 20, wherein the first sub-network comprises a first part and a second part, wherein the first part is higher in level in the network model than the second part; the fourth sub-network comprises a feature point detection sub-network and an attribute detection sub-network;
the target recognition unit inputs the output feature map of the first part of the first sub-network to the feature point detection sub-network, so that the feature point detection sub-network detects the feature points of the target based on the received feature map; and inputs the output feature map of the second part of the first sub-network to the attribute detection sub-network, so that the attribute detection sub-network performs attribute detection of the target based on the received feature map.
22. An application device of a multi-layer neural network model, wherein the multi-layer neural network model comprises a first sub-network and a second sub-network, the first sub-network and the second sub-network each comprise at least one layer, and the layer level of the first sub-network in the network model is higher than that of the second sub-network; the network model further comprises a third sub-network for performing target detection and a fourth sub-network for performing target recognition; wherein a feature map for target detection is input from the second sub-network to the third sub-network, and a feature map for target recognition is input from the first sub-network to the fourth sub-network; the application device comprising:
a candidate region frame determination unit configured to determine a plurality of candidate region frames for targets in an image to be recognized;
a sub-region feature map extraction unit configured to extract corresponding sub-region feature maps in the first sub-network based on at least two candidate region frames for the same target among the determined plurality of candidate region frames;
a target recognition unit configured to perform target recognition in the fourth sub-network by using the extracted sub-region feature maps;
and a fusion unit configured to fuse the recognition results of the target recognition and to take the fused recognition result as a final recognition result.
23. The application device of claim 22, wherein the application device further comprises:
a grouping unit configured to group the determined candidate region frames, wherein a plurality of candidate region frames in the same group correspond to the same target;
the fusion unit fuses, according to the grouping, the recognition results of the target recognition based on the sub-region feature maps extracted from the candidate region frames in the same group, and takes the fused recognition result as the final recognition result.
24. The application device of claim 22, wherein,
when the target recognition is feature point detection and the recognition result of the target recognition is coordinate values of feature points, the fusion unit fuses by averaging the coordinate values of the same feature point; or,
when the target recognition is feature point detection and the recognition result of the target recognition is pixel values of a feature map, the fusion unit fuses by element-wise addition of the feature maps, provided that the feature maps have the same number of channels.
25. The application device according to claim 22, wherein, when the target recognition is attribute detection and the recognition result of the target recognition is attribute prediction results of the extracted plurality of sub-region feature maps, the fusion unit fuses by averaging the plurality of attribute prediction results.
26. A non-transitory computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform the application method of the multi-layer neural network model according to claim 1.
27. A non-transitory computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform the training method of the multi-layer neural network model according to claim 6.
28. A non-transitory computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform the application method of the multi-layer neural network model according to claim 8.
29. A non-transitory computer-readable storage medium storing instructions that, when executed by a computer, cause the computer to perform the application method of the multi-layer neural network model according to claim 10.
CN201910149265.0A 2019-02-28 2019-02-28 Training and application method and device for multi-layer neural network model and storage medium Active CN111626400B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910149265.0A CN111626400B (en) 2019-02-28 2019-02-28 Training and application method and device for multi-layer neural network model and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910149265.0A CN111626400B (en) 2019-02-28 2019-02-28 Training and application method and device for multi-layer neural network model and storage medium

Publications (2)

Publication Number Publication Date
CN111626400A CN111626400A (en) 2020-09-04
CN111626400B (en) 2024-03-15

Family

ID=72270746

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910149265.0A Active CN111626400B (en) 2019-02-28 2019-02-28 Training and application method and device for multi-layer neural network model and storage medium

Country Status (1)

Country Link
CN (1) CN111626400B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113706901B (en) * 2021-08-31 2023-05-16 Chongqing Jiaotong University Intelligent accident prevention and control and early warning system for entrance section of expressway tunnel

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107169421A (en) * 2017-04-20 2017-09-15 South China University of Technology A vehicle driving scene object detection method based on deep convolutional neural networks
WO2018003212A1 (en) * 2016-06-30 2018-01-04 Clarion Co., Ltd. Object detection device and object detection method
CN108875833A (en) * 2018-06-22 2018-11-23 Beijing Intelligent Steward Technology Co., Ltd. Training method, face identification method and the device of neural network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9633282B2 (en) * 2015-07-30 2017-04-25 Xerox Corporation Cross-trained convolutional neural networks using multimodal images
CN106874921B (en) * 2015-12-11 2020-12-04 Tsinghua University Image classification method and device
WO2017142629A1 (en) * 2016-02-18 2017-08-24 Google Inc. Image classification neural networks
CN108805828B (en) * 2018-05-22 2023-08-04 Tencent Technology (Shenzhen) Co., Ltd. Image processing method, device, computer equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018003212A1 (en) * 2016-06-30 2018-01-04 Clarion Co., Ltd. Object detection device and object detection method
CN107169421A (en) * 2017-04-20 2017-09-15 South China University of Technology A vehicle driving scene object detection method based on deep convolutional neural networks
CN108875833A (en) * 2018-06-22 2018-11-23 Beijing Intelligent Steward Technology Co., Ltd. Training method, face identification method and the device of neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Sakshi Indolia et al., "Conceptual Understanding of Convolutional Neural Network - A Deep Learning Approach", Procedia Computer Science, Vol. 132, 2018, pp. 679-688. *

Also Published As

Publication number Publication date
CN111626400A (en) 2020-09-04

Similar Documents

Publication Publication Date Title
CN108805170B (en) Forming data sets for fully supervised learning
US10417526B2 (en) Object recognition method and device
CN106415594B (en) Method and system for face verification
US10860837B2 (en) Deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition
CN108345892B (en) Method, device and equipment for detecting significance of stereo image and storage medium
CN109753913B (en) Multi-mode video semantic segmentation method with high calculation efficiency
CN110163193B (en) Image processing method, image processing device, computer-readable storage medium and computer equipment
WO2017096758A1 (en) Image classification method, electronic device, and storage medium
US9142011B2 (en) Shadow detection method and device
CN112889071B (en) System and method for determining depth information in a two-dimensional image
KR102476022B1 (en) Face detection method and apparatus thereof
CN111709471B (en) Object detection model training method and object detection method and device
US9323989B2 (en) Tracking device
CN110807362A (en) Image detection method and device and computer readable storage medium
CN111382602A (en) Cross-domain face recognition algorithm, storage medium and processor
CN114663502A (en) Object posture estimation and image processing method and related equipment
CN111583146A (en) Face image deblurring method based on improved multi-scale circulation network
CN111626400B (en) Training and application method and device for multi-layer neural network model and storage medium
JP7311310B2 (en) Information processing device, information processing method and program
EP2998928B1 (en) Apparatus and method for extracting high watermark image from continuously photographed images
WO2022017129A1 (en) Target object detection method and apparatus, electronic device, and storage medium
EP4200799A1 (en) Spatiotemporal recycling network
KR101972095B1 (en) Method and Apparatus of adding artificial object for improving performance in detecting object
KR102186767B1 (en) Method and Device for Detecting Feature Point of Face Using Learning
CN112560599A (en) Text recognition method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant