CN113111679A - Design method of human-shaped upper half monitoring network structure - Google Patents

Design method of human-shaped upper half monitoring network structure

Info

Publication number
CN113111679A
CN113111679A (application number CN202010020483.7A)
Authority
CN
China
Prior art keywords
layer
calculated
human
stride
input size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010020483.7A
Other languages
Chinese (zh)
Inventor
于晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ingenic Semiconductor Co Ltd
Original Assignee
Beijing Ingenic Semiconductor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ingenic Semiconductor Co Ltd filed Critical Beijing Ingenic Semiconductor Co Ltd
Priority to CN202010020483.7A priority Critical patent/CN113111679A/en
Publication of CN113111679A publication Critical patent/CN113111679A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for designing a human upper-body monitoring network structure, comprising the following steps. S1, initial setting: set the input size to m x m, where m > 24; when the stride is 1, the output size of a layer is n-2, and when the stride is 2 it is (n-2)/2, where n is the input size of that layer; because the final output is Nx3x3, where N is the number of channels, a 3x3 convolution is performed and the score and position coordinates of the candidate frame are finally regressed. S2, set the stride of every layer to 2 and derive the input size backward from the formula in S1, obtaining the input size, shape, and calculation amount of each layer of the network. S3, use the network designed in S2 to perform first-stage training on the human upper body.

Description

Design method of human-shaped upper half monitoring network structure
Technical Field
The invention relates to the field of intelligent identification, in particular to a design method of a human-shaped upper half body monitoring network structure.
Background
With the continuous development of science and technology, and in particular of computer vision, intelligent recognition technology is widely applied in fields such as information security and electronic authentication, and image feature extraction methods achieve good recognition performance.
Furthermore, common terms in the art include:
Upper-body detection: detecting the region of a human body comprising the head and extending to the shoulders and elbows.
Cascaded convolutional networks: an N-stage network design in which the task is processed from coarse to fine.
Recall rate: for example, the number of face frames successfully detected in a picture, as a percentage of the number of faces originally present in the picture.
False detection rate: for example, the number of incorrectly detected face frames in a picture, as a percentage of the number of faces originally present in the picture.
MAC count: the number of multiply-and-add (multiply-accumulate) operations.
MTCNN (multi-task cascaded convolutional neural network) puts face region detection and face keypoint detection together in a cascade framework. The whole can be divided into a three-stage network structure: the first stage P-Net, the second stage R-Net, and the third stage O-Net. The MTCNN pipeline is: image -> P-Net -> R-Net -> O-Net -> output. First an image pyramid is constructed: the image is transformed to different scales so that recognition targets of different sizes can be detected. That is, the image passes through a pyramid, generating images at multiple scales, which are then input to the P-Net. P-Net (proposal network) is basically a small fully convolutional network (FCN). It performs preliminary feature extraction and frame calibration on the image pyramid, adjusts windows with bounding-box regression, and filters out most windows with the non-maximum suppression (NMS) algorithm. Because the P-Net is small it can quickly select candidate regions, but its accuracy is not high; NMS is then used to merge the candidate frames, and image patches are extracted according to the candidate frames.
R-Net (refine network) is basically a convolutional neural network that adds a fully connected layer relative to the first-stage P-Net, so the screening of input data is stricter. After a picture passes through the P-Net, many prediction windows remain; all of them are sent to the R-Net, which filters out a large number of poor candidate frames, and finally bounding-box regression and NMS are applied to the remaining candidate frames to further refine the predictions. The R-Net selects frames accurately and typically leaves only a few, which are input to the O-Net. O-Net (output network) is basically a more complex convolutional neural network, adding one convolutional layer compared with the R-Net. The O-Net differs from the R-Net in that this stage identifies the face region under more supervision and regresses the person's facial feature points, finally outputting five feature points. Although the O-Net is slower, few images reach it, because high-probability bounding boxes have already been obtained by the first two networks; the O-Net then outputs accurate bounding-box and keypoint information.
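As background only (this is standard practice, not part of the claimed method), the NMS step described above is usually implemented as a greedy overlap filter. The following minimal Python sketch illustrates it; the (x1, y1, x2, y2) box format, the function names, and the 0.5 threshold are assumptions chosen for illustration:

```python
def iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # greedily keep the highest-scoring box, drop boxes that overlap it too much
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```

For example, of two heavily overlapping candidate frames only the higher-scoring one survives, while a distant frame is kept.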
However, the prior art has problems. A first-stage network is generally used to quickly generate candidate frames for the later-stage networks, and the usual first-stage design uses a small resolution, because the top of the funnel must be fast; for example, the design input size for faces is 12x12. But this design is problematic in actual engineering use: because the resolution is too small, features are lost and the expressive capability is reduced, so the first-stage network generates many candidate frames, sometimes up to several thousand. Measured by recall rate the effect looks good, but the false detection rate is very poor, which greatly hurts the efficiency of the later stages. For an embedded product this is fatal, because it adds a great deal of computation: for example, thousands of extra image patches are sent to the next-stage network, and since the next stage is a finer network, the computation grows sharply and the benefit of the cascade is greatly reduced.
Disclosure of Invention
In order to solve the above problems, the present invention proceeds as follows. The cascaded convolutional network is divided into N stages, and the function of the first stage is crucial. While guaranteeing the recall rate, the invention reduces the amount of computation, greatly reduces the false detection rate, and greatly reduces the pressure on the later-stage networks, which makes the whole project viable when deployed on an embedded platform.
Specifically, the invention provides a human upper body monitoring network structure design method, which comprises the following steps:
S1, initial setting: set the input size to m x m, where m > 24;
when the stride is 1, the output size of a layer is n-2, and when the stride is 2 it is (n-2)/2, where n is the input size of that layer;
because the final output is Nx3x3, where N is the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed;
S2, set the stride of every layer to 2 and derive the input size backward from the formula in S1, obtaining the input size, shape, and calculation amount of each layer of the network;
S3, use the network designed in S2 to perform first-stage training on the human upper body.
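The size formulas of S1 can be written down directly; a short Python sketch (the function name conv_out is mine, not from the patent) shows that applying the stride-2 formula three times to a 38x38 input yields the 18, 8, 3 sequence used later:

```python
def conv_out(n: int, stride: int) -> int:
    # output size of a 3x3, no-padding convolution, per the patent's formulas:
    # stride 1 -> n - 2;  stride 2 -> (n - 2) / 2
    if stride == 1:
        return n - 2
    if stride == 2:
        return (n - 2) // 2
    raise ValueError("only strides 1 and 2 are considered")

# With stride 2 everywhere, a 38x38 input shrinks 38 -> 18 -> 8 -> 3.
sizes = []
n = 38
for _ in range(3):
    n = conv_out(n, 2)
    sizes.append(n)
```

The same helper reproduces the stride-1 case too, e.g. a 24x24 input gives 22x22 after one stride-1 layer.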
In step S1, m = 38, i.e., the input size for the human upper body is 38x38.
The N is 16.
The step S2 further includes:
a first layer: here the first-layer n = 38, so (38-2)/2 = 18;
a second layer: from the result of the first-layer calculation, the second-layer n = 18, so (18-2)/2 = 8;
a third layer: from the result of the second-layer calculation, the third-layer n = 8, so (8-2)/2 = 3.
With this formula we derive the input size backward: the stride is 2 everywhere, the final (n-2)/stride = 3, and the input sizes of the layers are 8, 18, and 38. Here 3 is the size of the feature map obtained at the last layer, i.e., the 3 in Nx3x3: because the final output is Nx3x3 with N the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed.
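The backward derivation is a one-line inversion of the stride-2 formula, (n-2)/2 = out, so n = 2*out + 2. A hedged Python sketch (the function name is mine):

```python
def reverse_sizes(final_size: int = 3, layers: int = 3) -> list:
    # invert the stride-2 rule (n - 2) / 2 = out  =>  n = 2 * out + 2
    sizes = [final_size]
    for _ in range(layers):
        sizes.append(2 * sizes[-1] + 2)
    return sizes[::-1]  # input-to-output order
```

Starting from the final 3x3 feature map, this reproduces the 38, 18, 8, 3 chain given above.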
The calculation amounts of the first layer, the second layer, and the third layer are respectively:
first layer calculation amount: 18*18*3*3*16*1 = 46,656
second layer calculation amount: 8*8*3*3*16*16 = 147,456
third layer calculation amount: 3*3*3*3*16*16 = 20,736;
the total calculation amount is 214,848.
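Each of the per-layer calculation amounts follows the same pattern: output area, times kernel area, times input channels, times output channels. A small Python sketch (the function name is mine) reproduces the three figures and their total:

```python
def conv_macs(out_size: int, k: int, c_in: int, c_out: int) -> int:
    # one multiply-add per kernel element, per input channel, per output element
    return out_size * out_size * k * k * c_in * c_out

per_layer = [
    conv_macs(18, 3, 1, 16),   # first layer:  18*18*3*3*1*16
    conv_macs(8, 3, 16, 16),   # second layer: 8*8*3*3*16*16
    conv_macs(3, 3, 16, 16),   # third layer:  3*3*3*3*16*16
]
```

Summing per_layer gives the total MAC count of the designed network.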
A verification step: with N = 16, the total calculation amount computed above is compared with the total calculation amount of a prior-art network designed with input size 24x24 and strides 1, 2, 1, 2, yielding the conclusion that the calculation amount of the present method is less than that of the prior art.
The prior art specifically sets the human upper-body sample design size to 24x24; the network design strides must then be 1, 2, 1, 2. Since the final output is Nx3x3, with N the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed. Taking the same number of channels, 16, as the standard, the first convolution layer yields 22x22, the second 10x10, the third 8x8, and the fourth 3x3. The specific calculation amounts are:
first layer calculation amount: 22*22*3*3*16*1 = 69,696
second layer calculation amount: 10*10*3*3*16*16 = 230,400
third layer calculation amount: 8*8*3*3*16*16 = 147,456
fourth layer calculation amount: 3*3*3*3*16*16 = 20,736
The total calculation amount is: 468,288.
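The verification comparison between the prior 24x24 design and the 38x38 design can be reproduced with a small helper (the function and variable names are mine, not from the patent):

```python
def network_macs(out_sizes, channels):
    # 3x3 kernels; layer i maps channels[i] -> channels[i + 1]
    # at output spatial size out_sizes[i]
    return sum(s * s * 9 * c_in * c_out
               for s, c_in, c_out in zip(out_sizes, channels, channels[1:]))

prior_24 = network_macs([22, 10, 8, 3], [1, 16, 16, 16, 16])  # four layers
ours_38 = network_macs([18, 8, 3], [1, 16, 16, 16])           # three layers
```

This gives 468,288 versus 214,848 MACs, confirming the claimed reduction of more than half.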
Thus, the present application has the following advantages. The increased resolution preserves more features and gives stronger expressive capability, while the calculation amount is cut by more than half, so both accuracy and efficiency are high. Using the network obtained by this design method for first-stage training on the human upper body, actual test data show that false detections drop from more than 2,000 to about 100. This greatly improves the efficiency of both the first-stage coarse detection network and the later fine networks, greatly reduces the number of frames sent to the later networks for processing, and at the same time guarantees the accuracy of the first-stage network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram of the process of the present invention.
Detailed Description
In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention relates to a design method of a human-shaped upper half monitoring network structure, which comprises the following steps:
S1, initial setting: set the input size to 38x38;
Here 38x38 is the core of the reverse design of the network structure. Increasing the number of network layers would affect performance and the detection rate, because the input size determines the minimum target size the network can actually detect in a frame, and thus the detection distance. For example, if the upper body of a human is scaled down to 24, more features are lost than at 38, causing more false detections; 56 preserves more detail than 38 but reduces the detection distance.
When the stride is 2, the output size of a layer is (n-2)/2, where n is the input size of that layer.
The advantage of choosing stride 2 is as follows. In the field of target detection, the general demand is that the farther the detection distance the better, which requires a larger input resolution; but as the resolution grows, the detection time of the whole network rises sharply and practicality is lost. Setting the stride to 2 allows small, distant targets to be detected while keeping the growth of detection time under control. In actual measurements, detection at 640x360 resolution reaches 5 meters and takes 200 milliseconds, and at 800x450 resolution reaches 8 meters and takes 270 milliseconds; if the stride were set to 1, the first layer alone would increase the time to 650 milliseconds.
Because the final output is Nx3x3, where N is the number of channels, here set to 16, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed;
S2, set the stride of every layer to 2 and derive the input size backward from the formula in S1, obtaining the input size, shape, and calculation amount of each layer of the network:
a first layer: here the first-layer n = 38, so (38-2)/2 = 18, and the calculation amount is:
18*18*3*3*16*1 = 46,656;
a second layer: from the result of the first-layer calculation, the second-layer n = 18, so (18-2)/2 = 8, and the calculation amount is: 8*8*3*3*16*16 = 147,456;
a third layer: from the result of the second-layer calculation, the third-layer n = 8, so (8-2)/2 = 3, and the calculation amount is: 3*3*3*3*16*16 = 20,736.
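The three stride-2 layers of S2 can be traced in a few lines of Python; variable names are mine, and 3x3 no-padding convolutions are assumed as in the formulas above:

```python
size, c_in = 38, 1
macs = []
for c_out in (16, 16, 16):       # three stride-2 layers, 16 output channels each
    size = (size - 2) // 2       # 3x3 convolution, stride 2, no padding
    macs.append(size * size * 9 * c_in * c_out)
    c_in = c_out
# size traces 38 -> 18 -> 8 -> 3
```

The loop ends with the 3x3 feature map and the three per-layer MAC counts of the designed network.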
S3, according to the designed network obtained in S2, the human upper body is trained in the first stage.
Specifically, it can be further understood as explained below:
1. The face algorithm input size is 12x12. The human upper-body sample design size is twice that of the face, 24x24, so the network design strides must be 1, 2, 1, 2, because the final output is Nx3x3, with N the number of channels; a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed. When the stride is 1 the output size of a layer is n-2, and when the stride is 2 it is (n-2)/2, where n is the input size of each layer. For example, with a human-shape input of 24x24 and the same channel count of 16 as the standard, the first convolution layer yields 22x22, the second 10x10, the third 8x8, and the fourth 3x3.
First layer calculated amount:
22*22*3*3*16*1=69,696
second layer calculated quantity:
10*10*3*3*16*16=230,400
third layer calculated quantity:
8*8*3*3*16*16=147,456
fourth layer calculation amount:
3*3*3*3*16*16=20,736
total MAC number: 468,288
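The prior-art size trace above (24 -> 22 -> 10 -> 8 -> 3) can be checked mechanically with the same two formulas; a small sketch (variable names are mine):

```python
size = 24
trace = []
for stride in (1, 2, 1, 2):      # the prior-art stride pattern
    # 3x3 convolution, no padding: stride 1 -> n - 2; stride 2 -> (n - 2) / 2
    size = size - 2 if stride == 1 else (size - 2) // 2
    trace.append(size)
```

The trace ends at the 3x3 map that feeds the final regression, matching the layer sizes listed above.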
2. The input size of our method is 38x38. This is an absolute advantage in both accuracy and efficiency: the increased resolution preserves more features and gives stronger expressive capability, while the calculation amount is greatly reduced, by more than half.
With this formula we derive the input size backward: the stride is 2 everywhere, the final (n-2)/stride = 3, and the input sizes of the layers are 8, 18, and 38.
First layer calculated amount:
18*18*3*3*16*1=46,656
second layer calculated quantity:
8*8*3*3*16*16=147,456
third layer calculated quantity:
3*3*3*3*16*16=20,736
total MAC number: 214,848
3. The human upper body is trained at the first stage with this network; false detections drop from more than 2,000 to about 100, greatly improving the efficiency of the first-stage coarse detection network and the later fine networks and greatly reducing the number of frames sent to the later networks for processing, while the accuracy of the first-stage network is guaranteed.
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes to the embodiment. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in its protection scope.

Claims (7)

1. A human-shaped upper half monitoring network structure design method is characterized by comprising the following steps:
S1, initial setting: set the input size to m x m, where m > 24;
when the stride is 1, the output size of a layer is n-2, and when the stride is 2 it is (n-2)/2, where n is the input size of that layer;
because the final output is Nx3x3, where N is the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed;
S2, set the stride of every layer to 2 and derive the input size backward from the formula in S1, obtaining the input size, shape, and calculation amount of each layer of the network;
S3, use the network designed in S2 to perform first-stage training on the human upper body.
2. The method as claimed in claim 1, wherein N is 16.
3. The method as claimed in claim 2, wherein in step S1, m = 38, that is, the input size of the human upper body is 38x38.
4. The method for designing a human upper body monitoring network structure as claimed in claim 3, wherein the step S2 further comprises:
a first layer: wherein the first-layer n = 38, so (38-2)/2 = 18;
a second layer: from the result of the first-layer calculation, the second-layer n = 18, so (18-2)/2 = 8;
a third layer: from the result of the second-layer calculation, the third-layer n = 8, so (8-2)/2 = 3.
5. The design method of the human upper-body monitoring network structure according to claim 4, wherein the calculation amounts of the first layer, the second layer, and the third layer are respectively:
first layer calculation amount: 18*18*3*3*16*1 = 46,656
second layer calculation amount: 8*8*3*3*16*16 = 147,456
third layer calculation amount: 3*3*3*3*16*16 = 20,736;
the total calculation amount is 214,848.
6. The method as claimed in claim 5, further comprising a verification step: with N = 16, the calculated total calculation amount is compared with the total calculation amount of a prior-art network designed with input size 24x24 and strides 1, 2, 1, 2, yielding the conclusion that the calculation amount of the present method is less than that of the prior art.
7. The design method of the human upper-body monitoring network structure according to claim 6, wherein the prior art specifically comprises: the human upper-body sample design size is 24x24, so the network design strides must be 1, 2, 1, 2; since the final output is Nx3x3, with N the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed; taking the number of channels as 16 as the standard, the first convolution layer yields 22x22, the second 10x10, the third 8x8, and the fourth 3x3; the specific calculation amounts are:
first layer calculation amount: 22*22*3*3*16*1 = 69,696
second layer calculation amount: 10*10*3*3*16*16 = 230,400
third layer calculation amount: 8*8*3*3*16*16 = 147,456
fourth layer calculation amount: 3*3*3*3*16*16 = 20,736
the total calculation amount is: 468,288.
CN202010020483.7A 2020-01-09 2020-01-09 Design method of human-shaped upper half monitoring network structure Pending CN113111679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010020483.7A CN113111679A (en) 2020-01-09 2020-01-09 Design method of human-shaped upper half monitoring network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010020483.7A CN113111679A (en) 2020-01-09 2020-01-09 Design method of human-shaped upper half monitoring network structure

Publications (1)

Publication Number Publication Date
CN113111679A (en) 2021-07-13

Family

ID=76708630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010020483.7A Pending CN113111679A (en) 2020-01-09 2020-01-09 Design method of human-shaped upper half monitoring network structure

Country Status (1)

Country Link
CN (1) CN113111679A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160148079A1 (en) * 2014-11-21 2016-05-26 Adobe Systems Incorporated Object detection using cascaded convolutional neural networks
CN107506707A (en) * 2016-11-30 2017-12-22 奥瞳系统科技有限公司 Using the Face datection of the small-scale convolutional neural networks module in embedded system
US20180150684A1 (en) * 2016-11-30 2018-05-31 Shenzhen AltumView Technology Co., Ltd. Age and gender estimation using small-scale convolutional neural network (cnn) modules for embedded systems
CN109472247A (en) * 2018-11-16 2019-03-15 西安电子科技大学 Face identification method based on the non-formula of deep learning
CN110175504A (en) * 2019-04-08 2019-08-27 杭州电子科技大学 A kind of target detection and alignment schemes based on multitask concatenated convolutional network
WO2019200749A1 (en) * 2018-04-17 2019-10-24 平安科技(深圳)有限公司 Facial recognition method, apparatus, computing device and storage medium
CN110619319A (en) * 2019-09-27 2019-12-27 北京紫睛科技有限公司 Improved MTCNN model-based face detection method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
尹茜, "A face detection algorithm based on a lightweight neural network", Journal of Changzhou College of Information Technology, vol. 18, no. 06, pages 23-27 *
朱鹏 et al., "A lightweight multi-scale-feature face detection method", Computer Technology and Development *

Similar Documents

Publication Publication Date Title
Jiang et al. HDCB-Net: A neural network with the hybrid dilated convolution for pixel-level crack detection on concrete bridges
CN109117876B (en) Dense small target detection model construction method, dense small target detection model and dense small target detection method
CN108537215B (en) Flame detection method based on image target detection
CN110826379B (en) Target detection method based on feature multiplexing and YOLOv3
CN106778705B (en) Pedestrian individual segmentation method and device
CN110084292A (en) Object detection method based on DenseNet and multi-scale feature fusion
CN102496001B (en) Method of video monitor object automatic detection and system thereof
CN110751099B (en) Unmanned aerial vehicle aerial video track high-precision extraction method based on deep learning
CN109284670A (en) A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN106874894A (en) A kind of human body target detection method based on the full convolutional neural networks in region
CN108596053A (en) A kind of vehicle checking method and system based on SSD and vehicle attitude classification
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN110991311A (en) Target detection method based on dense connection deep network
Li et al. Automatic bridge crack identification from concrete surface using ResNeXt with postprocessing
CN105608456A (en) Multi-directional text detection method based on full convolution network
CN105869146B (en) SAR image change detection based on conspicuousness fusion
CN108960261A (en) A kind of obvious object detection method based on attention mechanism
CN114612937B (en) Pedestrian detection method based on single-mode enhancement by combining infrared light and visible light
CN105975925A (en) Partially-occluded pedestrian detection method based on joint detection model
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN108986142A (en) Shelter target tracking based on the optimization of confidence map peak sidelobe ratio
CN109558803B (en) SAR target identification method based on convolutional neural network and NP criterion
CN108460336A (en) A kind of pedestrian detection method based on deep learning
CN115131580B (en) Space target small sample identification method based on attention mechanism
CN110751644A (en) Road surface crack detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination