CN113111679A - Design method of human-shaped upper half monitoring network structure - Google Patents

Design method of human-shaped upper half monitoring network structure

Info

Publication number
CN113111679A
CN113111679A (application number CN202010020483.7A)
Authority
CN
China
Prior art keywords
layer
calculated
human
stride
input size
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010020483.7A
Other languages
Chinese (zh)
Inventor
于晓静
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ingenic Semiconductor Co Ltd
Original Assignee
Beijing Ingenic Semiconductor Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ingenic Semiconductor Co Ltd filed Critical Beijing Ingenic Semiconductor Co Ltd
Priority to CN202010020483.7A priority Critical patent/CN113111679A/en
Publication of CN113111679A publication Critical patent/CN113111679A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for designing a human upper-body monitoring network structure, comprising the following steps. S1, initial setting: set the input size to m x m, where m > 24; when the stride is 1, the output size of a layer is n-2, and when the stride is 2 it is (n-2)/2, where n is the input size of that layer; because the final output is Nx3x3, where N is the number of channels, a 3x3 convolution is performed and the score and position coordinates of the candidate frame are finally regressed. S2, set the stride of every layer to 2 and derive the input size backward from the formula in S1, obtaining the input size, shape, and calculation amount of each layer of the network. S3, use the network designed in S2 to perform first-stage training on the human upper body.

Description

Design method of human-shaped upper half monitoring network structure
Technical Field
The invention relates to the field of intelligent identification, in particular to a design method of a human-shaped upper half body monitoring network structure.
Background
With the continuous development of science and technology, and in particular of computer vision, intelligent recognition technology is widely applied in fields such as information security and electronic authentication, and image feature extraction methods achieve good recognition performance.
Furthermore, common terms in the art include:
Upper-body detection: detecting the region of a human body comprising the head and extending to the shoulders and elbows.
Cascaded convolutional networks: an N-stage network design in which the task is processed from coarse to fine.
Recall rate: for example, the number of face frames successfully detected in a picture, as a percentage of the number of faces originally present in the picture.
False detection rate: for example, the number of incorrectly detected face frames in a picture, as a percentage of the number of faces originally present in the picture.
MAC count: the number of multiply-and-add (multiply-accumulate) operations.
MTCNN (multi-task cascaded convolutional neural network) puts face region detection and face keypoint detection together in a cascade framework. The whole can be divided into a three-stage network structure: the first stage P-Net, the second stage R-Net, and the third stage O-Net. The MTCNN pipeline is: image -> P-Net -> R-Net -> O-Net -> output. First an image pyramid is constructed: the image is transformed to different scales so that recognition targets of different sizes can be detected. That is, the image passes through a pyramid, generating images at multiple scales, which are then input to the P-Net. P-Net (proposal network) is basically a small fully convolutional network (FCN). It performs preliminary feature extraction and frame calibration on the image pyramid, adjusts windows with bounding-box regression, and filters out most windows with the non-maximum suppression (NMS) algorithm. Because the P-Net is small it can quickly select candidate regions, but its accuracy is not high; NMS is then used to merge the candidate frames, and image patches are extracted according to the candidate frames.
R-Net (refine network) is basically a convolutional neural network that adds a fully connected layer relative to the first-stage P-Net, so the screening of input data is stricter. After a picture passes through the P-Net, many prediction windows remain; all of them are sent to the R-Net, which filters out a large number of poor candidate frames, and finally bounding-box regression and NMS are applied to the remaining candidate frames to further refine the predictions. The R-Net selects frames accurately and typically leaves only a few, which are input to the O-Net. O-Net (output network) is basically a more complex convolutional neural network, adding one convolutional layer compared with the R-Net. The O-Net differs from the R-Net in that this stage identifies the face region under more supervision and regresses the person's facial feature points, finally outputting five feature points. Although the O-Net is slower, few images reach it, because high-probability bounding boxes have already been obtained by the first two networks; the O-Net then outputs accurate bounding-box and keypoint information.
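As background only (this is standard practice, not part of the claimed method), the NMS step described above is usually implemented as a greedy overlap filter. The following minimal Python sketch illustrates it; the (x1, y1, x2, y2) box format, the function names, and the 0.5 threshold are assumptions chosen for illustration:

```python
def iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    # greedily keep the highest-scoring box, drop boxes that overlap it too much
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```

For example, of two heavily overlapping candidate frames only the higher-scoring one survives, while a distant frame is kept.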
However, the prior art has problems. A first-stage network is generally used to quickly generate candidate frames for the later-stage networks, and the usual first-stage design uses a small resolution, because the top of the funnel must be fast; for example, the design input size for faces is 12x12. But this design is problematic in actual engineering use: because the resolution is too small, features are lost and the expressive capability is reduced, so the first-stage network generates many candidate frames, sometimes up to several thousand. Measured by recall rate the effect looks good, but the false detection rate is very poor, which greatly hurts the efficiency of the later stages. For an embedded product this is fatal, because it adds a great deal of computation: for example, thousands of extra image patches are sent to the next-stage network, and since the next stage is a finer network, the computation grows sharply and the benefit of the cascade is greatly reduced.
Disclosure of Invention
In order to solve the above problems, the present invention proceeds as follows. The cascaded convolutional network is divided into N stages, and the function of the first stage is crucial. While guaranteeing the recall rate, the invention reduces the amount of computation, greatly reduces the false detection rate, and greatly reduces the pressure on the later-stage networks, which makes the whole project viable when deployed on an embedded platform.
Specifically, the invention provides a human upper body monitoring network structure design method, which comprises the following steps:
S1, initial setting: set the input size to m x m, where m > 24;
when the stride is 1, the output size of a layer is n-2, and when the stride is 2 it is (n-2)/2, where n is the input size of that layer;
because the final output is Nx3x3, where N is the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed;
S2, set the stride of every layer to 2 and derive the input size backward from the formula in S1, obtaining the input size, shape, and calculation amount of each layer of the network;
S3, use the network designed in S2 to perform first-stage training on the human upper body.
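The size formulas of S1 can be written down directly; a short Python sketch (the function name conv_out is mine, not from the patent) shows that applying the stride-2 formula three times to a 38x38 input yields the 18, 8, 3 sequence used later:

```python
def conv_out(n: int, stride: int) -> int:
    # output size of a 3x3, no-padding convolution, per the patent's formulas:
    # stride 1 -> n - 2;  stride 2 -> (n - 2) / 2
    if stride == 1:
        return n - 2
    if stride == 2:
        return (n - 2) // 2
    raise ValueError("only strides 1 and 2 are considered")

# With stride 2 everywhere, a 38x38 input shrinks 38 -> 18 -> 8 -> 3.
sizes = []
n = 38
for _ in range(3):
    n = conv_out(n, 2)
    sizes.append(n)
```

The same helper reproduces the stride-1 case too, e.g. a 24x24 input gives 22x22 after one stride-1 layer.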
In step S1, m = 38, i.e., the input size for the human upper body is 38x38.
The N is 16.
The step S2 further includes:
a first layer: here the first-layer n = 38, so (38-2)/2 = 18;
a second layer: from the result of the first-layer calculation, the second-layer n = 18, so (18-2)/2 = 8;
a third layer: from the result of the second-layer calculation, the third-layer n = 8, so (8-2)/2 = 3.
With this formula we derive the input size backward: the stride is 2 everywhere, the final (n-2)/stride = 3, and the input sizes of the layers are 8, 18, and 38. Here 3 is the size of the feature map obtained at the last layer, i.e., the 3 in Nx3x3: because the final output is Nx3x3 with N the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed.
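The backward derivation is a one-line inversion of the stride-2 formula, (n-2)/2 = out, so n = 2*out + 2. A hedged Python sketch (the function name is mine):

```python
def reverse_sizes(final_size: int = 3, layers: int = 3) -> list:
    # invert the stride-2 rule (n - 2) / 2 = out  =>  n = 2 * out + 2
    sizes = [final_size]
    for _ in range(layers):
        sizes.append(2 * sizes[-1] + 2)
    return sizes[::-1]  # input-to-output order
```

Starting from the final 3x3 feature map, this reproduces the 38, 18, 8, 3 chain given above.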
The calculation amounts of the first layer, the second layer, and the third layer are respectively:
first layer calculation amount: 18*18*3*3*16*1 = 46,656
second layer calculation amount: 8*8*3*3*16*16 = 147,456
third layer calculation amount: 3*3*3*3*16*16 = 20,736;
the total calculation amount is 214,848.
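Each of the per-layer calculation amounts follows the same pattern: output area, times kernel area, times input channels, times output channels. A small Python sketch (the function name is mine) reproduces the three figures and their total:

```python
def conv_macs(out_size: int, k: int, c_in: int, c_out: int) -> int:
    # one multiply-add per kernel element, per input channel, per output element
    return out_size * out_size * k * k * c_in * c_out

per_layer = [
    conv_macs(18, 3, 1, 16),   # first layer:  18*18*3*3*1*16
    conv_macs(8, 3, 16, 16),   # second layer: 8*8*3*3*16*16
    conv_macs(3, 3, 16, 16),   # third layer:  3*3*3*3*16*16
]
```

Summing per_layer gives the total MAC count of the designed network.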
A verification step: with N = 16, the total calculation amount computed above is compared with the total calculation amount of a prior-art network designed with input size 24x24 and strides 1, 2, 1, 2, yielding the conclusion that the calculation amount of the present method is less than that of the prior art.
The prior art specifically sets the human upper-body sample design size to 24x24; the network design strides must then be 1, 2, 1, 2. Since the final output is Nx3x3, with N the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed. Taking the same number of channels, 16, as the standard, the first convolution layer yields 22x22, the second 10x10, the third 8x8, and the fourth 3x3. The specific calculation amounts are:
first layer calculation amount: 22*22*3*3*16*1 = 69,696
second layer calculation amount: 10*10*3*3*16*16 = 230,400
third layer calculation amount: 8*8*3*3*16*16 = 147,456
fourth layer calculation amount: 3*3*3*3*16*16 = 20,736
The total calculation amount is: 468,288.
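The verification comparison between the prior 24x24 design and the 38x38 design can be reproduced with a small helper (the function and variable names are mine, not from the patent):

```python
def network_macs(out_sizes, channels):
    # 3x3 kernels; layer i maps channels[i] -> channels[i + 1]
    # at output spatial size out_sizes[i]
    return sum(s * s * 9 * c_in * c_out
               for s, c_in, c_out in zip(out_sizes, channels, channels[1:]))

prior_24 = network_macs([22, 10, 8, 3], [1, 16, 16, 16, 16])  # four layers
ours_38 = network_macs([18, 8, 3], [1, 16, 16, 16])           # three layers
```

This gives 468,288 versus 214,848 MACs, confirming the claimed reduction of more than half.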
Thus, the present application has the following advantages. The increased resolution preserves more features and gives stronger expressive capability, while the calculation amount is cut by more than half, so both accuracy and efficiency are high. Using the network obtained by this design method for first-stage training on the human upper body, actual test data show that false detections drop from more than 2,000 to about 100. This greatly improves the efficiency of both the first-stage coarse detection network and the later fine networks, greatly reduces the number of frames sent to the later networks for processing, and at the same time guarantees the accuracy of the first-stage network.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
FIG. 1 is a schematic flow diagram of the process of the present invention.
Detailed Description
In order that the technical contents and advantages of the present invention can be more clearly understood, the present invention will now be described in further detail with reference to the accompanying drawings.
As shown in FIG. 1, the invention relates to a design method of a human-shaped upper half monitoring network structure, which comprises the following steps:
S1, initial setting: set the input size to 38x38;
Here 38x38 is the core of the reverse design of the network structure. Increasing the number of network layers would affect performance and the detection rate, because the input size determines the minimum target size the network can actually detect in a frame, and thus the detection distance. For example, if the upper body of a human is scaled down to 24, more features are lost than at 38, causing more false detections; 56 preserves more detail than 38 but reduces the detection distance.
When the stride is 2, the output size of a layer is (n-2)/2, where n is the input size of that layer.
The advantage of choosing stride 2 is as follows. In the field of target detection, the general demand is that the farther the detection distance the better, which requires a larger input resolution; but as the resolution grows, the detection time of the whole network rises sharply and practicality is lost. Setting the stride to 2 allows small, distant targets to be detected while keeping the growth of detection time under control. In actual measurements, detection at 640x360 resolution reaches 5 meters and takes 200 milliseconds, and at 800x450 resolution reaches 8 meters and takes 270 milliseconds; if the stride were set to 1, the first layer alone would increase the time to 650 milliseconds.
Because the final output is Nx3x3, where N is the number of channels, here set to 16, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed;
S2, set the stride of every layer to 2 and derive the input size backward from the formula in S1, obtaining the input size, shape, and calculation amount of each layer of the network:
a first layer: here the first-layer n = 38, so (38-2)/2 = 18, and the calculation amount is:
18*18*3*3*16*1 = 46,656;
a second layer: from the result of the first-layer calculation, the second-layer n = 18, so (18-2)/2 = 8, and the calculation amount is: 8*8*3*3*16*16 = 147,456;
a third layer: from the result of the second-layer calculation, the third-layer n = 8, so (8-2)/2 = 3, and the calculation amount is: 3*3*3*3*16*16 = 20,736.
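The three stride-2 layers of S2 can be traced in a few lines of Python; variable names are mine, and 3x3 no-padding convolutions are assumed as in the formulas above:

```python
size, c_in = 38, 1
macs = []
for c_out in (16, 16, 16):       # three stride-2 layers, 16 output channels each
    size = (size - 2) // 2       # 3x3 convolution, stride 2, no padding
    macs.append(size * size * 9 * c_in * c_out)
    c_in = c_out
# size traces 38 -> 18 -> 8 -> 3
```

The loop ends with the 3x3 feature map and the three per-layer MAC counts of the designed network.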
S3, according to the designed network obtained in S2, the human upper body is trained in the first stage.
Specifically, it can be further understood as explained below:
1. The face algorithm input size is 12x12. The human upper-body sample design size is twice that of the face, 24x24, so the network design strides must be 1, 2, 1, 2, because the final output is Nx3x3, with N the number of channels; a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed. When the stride is 1 the output size of a layer is n-2, and when the stride is 2 it is (n-2)/2, where n is the input size of each layer. For example, with a human-shape input of 24x24 and the same channel count of 16 as the standard, the first convolution layer yields 22x22, the second 10x10, the third 8x8, and the fourth 3x3.
First layer calculated amount:
22*22*3*3*16*1=69,696
second layer calculated quantity:
10*10*3*3*16*16=230,400
third layer calculated quantity:
8*8*3*3*16*16=147,456
fourth layer calculation amount:
3*3*3*3*16*16=20,736
total MAC number: 468,288
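The prior-art size trace above (24 -> 22 -> 10 -> 8 -> 3) can be checked mechanically with the same two formulas; a small sketch (variable names are mine):

```python
size = 24
trace = []
for stride in (1, 2, 1, 2):      # the prior-art stride pattern
    # 3x3 convolution, no padding: stride 1 -> n - 2; stride 2 -> (n - 2) / 2
    size = size - 2 if stride == 1 else (size - 2) // 2
    trace.append(size)
```

The trace ends at the 3x3 map that feeds the final regression, matching the layer sizes listed above.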
2. The input size of our method is 38x38. This is an absolute advantage in both accuracy and efficiency: the increased resolution preserves more features and gives stronger expressive capability, while the calculation amount is greatly reduced, by more than half.
With this formula we derive the input size backward: the stride is 2 everywhere, the final (n-2)/stride = 3, and the input sizes of the layers are 8, 18, and 38.
First layer calculated amount:
18*18*3*3*16*1=46,656
second layer calculated quantity:
8*8*3*3*16*16=147,456
third layer calculated quantity:
3*3*3*3*16*16=20,736
total MAC number: 214,848
3. The human upper body is trained at the first stage with this network; false detections drop from more than 2,000 to about 100, greatly improving the efficiency of the first-stage coarse detection network and the later fine networks and greatly reducing the number of frames sent to the later networks for processing, while the accuracy of the first-stage network is guaranteed.
The above is only a preferred embodiment of the present invention and is not intended to limit it; those skilled in the art may make various modifications and changes to the embodiment. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall be included in its protection scope.

Claims (7)

1. A human-shaped upper half monitoring network structure design method is characterized by comprising the following steps:
S1, initial setting: set the input size to m x m, where m > 24;
when the stride is 1, the output size of a layer is n-2, and when the stride is 2 it is (n-2)/2, where n is the input size of that layer;
because the final output is Nx3x3, where N is the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed;
S2, set the stride of every layer to 2 and derive the input size backward from the formula in S1, obtaining the input size, shape, and calculation amount of each layer of the network;
S3, use the network designed in S2 to perform first-stage training on the human upper body.
2. The method as claimed in claim 1, wherein N is 16.
3. The method as claimed in claim 2, wherein in step S1, m = 38, that is, the input size of the human upper body is 38x38.
4. The method for designing a human upper body monitoring network structure as claimed in claim 3, wherein the step S2 further comprises:
a first layer: wherein the first-layer n = 38, so (38-2)/2 = 18;
a second layer: from the result of the first-layer calculation, the second-layer n = 18, so (18-2)/2 = 8;
a third layer: from the result of the second-layer calculation, the third-layer n = 8, so (8-2)/2 = 3.
5. The design method of the human upper-body monitoring network structure according to claim 4, wherein the calculation amounts of the first layer, the second layer, and the third layer are respectively:
first layer calculation amount: 18*18*3*3*16*1 = 46,656
second layer calculation amount: 8*8*3*3*16*16 = 147,456
third layer calculation amount: 3*3*3*3*16*16 = 20,736;
the total calculation amount is 214,848.
6. The method as claimed in claim 5, further comprising a verification step: with N = 16, the calculated total calculation amount is compared with the total calculation amount of a prior-art network designed with input size 24x24 and strides 1, 2, 1, 2, yielding the conclusion that the calculation amount of the present method is less than that of the prior art.
7. The design method of the human upper-body monitoring network structure according to claim 6, wherein the prior art specifically comprises: the human upper-body sample design size is 24x24, so the network design strides must be 1, 2, 1, 2; since the final output is Nx3x3, with N the number of channels, a 3x3 convolution is performed and the score and position coordinates of the face candidate frame are finally regressed; taking the number of channels as 16 as the standard, the first convolution layer yields 22x22, the second 10x10, the third 8x8, and the fourth 3x3; the specific calculation amounts are:
first layer calculation amount: 22*22*3*3*16*1 = 69,696
second layer calculation amount: 10*10*3*3*16*16 = 230,400
third layer calculation amount: 8*8*3*3*16*16 = 147,456
fourth layer calculation amount: 3*3*3*3*16*16 = 20,736
the total calculation amount is: 468,288.
CN202010020483.7A 2020-01-09 2020-01-09 Design method of human-shaped upper half monitoring network structure Pending CN113111679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010020483.7A CN113111679A (en) 2020-01-09 2020-01-09 Design method of human-shaped upper half monitoring network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010020483.7A CN113111679A (en) 2020-01-09 2020-01-09 Design method of human-shaped upper half monitoring network structure

Publications (1)

Publication Number Publication Date
CN113111679A (en) 2021-07-13

Family

ID=76708630

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010020483.7A Pending CN113111679A (en) 2020-01-09 2020-01-09 Design method of human-shaped upper half monitoring network structure

Country Status (1)

Country Link
CN (1) CN113111679A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160148079A1 (en) * 2014-11-21 2016-05-26 Adobe Systems Incorporated Object detection using cascaded convolutional neural networks
CN107506707A (en) * 2016-11-30 2017-12-22 奥瞳系统科技有限公司 Using the Face datection of the small-scale convolutional neural networks module in embedded system
US20180150684A1 (en) * 2016-11-30 2018-05-31 Shenzhen AltumView Technology Co., Ltd. Age and gender estimation using small-scale convolutional neural network (cnn) modules for embedded systems
CN109472247A (en) * 2018-11-16 2019-03-15 西安电子科技大学 Face identification method based on the non-formula of deep learning
CN110175504A (en) * 2019-04-08 2019-08-27 杭州电子科技大学 A kind of target detection and alignment schemes based on multitask concatenated convolutional network
WO2019200749A1 (en) * 2018-04-17 2019-10-24 平安科技(深圳)有限公司 Facial recognition method, apparatus, computing device and storage medium
CN110619319A (en) * 2019-09-27 2019-12-27 北京紫睛科技有限公司 Improved MTCNN model-based face detection method and system


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
尹茜, "A face detection algorithm based on a lightweight neural network", Journal of Changzhou College of Information Technology, vol. 18, no. 06, pages 23-27 *
朱鹏 et al., "A lightweight multi-scale-feature face detection method", Computer Technology and Development *

Similar Documents

Publication Publication Date Title
Jiang et al. HDCB-Net: A neural network with the hybrid dilated convolution for pixel-level crack detection on concrete bridges
CN109117876B (en) Dense small target detection model construction method, dense small target detection model and dense small target detection method
CN108537215B (en) Flame detection method based on image target detection
CN110826379B (en) Target detection method based on feature multiplexing and YOLOv3
CN106778705B (en) Pedestrian individual segmentation method and device
CN110084292A (en) Object detection method based on DenseNet and multi-scale feature fusion
CN102496001B (en) Method of video monitor object automatic detection and system thereof
CN110751099B (en) Unmanned aerial vehicle aerial video track high-precision extraction method based on deep learning
CN109284670A (en) A kind of pedestrian detection method and device based on multiple dimensioned attention mechanism
CN106874894A (en) A kind of human body target detection method based on the full convolutional neural networks in region
CN108596053A (en) A kind of vehicle checking method and system based on SSD and vehicle attitude classification
CN111160249A (en) Multi-class target detection method of optical remote sensing image based on cross-scale feature fusion
CN110991311A (en) Target detection method based on dense connection deep network
Li et al. Automatic bridge crack identification from concrete surface using ResNeXt with postprocessing
CN105608456A (en) Multi-directional text detection method based on full convolution network
CN105869146B (en) SAR image change detection based on conspicuousness fusion
CN108960261A (en) A kind of obvious object detection method based on attention mechanism
CN114612937B (en) Pedestrian detection method based on single-mode enhancement by combining infrared light and visible light
CN105975925A (en) Partially-occluded pedestrian detection method based on joint detection model
CN112036381B (en) Visual tracking method, video monitoring method and terminal equipment
CN108986142A (en) Shelter target tracking based on the optimization of confidence map peak sidelobe ratio
CN109558803B (en) SAR target identification method based on convolutional neural network and NP criterion
CN108460336A (en) A kind of pedestrian detection method based on deep learning
CN115131580B (en) Space target small sample identification method based on attention mechanism
CN110751644A (en) Road surface crack detection method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination