CN113449691A - Human shape recognition system and method based on non-local attention mechanism - Google Patents

Human shape recognition system and method based on non-local attention mechanism

Info

Publication number
CN113449691A
Authority
CN
China
Prior art keywords
unit
cbl
module
feature
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110825928.3A
Other languages
Chinese (zh)
Inventor
孙磊
闫恒心
董恩增
佟吉钢
金帅羽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202110825928.3A
Publication of CN113449691A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A humanoid recognition system based on a non-local attention mechanism comprises a full convolutional neural network model based on YOLOv5s, composed of an input end, an improved Backbone network unit, a Neck network unit and a Head network unit, wherein the improved Backbone network unit is composed of a Focus focusing module, a CBLSP1_1 module, a CBLSP1_3 module, a CBL unit XI, an SPP module and a non-local attention mechanism module.

Description

Human shape recognition system and method based on non-local attention mechanism
[Technical Field]:
the invention belongs to the technical field of computer vision, and particularly relates to a human shape recognition system and method based on a non-local attention mechanism.
[Background Art]:
human shape recognition is a technique for recognizing and locating humanoid targets in images through human body imaging characteristics, and plays an important role in emerging fields such as intelligent monitoring, motion analysis and virtual reconstruction.
Early humanoid target recognition had no standardized algorithm and mostly used traditional classifiers such as SVM and AdaBoost. The recognition accuracy and recognition time of such models are limited: accuracy is low when real-time recognition requirements must be met, and raising the accuracy comes at the cost of longer recognition time.
With the development of deep learning in the field of machine learning, methods that iteratively train neural network models on data have matured and have achieved some success in real-time human shape recognition tasks. However, because experimental data covers diverse scenes, and attributes such as age, gender, posture and clothing are difficult to distinguish across categories, a large amount of auxiliary information has to be introduced during training to guarantee recognition accuracy. The resulting parameters occupy device resources, cause extra consumption and reduce real-time performance. Many applications, such as autonomous driving, require complex scenes to be recognized and reacted to almost instantaneously, with real-time detection of various objects on the road, and the response capability and accuracy of many current models fall short of this.
The attention mechanism is a resource allocation scheme that effectively addresses information overload: borrowing from the signal-processing characteristics of the human brain, it focuses the hardware's computing power on the information that most needs attention. Because the attention mechanism dynamically generates connection weights between features, feature extraction becomes more targeted without greatly increasing the number of features, so the attention mechanism can be used as a neural network component to improve the recognition efficiency of a model.
[Summary of the Invention]:
the invention aims to provide a human shape recognition system and method based on a non-local attention mechanism which overcome the defects of the prior art and are simple, feasible and robust.
The technical scheme of the invention is as follows: a humanoid recognition system based on a non-local attention mechanism comprises a full convolutional neural network model based on YOLOv5s, composed of an input end, a Backbone network unit, a Neck network unit and a Head network unit. It is characterized in that the Backbone network unit is an improved Backbone structure to which a non-local attention mechanism is added to enhance feature extraction, and is composed of a Focus focusing module, a CBLSP1_1 module, a CBLSP1_3 module, a CBL (Convolution + Batch Normalization + Leaky ReLU activation) unit XI, an SPP (Spatial Pyramid Pooling) module and a non-local attention mechanism module.
The number of CBLSP1_3 modules is two, namely CBLSP1_3 module I and CBLSP1_3 module II; the number of non-local attention mechanism modules is two, namely non-local attention mechanism module I and non-local attention mechanism module II. The Focus focusing module, the CBLSP1_1 module, the CBLSP1_3 module I and the non-local attention mechanism module I are connected in series; the input end of the Focus focusing module receives the original image signal to be classified, and the output end outputs the extracted image features, which serve as the first features of the image. The CBLSP1_3 module II and the non-local attention mechanism module II are connected in series; the input end receives the first feature signal of the image, and the output end outputs the weighted image features, which serve as the second features of the image. The CBL unit XI and the SPP module are connected in series; the input end of the CBL unit XI receives the second feature signal of the image, and the output is the image feature obtained after the second weighting processing, which serves as the third feature of the image.
The CBLSP1_1 module is based on a cross-stage partial network (CSP) structure and is composed of a CBL unit II, a CBL unit III, 1 residual component, a convolutional layer I, a convolutional layer II, a tensor splicing unit II, a batch normalization unit I, an activation function unit I and a CBL unit IV. The input end of the CBL unit II receives the feature map extracted by the Focus focusing module, and its output end is connected to the CBL unit III and the convolutional layer II respectively; the output end of the CBL unit III is connected in sequence to the residual component, the convolutional layer I and the input end of the tensor splicing unit II; the output end of the convolutional layer II is connected to the input end of the tensor splicing unit II. The tensor splicing unit II splices the image features output by the convolutional layer I and the convolutional layer II, and its output end is connected in sequence to the batch normalization unit I, the activation function unit I and the CBL unit IV; the output end of the CBL unit IV outputs the feature map extracted by the CBLSP1_1 module to the CBLSP1_3 module I.
The CBLSP1_3 module I is based on a cross-stage partial network (CSP) structure. It consists of a CBL unit V, a CBL unit VI, 3 residual component units, a convolutional layer III, a convolutional layer IV, a tensor splicing unit III, a batch normalization unit II, an activation function unit II and a CBL unit VII. The input end of the CBLSP1_3 module I is the input end of the CBL unit V, which receives the feature map output by the CBLSP1_1 module; the output end of the CBL unit V is connected to the CBL unit VI and the convolutional layer IV respectively. The output end of the CBL unit VI is connected in series to the three residual components and the convolutional layer III. The input end of the tensor splicing unit III receives the image feature information output by the convolutional layer III and the convolutional layer IV and splices it; the output end of the tensor splicing unit III is connected in sequence to the batch normalization unit II, the activation function unit II and the CBL unit VII, and finally the feature map further extracted by the CBLSP1_3 module I is output to the non-local attention mechanism module I.
The CBLSP1_3 module II is an independent module with the same structure and data processing as the CBLSP1_3 module I, composed of a CBL unit VIII, a CBL unit IX, 3 residual component units, a convolutional layer V, a convolutional layer VI, a tensor splicing unit IV, a batch normalization unit III, an activation function unit III and a CBL unit X. The input end of the CBLSP1_3 module II is the input end of the CBL unit VIII, which receives the first features output by the non-local attention mechanism module I; the output end of the CBL unit VIII is connected to the CBL unit IX and the convolutional layer VI respectively. The output end of the CBL unit IX is connected in series to the three residual components and the convolutional layer V. The input end of the tensor splicing unit IV receives the image feature information output by the convolutional layer V and the convolutional layer VI and splices it; the output end of the tensor splicing unit IV is connected in sequence to the batch normalization unit III, the activation function unit III and the CBL unit X, and finally the first features further extracted by the CBLSP1_3 module II are output to the input end of the non-local attention mechanism module II.
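As a concrete illustration of the CSP-style modules just described, the following PyTorch sketch wires a CBLSP1_3-shaped module along the connections above. It is a minimal sketch under assumptions: the internal form of a residual component, the channel counts and the Leaky ReLU slope are not specified in the patent and are chosen here for illustration (setting n=1 gives the CBLSP1_1 variant).

```python
import torch
import torch.nn as nn

def cbl(c_in, c_out, k=1, s=1):
    """CBL unit: Convolution + Batch Normalization + Leaky ReLU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),  # negative slope is an assumption
    )

class Residual(nn.Module):
    """One residual component: two CBL units plus a skip connection (assumed form)."""
    def __init__(self, c):
        super().__init__()
        self.block = nn.Sequential(cbl(c, c, 1), cbl(c, c, 3))

    def forward(self, x):
        return x + self.block(x)

class CBLSP1_3(nn.Module):
    """CBL unit V -> residual branch (CBL VI, 3 residual components, conv III)
    and shortcut branch (conv IV) -> tensor splicing -> BN -> activation -> CBL VII."""
    def __init__(self, c_in, c_out, n=3):
        super().__init__()
        c_mid = c_out // 2
        self.cbl_in = cbl(c_in, c_out, k=3, s=2)                 # 3x3 downsampling CBL
        self.branch = nn.Sequential(
            cbl(c_out, c_mid, k=1),                              # CBL unit VI
            *[Residual(c_mid) for _ in range(n)],                # n residual components
            nn.Conv2d(c_mid, c_mid, 1, bias=False),              # convolutional layer III
        )
        self.shortcut = nn.Conv2d(c_out, c_mid, 1, bias=False)   # convolutional layer IV
        self.bn = nn.BatchNorm2d(2 * c_mid)                      # batch normalization unit
        self.act = nn.LeakyReLU(0.1, inplace=True)               # activation function unit
        self.cbl_out = cbl(2 * c_mid, c_out, k=1)                # CBL unit VII

    def forward(self, x):
        x = self.cbl_in(x)
        y = torch.cat([self.branch(x), self.shortcut(x)], dim=1)  # tensor splicing
        return self.cbl_out(self.act(self.bn(y)))

# example: CBLSP1_3(64, 128)(torch.randn(1, 64, 152, 152)).shape -> [1, 128, 76, 76]
```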
The Focus focusing module is formed by connecting a slice processing unit, a tensor splicing unit I and a CBL unit I in sequence; compared with a conventional convolution operation, the Focus focusing module retains more complete picture information.
The SPP module is composed of a CBL unit XII, a maximum pooling unit, a tensor splicing unit V and a CBL unit XIII. The input end of the SPP module, i.e., the input end of the CBL unit XII, is connected to the output end of the CBL unit XI; it receives the feature map that the CBL unit XI outputs after downsampling the second features and further extracts the third features. The output end of the CBL unit XII is connected to the maximum pooling unit, the tensor splicing unit V and the CBL unit XIII respectively. The input end of the tensor splicing unit V simultaneously receives four feature maps, the output of the CBL unit XII and the three outputs of the maximum pooling unit, and splices them; the output end of the tensor splicing unit V is connected to the CBL unit XIII. Finally the third features are output from the output end of the CBL unit XIII to the Neck network.
The maximum pooling unit is composed of three maximum pooling modules, which collect features from the feature map with three maximum pooling sizes and process them into feature maps of the same size and dimension.
The CBL units I to XIII are all formed by connecting a convolutional layer, a batch normalization layer and an activation function layer in sequence; they serve as the minimum components of the YOLOv5s network structure and complete feature extraction on feature maps through convolution operations. The convolutional layers in the CBL units I, II, V, VIII and XI use a 3 × 3 convolution structure for downsampling and feature extraction; the convolutional layers I to VI and the convolutional layers in the CBL units III, IV, VI, VII, IX, X, XII and XIII all use 1 × 1 convolution, which unifies the channel number of the feature maps without changing their size and thereby facilitates feature map splicing.
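A PyTorch sketch of the CBL unit and the SPP module described above follows. This is a minimal sketch, assuming maximum pooling kernel sizes of 5, 9 and 13 (the values commonly used in YOLOv5; the patent does not state them) and a halved intermediate channel count:

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Convolutional layer + batch normalization layer + activation function layer."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.LeakyReLU(0.1, inplace=True)  # slope is an assumption

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPP(nn.Module):
    """CBL unit XII -> three parallel maximum poolings -> tensor splicing unit V
    (four feature maps of identical size) -> CBL unit XIII."""
    def __init__(self, c_in, c_out, pool_sizes=(5, 9, 13)):
        super().__init__()
        c_mid = c_in // 2
        self.cbl_xii = CBL(c_in, c_mid, k=1)     # 1x1 convolution unifies channels
        self.pools = nn.ModuleList(
            [nn.MaxPool2d(k, stride=1, padding=k // 2) for k in pool_sizes]
        )
        self.cbl_xiii = CBL(c_mid * 4, c_out, k=1)

    def forward(self, x):
        x = self.cbl_xii(x)
        # stride-1 pooling with padding keeps the size, so the four maps splice cleanly
        return self.cbl_xiii(torch.cat([x] + [p(x) for p in self.pools], dim=1))
```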
A human shape recognition method based on a non-local attention mechanism is characterized by comprising the following steps:
(1) acquiring an original picture to be identified, and performing data enhancement, adaptive bounding box calculation and picture scaling operations on the picture at the input end of the YOLOv5s-based full convolutional neural network model;
the original picture to be identified in step (1) comes from a humanoid picture data set composed of pictures and humanoid target annotation information; the pictures are live-action shots of the scene to be measured taken by the user, and no fewer than 10,000 pictures with a resolution of at least 608 × 608 are required.
The specific treatment process of the step (1) comprises the following steps:
(1-1) performing data enhancement on the original picture to be recognized, namely: randomly reading four pictures from the data set each time, shrinking each by a random proportion of 10-100%, and splicing the four shrunken pictures into a new picture that serves as a picture to be trained; this method enriches the number of humanoid targets in the pictures;
(1-2) performing adaptive bounding box calculation on the picture enhanced in step (1-1), namely: first, in the YOLO algorithm, an initial bounding box is set according to the aspect ratio and size of the humanoid targets in the data set; then, during training, the humanoid recognition network outputs predicted bounding boxes from the initial bounding box, and the difference between the predicted and annotated bounding boxes is back-propagated to update the bounding box adaptively;
(1-3) uniformly scaling the pictures to be trained obtained in step (1-1) for training, namely: given the length and width of the training picture, divide the given length and width by the length and width of the picture to be trained to obtain the length and width scaling coefficients respectively; compare the two coefficients and multiply the length and width of the picture to be trained by the smaller one to obtain the scaled size. The picture to be trained is thus shrunk until the longer of its two sides equals the given training picture size while the shorter side is smaller; subtracting the short side from the given size yields the blank size, which is divided by 2 to obtain the padding for each end of the short side, filled with an RGB (Red Green Blue) color.
The given training picture size in step (1-3) is 608 × 608; the RGB color parameters used are (114, 114, 114).
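A minimal sketch of this scaling-and-padding step, assuming OpenCV for the resize (the patent does not name an image library) and using the stated 608 × 608 target and (114, 114, 114) fill color:

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_size: int = 608,
              pad_color=(114, 114, 114)) -> np.ndarray:
    """Scale by the smaller coefficient, then pad both ends of the short side."""
    h, w = img.shape[:2]
    r = min(new_size / h, new_size / w)      # smaller of the two scaling coefficients
    nh, nw = round(h * r), round(w * r)      # scaled size: long side equals new_size
    img = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
    top = (new_size - nh) // 2               # blank size split over the two ends
    bottom = new_size - nh - top
    left = (new_size - nw) // 2
    right = new_size - nw - left
    return cv2.copyMakeBorder(img, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=pad_color)
```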
(2) Inputting the picture to be identified processed in step (1) as an input signal to the Focus focusing module in the Backbone network unit, and preliminarily extracting complete picture information;
the specific treatment process of the step (2) comprises the following steps:
(2-1) the input end of the slice processing unit receives the picture to be identified of the given size obtained in step (1) and performs a slicing operation on it;
the slicing operation in step (2-1) specifically comprises: obtaining four complementary pictures from the received picture by taking pixel information at alternating positions, namely: adjacent pixels in the picture to be identified are extracted one by one and assigned to different slice pictures, so that the four pixels in any 2 × 2 pixel region of the picture to be identified finally belong to four different slice pictures; this yields four mutually distinct, complementary slice pictures, each with a quarter of the area, that together contain all the information of the picture to be identified;
(2-2) the tensor splicing unit I in the Focus focusing module splices the four complementary slice pictures processed in step (2-1) to obtain a feature map whose length and width are halved and whose channel number is quadrupled relative to the picture to be identified;
(2-3) the CBL unit I processes the feature map obtained in step (2-2): the convolutional layer performs one convolution operation on the feature map to obtain a twice-downsampled feature map without information loss, the batch normalization layer optimizes the feature distribution using the mean and variance of the data, and finally the activation function layer completes the nonlinear mapping of the feature map;
the activation function layer uses the Leaky ReLU function, which assigns a non-zero slope ai to all negative values while keeping positive values unchanged, thereby introducing nonlinearity; the Leaky ReLU expression is shown in formula (2-1):

yi = xi, if xi ≥ 0; yi = xi / ai, if xi < 0 (2-1)

where yi is the output signal, xi is the input signal, and the non-zero slope ai takes values in (1, +∞).
The processed feature map size was 304 × 304.
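The slicing, splicing and CBL processing of steps (2-1) to (2-3) can be sketched as follows; the output channel count of CBL unit I is not stated in the patent and is an assumption of this sketch:

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice processing -> tensor splicing unit I -> CBL unit I.
    Every 2x2 pixel block is split over four complementary slices."""
    def __init__(self, c_in=3, c_out=32):  # c_out is assumed for illustration
        super().__init__()
        self.cbl = nn.Sequential(
            nn.Conv2d(4 * c_in, c_out, 3, 1, 1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x):
        # four complementary slices: half length/width, 4x channels after splicing
        sliced = torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2],
                            x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1)
        return self.cbl(sliced)

# a 608x608 input yields a 304x304 feature map, matching the size stated above:
# Focus()(torch.randn(1, 3, 608, 608)).shape -> torch.Size([1, 32, 304, 304])
```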
(3) The feature map extracted in step (2) is sent to the subsequent modules of the Backbone network unit for deep feature extraction, yielding the first, second and third features of the image respectively;
the specific treatment process comprises the following steps:
(3-1) sending the feature map processed in step (2) to the CBLSP1_1 module, the CBLSP1_3 module I and the non-local attention mechanism module I to extract the first features;
(3-2) sending the first features into the CBLSP1_3 module II and the non-local attention mechanism module II for weighting processing to extract the second features;
the non-local attention mechanism module I in step (3-1) and the non-local attention mechanism module II in step (3-2) have the same structure: using products across channels they establish a very large receptive field, capture global dependencies, and extract whole-image features by learning a weight distribution. Both represent similarity with an embedded Gaussian function, using the two embeddings of formulas (3-1) and (3-2):

θ(xi) = Wθ xi (3-1)

φ(xj) = Wφ xj (3-2)

The embedded Gaussian function then takes the form of formula (3-3):

f(xi, xj) = e^(θ(xi)^T φ(xj)) (3-3)

where x is the input signal, i is the index of the output position in space whose response is computed, j indexes all possible positions, Wθ and Wφ are weight matrices learned by 1 × 1 convolution, θ and φ denote the two feature embeddings, f, the embedded Gaussian function, expresses the similarity of the signal x between i and all j, e is the natural constant, and T denotes matrix transposition;

the final output signal yi is shown in formula (3-4):

yi = (1 / C(x)) Σ∀j f(xi, xj) g(xj) (3-4)

where g(xj) = Wg xj can be regarded as a linear embedding, Wg being a weight matrix obtained by 1 × 1 convolution, and the normalization factor

C(x) = Σ∀j f(xi, xj)

is the sum of the similarity f over all positions j.
The specific steps of the feature processing of the non-local attention mechanism module I and the non-local attention mechanism module II are as follows:
(3-1-1) linearly mapping the input feature map X of [c, h, w] structure with 1 × 1 convolutions to obtain the features φ, θ and g with half the number of channels;
(3-1-2) computing attention as the inner product of the feature φ and the transposed, rearranged feature θ using the embedded Gaussian function, and normalizing to obtain the attention coefficients;
(3-1-3) applying the attention coefficients to the feature g, then expanding the channel number back with a 1 × 1 convolution;
(3-1-4) adding the original input feature map X to obtain output features with global information.
(3-3) sending the second features into the CBL unit XI and the SPP module for weighting processing, and extracting the third features.
In the step (3), the first characteristic size is 76 × 76, the second characteristic size is 38 × 38, and the third characteristic size is 19 × 19.
(4) The Neck network receives the three image features extracted in step (3) and fuses them by tensor splicing, improving the diversity and robustness of the image features and yielding three feature maps, recorded as feature map I, feature map II and feature map III;
the specific treatment process comprises the following steps:
(4-1) the third features extracted in step (3-3) are enlarged by the up-sampling unit I and spliced with the second features by the tensor splicing unit VI; the splicing result is enlarged by the up-sampling unit II and spliced with the first features extracted in step (3-1) by the tensor splicing unit VII; the number of channels is then reduced by the convolutional layer VII to obtain feature map I;
(4-2) the splicing result of the second features and the up-sampled third features is spliced, by the tensor splicing unit VIII, with feature map I after size reduction by the convolutional layer VIII; the number of channels is then reduced by the convolutional layer IX to obtain feature map II;
(4-3) the third features extracted in step (3-3) are spliced, by the tensor splicing unit IX, with feature map II after size reduction by the convolutional layer X; the number of channels is then reduced by the convolutional layer XI to obtain feature map III;
in step (4), the size of feature map I is 76 × 76, that of feature map II is 38 × 38 and that of feature map III is 19 × 19; the convolution sizes of the convolutional layers VIII and X are 3 × 3, and those of the convolutional layers VII, IX and XI are 1 × 1.
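Steps (4-1) to (4-3) amount to an FPN-style top-down pass followed by a PAN-style bottom-up pass. The sketch below only traces the tensor shapes: the channel counts (128/256/512) are illustrative assumptions, and the convolutions are instantiated inline purely for the shape check, not as a trained network:

```python
import torch
import torch.nn as nn

c1, c2, c3 = 128, 256, 512        # assumed channels of the first/second/third features
up = nn.Upsample(scale_factor=2, mode="nearest")

f1 = torch.randn(1, c1, 76, 76)   # first features
f2 = torch.randn(1, c2, 38, 38)   # second features
f3 = torch.randn(1, c3, 19, 19)   # third features

# (4-1) top-down: up-sample, splice, reduce channels with a 1x1 convolution
s1 = torch.cat([up(f3), f2], dim=1)                                      # splicing unit VI
fm1 = nn.Conv2d(s1.shape[1] + c1, 6, 1)(torch.cat([up(s1), f1], dim=1))  # conv layer VII

# (4-2) bottom-up: a 3x3 stride-2 convolution shrinks feature map I
d1 = nn.Conv2d(6, 6, 3, stride=2, padding=1)(fm1)                        # conv layer VIII
fm2 = nn.Conv2d(s1.shape[1] + 6, 6, 1)(torch.cat([s1, d1], dim=1))       # conv layer IX

# (4-3) shrink feature map II and splice with the third features
d2 = nn.Conv2d(6, 6, 3, stride=2, padding=1)(fm2)                        # conv layer X
fm3 = nn.Conv2d(c3 + 6, 6, 1)(torch.cat([f3, d2], dim=1))                # conv layer XI

print(fm1.shape, fm2.shape, fm3.shape)  # 76x76, 38x38 and 19x19 maps, 6 channels each
```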
(5) The Head network receives the three feature maps with 6 channels obtained in step (4) and tensor-splices them to obtain the dimension matrices of the spliced feature maps. It then determines all humanoid targets identified by the model by calculating the bounding-box loss and applying non-maximum suppression, draws bounding boxes from the humanoid coordinates contained in the dimension matrices, and finally outputs pictures annotated with all humanoid bounding boxes; at the same time, the model's annotations are back-propagated to the input end for adaptive bounding box calculation, updating the predicted bounding box positions for the next round of training;
the feature map splicing of step (5) is shown in formula (5-1):
76×76×3×6+38×38×3×6+19×19×3×6=22743×6 (5-1)
where 76 × 76 is the size of feature map I, 3 is the number of RGB channels and 6 is the length of a dimension matrix; 38 × 38 is the size of feature map II and 19 × 19 the size of feature map III; after splicing, 22743 dimension matrices of length 6 are finally formed;
the dimension matrix is the representation of the prediction result; a single dimension matrix has the form of formula (5-2):
[tx ty tw th n p] (5-2)
where the first four digits are the bounding-box coordinates predicted by the network: tx and ty represent the center coordinates of the humanoid bounding box, and tw and th its width and height; the fifth digit n represents the bounding-box confidence; the last digit p represents the humanoid probability.
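The count in formula (5-1) can be verified directly:

```python
# check of formula (5-1): dimension matrices produced by the spliced feature maps
total = (76 * 76 + 38 * 38 + 19 * 19) * 3
print(total)  # 22743 dimension matrices, each [tx, ty, tw, th, n, p] of length 6
```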
The loss in step (5) is calculated using GIoU_loss as the bounding-box loss function; the GIoU_loss calculation is shown in formula (5-3):

L_GIoU = 1 − IoU + (Ac − B) / Ac (5-3)

where L_GIoU is the GIoU_loss, Ac represents the area of the minimum bounding rectangle of the two windows, B represents the total (union) area of the two windows, and IoU represents the overlap degree of the two windows, i.e., the ratio of their overlap area to their total area; all areas are calculated from the bounding-box information contained in the dimension matrices.
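A minimal sketch of the GIoU_loss of formula (5-3) for two axis-aligned windows. Corner-format boxes (x1, y1, x2, y2) are assumed, so the center/width/height coordinates of a dimension matrix would first be converted:

```python
def giou_loss(box_a, box_b):
    """GIoU_loss between two windows given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih                                       # overlap area of the windows
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                       # B: total area of the windows
    iou = inter / union
    ac = ((max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])) *
          (max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])))  # minimum bounding rectangle
    return 1.0 - iou + (ac - union) / ac                  # L_GIoU, formula (5-3)

print(giou_loss((0, 0, 2, 2), (1, 1, 3, 3)))  # two partly overlapping windows -> ~1.079
```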
The non-maximum suppression method in step (5) determines a window from the humanoid coordinates contained in each dimension matrix and compares the confidences of the dimension matrices, finally determining the highest-confidence window at every local position. The specific processing steps are:
(5-1) first sort the dimension matrices by their confidence, select the window contained in the dimension matrix with the highest confidence, and call it the maximum-confidence window;
(5-2) select a first neighbouring window adjacent to the selected maximum-confidence window and calculate the overlap degree between the two, where the overlap degree is the ratio of the overlap area of the two windows to their total area;
(5-3) compare the overlap degree with a set overlap threshold: if it exceeds the threshold, the two windows overlap too much, meaning they mark the same humanoid target, and the neighbouring window is excluded; if it is below the threshold, the two windows barely overlap, meaning they mark different humanoid targets, and the neighbouring window is temporarily retained;
(5-4) select a second neighbouring window, calculate and compare its overlap with the maximum-confidence window, and repeat steps (5-2) and (5-3) until all remaining windows have been compared and every window marking the same humanoid target has been excluded; the maximum-confidence window is then selected as the first humanoid target, so that only one window is kept per humanoid target;
(5-5) select the window with the second-highest confidence among the temporarily retained windows and repeat steps (5-2), (5-3) and (5-4) until the second humanoid target is selected;
(5-6) repeat step (5-5) until all humanoid targets are found.
The window overlap threshold in step (5) is taken as 0.5.
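A minimal sketch of the suppression loop of steps (5-1) to (5-6), using the 0.5 overlap threshold; detections are assumed here to be corner-format boxes carrying a confidence value:

```python
def nms(detections, overlap_threshold=0.5):
    """Non-maximum suppression; each detection is (x1, y1, x2, y2, confidence)."""
    def overlap(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1]) +
                 (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    # (5-1) sort the windows by confidence, highest first
    remaining = sorted(detections, key=lambda d: d[4], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)          # current maximum-confidence window
        kept.append(best)                # one window kept per humanoid target
        # (5-2) to (5-4): exclude neighbours whose overlap degree is too high
        remaining = [d for d in remaining if overlap(best, d) <= overlap_threshold]
    return kept
```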
(6) Steps (1) to (5) are executed cyclically to obtain the trained human shape recognition network model, which is then stored.
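Step (6) corresponds to an ordinary training cycle. Under PyTorch it might look like the following sketch, where model, train_loader and compute_loss are assumed to be assembled from the components above and the optimizer settings are illustrative, not taken from the patent:

```python
import torch

def train(model, train_loader, compute_loss, epochs=300, lr=0.01):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.937)
    for epoch in range(epochs):
        for imgs, targets in train_loader:       # batches from steps (1)-(2)
            preds = model(imgs)                  # steps (3)-(5): forward pass
            loss = compute_loss(preds, targets)  # bounding-box loss, e.g. GIoU_loss
            optimizer.zero_grad()
            loss.backward()                      # back propagation
            optimizer.step()
    torch.save(model.state_dict(), "human_shape_recognition.pt")  # store the model
```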
The working principle of the invention is as follows: capturing long-distance dependencies in an image greatly improves classification, and the non-local attention mechanism strengthens the long-distance dependency between pixels. Because the input and output of the non-local attention mechanism module have the same size and structure, the module can be combined with any architecture; when improving a deep convolutional neural network, a non-local attention mechanism can be inserted between any two convolutional layers without causing structural conflicts.
The invention therefore adopts the non-local attention mechanism as a classification model component to establish a very large receptive field, using redundant information in the environment to strengthen the convolutional features reasonably and effectively, enhancing useful channels and suppressing useless ones, and thereby improving the detection effect.
To balance real-time performance and accuracy, the improved network model is based on the YOLOv5s network structure, with its Backbone part modified. Because the image is still large after the few convolution operations in the initial stage of the Backbone, capturing long-distance information there is more effective. A non-local attention mechanism is therefore added after each of the two CSP1-3 modules; it captures long-distance dependencies in the image in a feed-forward manner, and the results are passed backwards to the Neck part to further extract feature diversity. The Neck network follows the original YOLOv5s Neck, which adopts the FPN + PAN structure. The FPN (Feature Pyramid Network) transfers feature information of different scales from top to bottom, fusing low-resolution and high-resolution features to enable prediction on images of different sizes. The PAN (Path Aggregation Network) transfers feature information from bottom to top, further enhancing feature extraction; combining the two effectively improves accuracy.
The invention has the advantages that the network model is improved without increasing computational cost, inheriting the speed advantage of the YOLOv5 algorithm. By exploiting the non-local attention property, the target detection process is no longer limited to local receptive fields but uses global information, which strengthens feature fusion and effectively improves recognition accuracy while preserving real-time performance.
[Description of Drawings]:
FIG. 1 is a schematic diagram of a network model of a human form recognition system based on a non-local attention mechanism according to the present invention.
FIG. 2 is a schematic structural diagram of the improved Backbone network part of the humanoid recognition system based on a non-local attention mechanism according to the present invention (FIG. 2-a: the improved Backbone network unit; FIG. 2-b: the CBLSP1_1 module; FIG. 2-c: the CBLSP1_3 module I; FIG. 2-d: the CBLSP1_3 module II; FIG. 2-e: the Focus focusing module; FIG. 2-f: the SPP module; FIG. 2-g: the CBL unit).
Fig. 3 is a schematic diagram illustrating the operation of the Focus module slicing in the human form recognition system based on the non-local attention mechanism according to the present invention.
Fig. 4 is a schematic structural diagram of a non-local attention mechanism in a human form recognition system based on the non-local attention mechanism according to the present invention.
FIG. 5 is a schematic diagram of a Neck network feature fusion process of a human shape recognition method based on a non-local attention mechanism.
Fig. 6 is a schematic overall flow chart of a human form recognition method based on a non-local attention mechanism according to the present invention.
[Embodiments]:
the present invention will be described in detail with reference to specific examples. It should be understood that the examples are for illustration only and are not intended to limit the scope of the invention. After reading the detailed steps and related content of the invention, those skilled in the art can make various modifications and changes based on the technical disclosure, and their equivalents also fall within the scope of the claims appended to this application.
A humanoid recognition system based on a non-local attention mechanism comprises a full convolutional neural network model based on YOLOv5s, composed of an input end, a Backbone network unit, a Neck network unit and a Head network unit; the structure is shown in FIG. 1. The humanoid recognition system is characterized in that the Backbone network unit is an improved Backbone structure to which a non-local attention mechanism is added to enhance feature extraction; it is composed of a Focus focusing module, a CBLSP1_1 module, a CBLSP1_3 module, a CBL unit XI, an SPP module and a non-local attention mechanism module, with the structure shown in FIG. 2-a.
The number of CBLSP1_3 modules is two, namely CBLSP1_3 module I and CBLSP1_3 module II; the number of non-local attention mechanism modules is two, namely non-local attention mechanism module I and non-local attention mechanism module II. The Focus focusing module, the CBLSP1_1 module, the CBLSP1_3 module I and the non-local attention mechanism module I are connected in series; the input end of the Focus focusing module receives the original image signal to be classified, and the output end outputs the extracted image features, which serve as the first features of the image. The CBLSP1_3 module II and the non-local attention mechanism module II are connected in series; the input end receives the first feature signal of the image, and the output end outputs the weighted image features, which serve as the second features of the image. The CBL unit XI and the SPP module are connected in series; the input end of the CBL unit XI receives the second feature signal of the image, and the output is the image feature obtained after the second weighting processing, which serves as the third feature of the image, as shown in FIG. 2-a.
The CBLSP1_1 module is based on a cross-stage partial network (CSP) structure and, as shown in FIG. 2-b, is composed of a CBL unit II, a CBL unit III, 1 residual component, a convolutional layer I, a convolutional layer II, a tensor splicing unit II, a batch normalization unit I, an activation function unit I and a CBL unit IV. The input end of the CBL unit II receives the feature map extracted by the Focus focusing module, and its output end is connected to the CBL unit III and the convolutional layer II respectively; the output end of the CBL unit III is connected in sequence to the residual component, the convolutional layer I and the input end of the tensor splicing unit II; the output end of the convolutional layer II is connected to the input end of the tensor splicing unit II. The tensor splicing unit II splices the image features output by the convolutional layer I and the convolutional layer II, and its output end is connected in sequence to the batch normalization unit I, the activation function unit I and the CBL unit IV; the output end of the CBL unit IV outputs the feature map extracted by the CBLSP1_1 module to the CBLSP1_3 module I.
The CBLSP1_3 module I is based on a cross-stage partial network (CSP) structure, as shown in FIG. 2-c. It consists of a CBL unit V, a CBL unit VI, 3 residual component units, a convolutional layer III, a convolutional layer IV, a tensor splicing unit III, a batch normalization unit II, an activation function unit II and a CBL unit VII. The input end of the CBLSP1_3 module I is the input end of the CBL unit V, which receives the feature map output by the CBLSP1_1 module; the output end of the CBL unit V is connected to the CBL unit VI and the convolutional layer IV respectively. The output end of the CBL unit VI is connected in series to the three residual components and the convolutional layer III. The input end of the tensor splicing unit III receives the image feature information output by the convolutional layer III and the convolutional layer IV and splices it; the output end of the tensor splicing unit III is connected in sequence to the batch normalization unit II, the activation function unit II and the CBL unit VII, and finally the feature map further extracted by the CBLSP1_3 module I is output to the non-local attention mechanism module I.
The CBLSP1_3 module II is an independent module with the same structure and data processing as the CBLSP1_3 module I; the structure is shown in FIG. 2-d. It is composed of a CBL unit VIII, a CBL unit IX, 3 residual component units, a convolutional layer V, a convolutional layer VI, a tensor splicing unit IV, a batch normalization unit III, an activation function unit III and a CBL unit X. The input end of the CBLSP1_3 module II is the input end of the CBL unit VIII, which receives the first features output by the non-local attention mechanism module I; the output end of the CBL unit VIII is connected to the CBL unit IX and the convolutional layer VI respectively. The output end of the CBL unit IX is connected in series to the three residual components and the convolutional layer V. The input end of the tensor splicing unit IV receives the image feature information output by the convolutional layer V and the convolutional layer VI and splices it; the output end of the tensor splicing unit IV is connected in sequence to the batch normalization unit III, the activation function unit III and the CBL unit X, and finally the first features further extracted by the CBLSP1_3 module II are output to the input end of the non-local attention mechanism module II.
The Focus focusing module is formed by connecting a slice processing unit, a tensor splicing unit I and a CBL unit I in sequence; compared with a conventional convolution operation, it retains more complete picture information. The structure is shown in FIG. 2-e.
The SPP module is composed of a CBL unit XII, a maximum pooling unit, a tensor splicing unit V and a CBL unit XIII, with the structure shown in FIG. 2-f. The input end of the SPP module, i.e., the input end of the CBL unit XII, is connected to the output end of the CBL unit XI; it receives the feature map that the CBL unit XI outputs after downsampling the second features and further extracts the third features. The output end of the CBL unit XII is connected to the maximum pooling unit, the tensor splicing unit V and the CBL unit XIII respectively. The input end of the tensor splicing unit V simultaneously receives four feature maps, the output of the CBL unit XII and the three outputs of the maximum pooling unit, and splices them; the output end of the tensor splicing unit V is connected to the CBL unit XIII. Finally the third features are output from the output end of the CBL unit XIII to the Neck network.
The maximum pooling unit is composed of three maximum pooling modules, which collect features from the feature map with three maximum pooling sizes and process them into feature maps of the same size and dimension.
The CBL units I to XIII are all formed by connecting a convolutional layer, a batch normalization layer and an activation function layer in sequence; they serve as the minimum components of the YOLOv5s network structure and complete feature extraction on feature maps through convolution operations. The structure is shown in FIG. 2-g. The convolutional layers in the CBL units I, II, V, VIII and XI use a 3 × 3 convolution structure for downsampling and feature extraction; the convolutional layers I to VI and the convolutional layers in the CBL units III, IV, VI, VII, IX, X, XII and XIII all use 1 × 1 convolution, which unifies the channel number of the feature maps without changing their size and thereby facilitates feature map splicing.
In the embodiments described below, the data acquisition and model construction process of the human shape recognition network model is detailed first, followed by the training process of the human shape recognition method based on a non-local attention mechanism, which produces human shape recognition results for humanoid pictures.
The human shape recognition method flow is shown in FIG. 6. Before training the model, data preparation is carried out as follows:
(1) Select market, street and restaurant scenes, take live-action shots of them, and build a picture set describing the scenes (image acquisition, see FIG. 6).
(2) Examine and preprocess the picture set: delete pictures that lack humanoid targets, crop some pictures with excessive background information, and retain 10,000 pictures containing humanoid targets with a resolution above 608 × 608 (image acquisition, see FIG. 6).
(3) Manually annotate the picture set obtained in step (2) with the labelimg tool to produce the human shape recognition data set. The annotation process requires the humanoid targets in the pictures to cover different genders, ages, clothes, postures and angles, and every humanoid target in the data set must be annotated. A picture-set file with the annotation information is generated automatically, containing the names, paths and labels of the corresponding pictures and the corresponding humanoid bounding boxes (image annotation, see FIG. 6).
(4) Divide the annotated picture set into a training set and a test set at a ratio of 4:1, name them respectively, and put them together with the annotated picture-set file into the data set folder for training.
After the data set is prepared, the human shape recognition network model is obtained by training on the training set under the PyTorch framework according to the following steps; the network model is constructed as shown in FIG. 6.
(1) Acquire an original picture to be identified from the humanoid picture data set folder, and perform data enhancement, adaptive bounding box calculation and picture scaling on the picture at the input end of the YOLOv5s-based full convolutional neural network model.
(1-1) Perform data enhancement on the original picture to be recognized, namely: randomly read four pictures from the data set each time, shrink each by a random proportion of 10-100%, and splice the four shrunken pictures into a new picture to be trained, which increases the number of humanoid targets smaller than full-picture size and enriches the variety of the data set.
(1-2) Perform adaptive bounding box calculation on the picture enhanced in step (1-1), namely: first, in the YOLO algorithm, an initial bounding box is set according to the aspect ratio and size of the humanoid targets in the data set; then, during training, the humanoid recognition network outputs predicted bounding boxes from the initial bounding box, and the difference between the predicted and annotated bounding boxes is back-propagated to update the bounding box adaptively.
(1-3) Uniformly scale the pictures to be trained for training, namely: given a training picture size of 608 × 608, divide the given length and width by the length and width of the picture to be trained to obtain the length and width scaling coefficients respectively; compare the two coefficients and multiply the length and width of the picture to be trained by the smaller one to obtain the scaled size. The picture to be trained is thus shrunk until its longer side equals the given training picture size while its shorter side is smaller; subtracting the short side from the given size yields the blank size, which is divided by 2 to obtain the padding for each end of the short side, filled with the RGB color (114, 114, 114).
(2) Input the picture to be recognized processed in step (1) as an input signal to the Focus focusing module in the Backbone network unit, preliminarily extracting complete picture information; model training is shown in FIG. 6.
(2-1) The input end of the slice processing unit receives the picture to be recognized of the given size obtained in step (1) and performs a slicing operation on it; the slicing process is shown in FIG. 3. Four complementary pictures are obtained by taking pixel information at alternating positions, namely: adjacent pixels in the picture to be recognized are extracted one by one and assigned to different slice pictures, so that the four pixels in any 2 × 2 pixel region of the picture to be recognized finally belong to four different slice pictures; this yields four mutually distinct, complementary slice pictures, each with a quarter of the area, that together contain all the information of the picture to be recognized.
(2-2) The tensor splicing unit I in the Focus focusing module splices the four complementary slice pictures processed in step (2-1) to obtain a feature map whose length and width are halved and whose channel number is quadrupled relative to the picture to be identified.
(2-3) The CBL unit I processes the feature map obtained in step (2-2): the convolutional layer performs one convolution operation on the feature map to obtain a twice-downsampled feature map of size 304 × 304 without information loss, the batch normalization layer optimizes the feature distribution using the mean and variance of the data, and finally the activation function layer completes the nonlinear mapping of the feature map.
The activation function layer uses the Leaky ReLU function to assign a non-zero slope a to all negative valuesiThe positive value is kept unchanged, so that the nonlinear characteristic is introduced, and the Leaky ReLU function expression is shown as 2-1:
Figure BDA0003173790770000201
wherein is yiOutput signal, xiIs an input signal, with a non-zero slope aiThe value range of (1, + ∞).
(3) The feature map extracted in step (2) is sent to the subsequent modules of the Backbone network unit for deep feature extraction, yielding the first, second and third features of the image respectively; the structure is shown in FIG. 2-a;
(3-1) sending the feature map processed in step (2) into the CBLSP1_1 module, the CBLSP1_3 module I and the non-local attention mechanism module I to extract the first features, of size 76 × 76;
(3-2) sending the first features into the CBLSP1_3 module II and the non-local attention mechanism module II for weighting processing to extract the second features, of size 38 × 38;
the non-local attention mechanism module I and the non-local attention mechanism module II have the same structure: using products across channels they establish a very large receptive field, capture global dependencies, and extract whole-image features by learning a weight distribution; the structure is shown in FIG. 4. Both represent similarity with an embedded Gaussian function, using the two embeddings of formulas (3-1) and (3-2):

θ(xi) = Wθ xi (3-1)

φ(xj) = Wφ xj (3-2)

The embedded Gaussian function then takes the form of formula (3-3):

f(xi, xj) = e^(θ(xi)^T φ(xj)) (3-3)

where x is the input signal, i is the index of the output position in space whose response is computed, j indexes all possible positions, Wθ and Wφ are weight matrices learned by 1 × 1 convolution, θ and φ denote the two feature embeddings, f, the embedded Gaussian function, expresses the similarity of the signal x between i and all j, e is the natural constant, and T denotes matrix transposition;

the final output signal yi is shown in formula (3-4):

yi = (1 / C(x)) Σ∀j f(xi, xj) g(xj) (3-4)

where g(xj) = Wg xj can be regarded as a linear embedding, Wg being a weight matrix obtained by 1 × 1 convolution, and the normalization factor

C(x) = Σ∀j f(xi, xj)

is the sum of the similarity f over all positions j.
The structure of the non-local attention mechanism module I and the non-local attention mechanism module II is shown in FIG. 4; the specific steps of their feature processing are as follows:
(3-1-1) linearly mapping the input feature map X of [c, h, w] structure with 1 × 1 convolutions to obtain the features φ, θ and g with half the number of channels;
(3-1-2) computing attention as the inner product of the feature φ and the transposed, rearranged feature θ using the embedded Gaussian function, and normalizing to obtain the attention coefficients;
(3-1-3) applying the attention coefficients to the feature g, then expanding the channel number back with a 1 × 1 convolution;
(3-1-4) adding the original input feature map X to obtain output features with global information.
(3-3) The second features are sent into the CBL unit XI and the SPP module for weighting processing, and the third features, of size 19 × 19, are extracted.
(4) The Neck network receives the three image features extracted in step (3) and fuses them by tensor splicing, improving the diversity and robustness of the image features and yielding three feature maps, recorded as feature map I, feature map II and feature map III; the structure is shown in FIG. 5;
(4-1) The third features extracted in step (3-3) are enlarged by the up-sampling unit I and spliced with the second features by the tensor splicing unit VI; the splicing result is enlarged by the up-sampling unit II and spliced with the first features extracted in step (3-1) by the tensor splicing unit VII; the number of channels is then reduced to 6 by the convolutional layer VII to obtain feature map I, of size 76 × 76; the convolution size of the convolutional layer VII is 1 × 1.
(4-2) The splicing result of the second features and the up-sampled third features is spliced, by the tensor splicing unit VIII, with feature map I after size reduction by the convolutional layer VIII; the number of channels is then reduced to 6 by the convolutional layer IX to obtain feature map II, of size 38 × 38; the convolution size of the convolutional layer VIII is 3 × 3 and that of the convolutional layer IX is 1 × 1.
(4-3) The third features extracted in step (3-3) are spliced, by the tensor splicing unit IX, with feature map II after size reduction by the convolutional layer X (convolution size 3 × 3); the number of channels is then reduced to 6 by the convolutional layer XI to obtain feature map III, of size 19 × 19; the convolution size of the convolutional layer XI is 1 × 1.
(5) The Head network receives the three feature maps with channel number 6 obtained in step (4) and performs tensor splicing on them to obtain the dimension matrices of the spliced feature maps; all humanoid targets identified by the model are then determined by calculating the bounding-box loss and applying non-maximum suppression, and bounding boxes are drawn from the humanoid coordinates contained in the dimension matrices; finally, pictures marked with all humanoid bounding boxes are output, and at the same time the pictures marked by the model are back-propagated to the input end for adaptive bounding-box calculation, updating the predicted bounding-box positions for the next round of training.
For example, the feature maps are spliced as shown in formula (5-1):
76×76×3×6+38×38×3×6+19×19×3×6=22743×6 (5-1)
where 76 × 76 is the size of feature map I, 3 is the number of RGB channels, and 6 is the length of the dimension matrix; 38 × 38 is the size of feature map II and 19 × 19 the size of feature map III; after splicing, 22743 dimension matrices of length 6 are finally formed;
The dimension matrix is the representation form of the prediction result; a single dimension matrix has the form shown in formula (5-2):
[tx ty tw th n p] (5-2)
where the first four entries are the bounding-box coordinates predicted by the network: tx and ty represent the center coordinates of the humanoid bounding box, and tw and th represent its width and height; the fifth entry n represents the bounding-box confidence; the last entry p represents the humanoid probability.
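The count in formula (5-1) can be checked directly; the snippet below is only an arithmetic verification of the splicing, with 3 predictions per grid position as stated above.

```python
# 76*76*3 + 38*38*3 + 19*19*3 = 22743 dimension matrices,
# each a length-6 vector [tx, ty, tw, th, n, p]
total = sum(s * s * 3 for s in (76, 38, 19))
print(total)  # 22743
```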
The loss calculated in step (5) adopts GIoU_loss as the bounding-box loss function. The GIoU_loss computation is shown in formula (5-3):
LGIoU = 1 - IoU + (Ac - B)/Ac (5-3)
where LGIoU is the GIoU_loss value, Ac represents the minimum circumscribed rectangle area of the two windows, B represents the total area of the two windows, and IoU represents the overlap degree of the two windows, i.e. the ratio of their overlapping area to their total area; all areas are calculated from the bounding-box information contained in the dimension matrices.
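A minimal sketch of formula (5-3) for axis-aligned boxes follows; the corner-coordinate box format and the helper name are assumptions for illustration.

```python
def giou_loss(box1, box2):
    """L_GIoU = 1 - IoU + (A_c - B) / A_c for boxes (x1, y1, x2, y2)."""
    # overlapping area of the two windows
    ix1, iy1 = max(box1[0], box2[0]), max(box1[1], box2[1])
    ix2, iy2 = min(box1[2], box2[2]), min(box1[3], box2[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union = area1 + area2 - inter              # B: total area of the two windows
    iou = inter / union                        # overlap degree
    # A_c: minimum circumscribed rectangle of the two windows
    cx1, cy1 = min(box1[0], box2[0]), min(box1[1], box2[1])
    cx2, cy2 = max(box1[2], box2[2]), max(box1[3], box2[3])
    ac = (cx2 - cx1) * (cy2 - cy1)
    return 1.0 - iou + (ac - union) / ac
```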
The non-maximum suppression method in step (5) determines a window from the humanoid coordinates contained in each dimension matrix while comparing the confidences of the dimension matrices, finally determining the highest-confidence window at every local position; the specific processing procedure comprises the following steps:
(5-1) first, sort the dimension matrices by their confidence and select the window contained in the dimension matrix with the maximum confidence, called the maximum-confidence window;
(5-2) select a first adjacent window next to the selected maximum-confidence window and calculate the overlap degree between the maximum-confidence window and this adjacent window; the overlap degree is the ratio of the overlapping area of the two windows to their total area;
(5-3) compare the overlap degree with a set overlap threshold: if it exceeds the threshold, the two windows overlap too much, meaning they mark the same humanoid target, and the adjacent window is excluded; if it is below the threshold, the overlap is low, meaning they mark different humanoid targets, and the adjacent window is temporarily retained;
(5-4) select a second adjacent window, calculate and compare its overlap degree with the maximum-confidence window, and repeat steps (5-2) and (5-3) until all remaining windows have been compared and every window marking the same humanoid target has been eliminated; the maximum-confidence window is then selected as the first humanoid target, so that only one window is retained per humanoid target;
(5-5) from the windows temporarily retained after the above steps, select the window with the second-highest confidence and repeat steps (5-2), (5-3) and (5-4) until the second humanoid target is selected;
(5-6) repeat step (5-5) until all humanoid targets are found.
The window overlap threshold in step (5) is taken as 0.5.
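The greedy procedure of steps (5-1) to (5-6) can be sketched as follows; each detection is assumed to be a tuple (x1, y1, x2, y2, confidence), and overlap is the same area ratio defined in step (5-2).

```python
def overlap(a, b):
    """Overlap degree: overlapping area over total (union) area."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(dets, thresh=0.5):
    """Steps (5-1)-(5-6): keep one window per humanoid target."""
    dets = sorted(dets, key=lambda d: d[4], reverse=True)  # (5-1) sort by confidence
    kept = []
    while dets:
        best = dets.pop(0)                  # current maximum-confidence window
        kept.append(best)                   # selected as the next humanoid target
        # (5-2)-(5-4): exclude neighbours overlapping the best window too much
        dets = [d for d in dets if overlap(best, d) < thresh]
    return kept                             # (5-5)-(5-6) handled by the loop
```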
(6) Steps (1) to (5) are executed cyclically to obtain the trained human shape recognition network model, which is then saved.
After the human shape recognition network model trained with the non-local attention mechanism is obtained, the model can be called in engineering practice; by inputting pictures or videos, the human shape recognition results marked by the model are obtained. Its use in engineering is shown in FIG. 6.

Claims (10)

1. A humanoid recognition system based on a non-local attention mechanism, comprising a full convolutional neural network model based on YOLOv5s composed of an input end, a Backbone network unit, a Neck network unit and a Head network unit, characterized in that the Backbone network unit is an improved Backbone structure to which a non-local attention mechanism is added to enhance the feature extraction effect, and is composed of a Focus focusing module, a CBLSP1_1 module, a CBLSP1_3 module, a CBL unit XI, an SPP module and a non-local attention mechanism module.
2. The human form recognition system based on the non-local attention mechanism as claimed in claim 1, wherein there are two CBLSP1_3 modules, CBLSP1_3 module I and CBLSP1_3 module II, and two non-local attention mechanism modules, non-local attention mechanism module I and non-local attention mechanism module II; the Focus module, the CBLSP1_1 module, the CBLSP1_3 module I and the non-local attention mechanism module I are connected in series by data connection: the input end of the Focus module receives the original image signal to be classified, and the output end of this chain outputs the extracted image features as the first features of the image; the CBLSP1_3 module II and the non-local attention mechanism module II are connected in series by data connection: the input end receives the first feature signal of the image, and the output end outputs the weighted image features as the second features of the image; the CBL unit XI and the SPP module are connected in series by data connection: the input end receives the second feature signal of the image, and the output is the image features after the second weighting processing, which serve as the third features of the image.
3. The humanoid recognition system based on the non-local attention mechanism of claim 2, characterized in that the CBLSP1_1 module is based on a cross-stage local network structure and is composed of a CBL unit II, a CBL unit III, 1 residual component, a convolutional layer I, a convolutional layer II, a tensor splicing unit II, a batch normalization unit I, an activation function unit I and a CBL unit IV; the input end of the CBL unit II receives the feature map extracted by the Focus focusing module, and the output end of the CBL unit II is connected with the CBL unit III and the convolutional layer II respectively; the output end of the CBL unit III is sequentially connected with the input ends of the residual component, the convolutional layer I and the tensor splicing unit II; the output end of the convolutional layer II is connected with the input end of the tensor splicing unit II; the tensor splicing unit II splices the image features output by the convolutional layer I and the convolutional layer II, and its output end is sequentially connected with the batch normalization unit I, the activation function unit I and the CBL unit IV; the output end of the CBL unit IV outputs the feature map extracted by the CBLSP1_1 module to the CBLSP1_3 module I;
the CBLSP1_3 module I is based on a cross-stage local network structure; it is composed of a CBL unit V, a CBL unit VI, 3 residual component units, a convolutional layer III, a convolutional layer IV, a tensor splicing unit III, a batch normalization unit II, an activation function unit II and a CBL unit VII; the input end of the CBLSP1_3 module I is the input end of the CBL unit V and receives the feature map output by the CBLSP1_1 module, and the output end of the CBL unit V is connected with the CBL unit VI and the convolutional layer IV respectively; the output end of the CBL unit VI is sequentially connected in series with the three residual components and the convolutional layer III; the input end of the tensor splicing unit III receives and splices the image feature information output by the convolutional layer III and the convolutional layer IV, its output end is sequentially connected with the batch normalization unit II, the activation function unit II and the CBL unit VII, and the feature map further extracted by the CBLSP1_3 module I is finally output to the non-local attention mechanism module I;
the CBLSP1_3 module II is an independent module with the same structure and data processing process as the CBLSP1_3 module I, composed of a CBL unit VIII, a CBL unit IX, 3 residual component units, a convolutional layer V, a convolutional layer VI, a tensor splicing unit IV, a batch normalization unit III, an activation function unit III and a CBL unit X; the input end of the CBLSP1_3 module II is the input end of the CBL unit VIII and receives the first features output by the non-local attention mechanism module I, and the output end of the CBL unit VIII is connected with the CBL unit IX and the convolutional layer VI respectively; the output end of the CBL unit IX is sequentially connected in series with the three residual components and the convolutional layer V; the input end of the tensor splicing unit IV receives and splices the image feature information output by the convolutional layer V and the convolutional layer VI, its output end is sequentially connected with the batch normalization unit III, the activation function unit III and the CBL unit X, and the first features further extracted by the CBLSP1_3 module II are finally output to the input end of the non-local attention mechanism module II.
4. The human form recognition system based on the non-local attention mechanism according to claim 3, wherein the Focus focusing module is formed by sequentially connecting a slice processing unit, a tensor splicing unit I and a CBL unit I; compared with a conventional convolution operation, this module retains more complete picture information;
the SPP module is composed of a CBL unit XII, a maximum pooling unit, a tensor splicing unit V and a CBL unit XIII; the input end of the SPP module, namely the input end of the CBL unit XII, is connected with the output end of the CBL unit XI and receives the feature map output by the CBL unit XI after down-sampling the second features, from which the third features are further extracted; the output end of the CBL unit XII is connected with the maximum pooling unit, the tensor splicing unit V and the CBL unit XIII respectively; the input end of the tensor splicing unit V simultaneously receives and splices the four feature maps output by the maximum pooling unit, and its output end is connected with the CBL unit XIII; the third features are finally output to the Neck network from the output end of the CBL unit XIII.
5. The human form recognition system based on the non-local attention mechanism as claimed in claim 4, wherein the maximum pooling unit is composed of three maximum pooling modules and collects features from the feature map using three maximum pooling sizes, processing them into feature maps of the same size and dimension;
the CBL unit I, the CBL unit II, the CBL unit III, the CBL unit IV, the CBL unit V, the CBL unit VI, the CBL unit VII, the CBL unit VIII, the CBL unit IX, the CBL unit X, the CBL unit XI, the CBL unit XII and the CBL unit XIII are all formed by sequentially connecting a convolutional layer, a batch normalization layer and an activation function layer; they serve as the minimum components in the YOLOv5s network structure and complete feature extraction on the feature map through convolution operations; the convolutional layers in the CBL unit I, the CBL unit II, the CBL unit V, the CBL unit VIII and the CBL unit XI use a 3 × 3 convolution structure for down-sampling and feature extraction; the convolutional layer I, the convolutional layer II, the convolutional layer III, the convolutional layer IV, the convolutional layer V and the convolutional layer VI, together with the convolutional layers in the CBL unit III, the CBL unit IV, the CBL unit VI, the CBL unit VII, the CBL unit IX, the CBL unit X, the CBL unit XII and the CBL unit XIII, all use 1 × 1 convolution, so that the channel number of the feature map is unified without changing its size, facilitating feature map splicing.
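As a non-limiting sketch of claims 4 and 5, the SPP module below chains a CBL, three parallel max-pooling branches whose outputs keep the same size and dimension, tensor splicing of the four resulting feature maps, and a closing CBL; the pooling kernel sizes 5/9/13 follow common YOLOv5 practice and are an assumption, since the claims only state that three pooling sizes are used.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    def __init__(self, c_in, c_out, kernels=(5, 9, 13)):
        super().__init__()
        half = c_in // 2
        # CBL unit XII: convolution + batch normalization + activation
        self.cbl1 = nn.Sequential(nn.Conv2d(c_in, half, 1),
                                  nn.BatchNorm2d(half), nn.LeakyReLU(0.1))
        # stride 1 with padding keeps every pooled map at the input size
        self.pools = nn.ModuleList(nn.MaxPool2d(k, stride=1, padding=k // 2)
                                   for k in kernels)
        # CBL unit XIII after tensor splicing unit V
        self.cbl2 = nn.Sequential(nn.Conv2d(half * 4, c_out, 1),
                                  nn.BatchNorm2d(c_out), nn.LeakyReLU(0.1))

    def forward(self, x):
        x = self.cbl1(x)
        spliced = torch.cat([x] + [p(x) for p in self.pools], dim=1)
        return self.cbl2(spliced)
```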
6. A human shape recognition method based on a non-local attention mechanism is characterized by comprising the following steps:
(1) acquiring an original picture to be identified, and performing data enhancement, adaptive bounding-box calculation and picture scaling operations on the picture at the input end of a full convolutional neural network model based on YOLOv5s;
(2) inputting the picture to be identified processed in step (1) as an input signal to the Focus module in the Backbone network unit, and preliminarily extracting complete picture information;
(3) sending the feature map extracted in step (2) to the subsequent modules of the Backbone network unit for deep feature extraction, obtaining the first, second and third features of the image respectively;
(4) receiving the three image characteristics extracted in the step (3) by a Neck network, performing characteristic fusion on the three image characteristics by using a tensor splicing method, improving the diversity and robustness of the image characteristics, obtaining three characteristic images, and respectively recording the three characteristic images as a characteristic image I, a characteristic image II and a characteristic image III;
(5) the Head network receives the three feature maps with channel number 6 obtained in step (4) and performs tensor splicing on them to obtain the dimension matrices of the spliced feature maps; all humanoid targets identified by the model are then determined by calculating the bounding-box loss and applying non-maximum suppression, and bounding boxes are drawn from the humanoid coordinates contained in the dimension matrices; finally, pictures marked with all humanoid bounding boxes are output, and at the same time the pictures marked by the model are back-propagated to the input end for adaptive bounding-box calculation, updating the predicted bounding-box positions for the next round of training;
(6) and (5) circularly executing the steps (1) to (5) to obtain a trained human shape recognition network model, and storing the model.
7. The human shape recognition method based on the non-local attention mechanism as claimed in claim 6, wherein the original picture to be recognized in step (1) comes from a humanoid picture data set composed of pictures and humanoid target labeling information; the pictures are obtained by live-action shooting of scenes to be measured selected by the user, with no fewer than 10000 pictures required, each with a resolution greater than 608 × 608;
the specific processing procedure of step (1) comprises the following steps:
(1-1) performing data enhancement on the original picture to be recognized, namely: randomly reading four pictures from the data set each time, reducing each by a random proportion of 10%-100%, and splicing the four reduced pictures into a new picture to be trained; this method enriches the number of humanoid targets in the pictures;
(1-2) performing adaptive bounding box calculation on the picture enhanced in the step (1-1), namely: firstly, in a YOLO algorithm, setting an initial boundary box according to the length-width ratio and the size of a humanoid target in a data set, then outputting a prediction boundary box by a humanoid recognition network according to the initial boundary box in a training process, and performing back propagation by using the difference between the prediction boundary box and a labeling boundary box to realize the self-adaptive updating of the boundary box;
(1-3) uniformly scaling the pictures to be trained obtained in step (1-1) for training, namely: given the length and width of the training picture, divide the given length and width by the length and width of the picture to be trained to obtain the length and width scaling coefficients respectively; compare the two coefficients and multiply the length and width of the picture to be trained by the smaller one to obtain the scaled size; at this point the longer side of the picture equals the given training-picture size while the shorter side is smaller; subtract the short-side size from the given size to obtain the blank size, divide it by 2 to obtain the size to be filled at each end of the short side, and fill with RGB color, as sketched after this claim;
the given size of the training picture in said step (1-3) is 608 × 608; the parameters of the RGB colors used are (114, 114, 114).
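A minimal sketch of the scaling-and-filling procedure of step (1-3) follows, using OpenCV; the function name and the use of cv2 are illustrative assumptions.

```python
import cv2

def letterbox(img, target=608, pad_color=(114, 114, 114)):
    """Scale by the smaller coefficient, then pad the short side equally."""
    h, w = img.shape[:2]
    r = min(target / h, target / w)       # the smaller scaling coefficient
    nh, nw = round(h * r), round(w * r)   # long side now equals the given size
    img = cv2.resize(img, (nw, nh))
    dh, dw = target - nh, target - nw     # blank size on the short side
    top, bottom = dh // 2, dh - dh // 2   # half filled at each end
    left, right = dw // 2, dw - dw // 2
    return cv2.copyMakeBorder(img, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=pad_color)
```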
8. The human form recognition method based on the non-local attention mechanism according to claim 6, wherein the specific processing procedure of the step (2) comprises the following steps:
(2-1) the input end of the slice processing unit receives the picture to be recognized of the given size obtained in step (1) and performs the slicing operation on it, specifically: four complementary pictures are obtained by taking pixel information at alternating pixel positions, namely: adjacent pixels in the picture to be recognized are extracted one by one and assigned to different slice pictures, so that the four pixels in any 2 × 2 pixel region of the picture to be recognized belong to four different slice pictures; this yields four mutually distinct complementary slice pictures, each with one quarter of the area, which together contain all the information of the picture to be recognized;
(2-2) the tensor splicing unit I in the Focus focusing module splices the four complementary slice pictures processed in step (2-1), obtaining a feature map whose length and width are reduced to half of the picture to be recognized and whose channel number is expanded to four times that of the picture to be recognized;
(2-3) the CBL unit I processes the feature map obtained in step (2-2): the convolutional layer performs one convolution operation on the feature map to obtain a 2× downsampled feature map without information loss, the batch normalization layer optimizes the feature distribution using the mean and variance of the data, and finally the activation function layer completes the nonlinear mapping of the feature map;
the activation function layer uses a Leaky ReLU function, which assigns a non-zero slope ai to all negative values while keeping positive values unchanged, thereby introducing nonlinearity; the Leaky ReLU expression is shown in formula (2-1):
yi = xi, if xi ≥ 0; yi = xi/ai, if xi < 0 (2-1)
where yi is the output signal, xi is the input signal, the non-zero slope ai takes values in (1, +∞), and the feature map size after this processing is 304 × 304.
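Steps (2-1) to (2-3) can be summarized in the following non-limiting sketch; the output channel count and the negative slope 0.1 (i.e. ai = 10, within (1, +∞)) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    def __init__(self, c_in=3, c_out=32):
        super().__init__()
        # CBL unit I: convolution + batch normalization + Leaky ReLU
        self.cbl = nn.Sequential(
            nn.Conv2d(c_in * 4, c_out, 3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1))            # y = x / 10 for x < 0

    def forward(self, x):
        # (2-1) four complementary slices: each 2x2 pixel region contributes
        # one pixel to each slice, halving height and width
        s = [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        # (2-2) tensor splicing: 608x608x3 -> 304x304x12
        # (2-3) CBL processing of the spliced feature map
        return self.cbl(torch.cat(s, dim=1))
```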
9. The human form recognition method based on the non-local attention mechanism according to claim 6, wherein the specific processing procedure of the step (3) comprises the following steps:
(3-1) sending the feature map processed in the step (2) to a CBLSP1_1 module, a CBLSP1_3 module I and a non-local attention mechanism module I to extract a first feature;
(3-2) sending the first features into a CBLSP1_3 module II and a non-local attention mechanism module II for weighting processing to extract second features;
the non-local attention mechanism module I in step (3-1) and the non-local attention mechanism module II in step (3-2) have the same structure: they establish an extremely large receptive field through channel-wise products, capture global dependencies, and extract whole-image features by learning the weight distribution; both use an embedded Gaussian function to represent similarity, with the two embeddings given by formulas (3-1) and (3-2):
θ(xi)=Wθxi (3-1)
φ(xj)=Wφxj (3-2)
The embedded Gaussian function then takes the form of formula (3-3):
f(xi, xj) = e^(θ(xi)T·φ(xj)) (3-3)
where x is the input signal, i is the index of the output position whose response is being computed, j indexes all possible positions, Wθ and Wφ are weight matrices learned by 1 × 1 convolution, θ and φ denote the two feature embeddings, f is the embedded Gaussian function measuring the similarity of the signal x between position i and all positions j, e is the natural constant, and T denotes matrix transposition;
the final output signal yi is given by formula (3-4):
yi = (1/C(x))·Σj f(xi, xj)·Wgxj (3-4)
where Wgxj can be regarded as a linear embedding, Wg is a weight matrix obtained by 1 × 1 convolution, and the normalization factor C(x) = Σj f(xi, xj) represents the sum of the similarity f over all positions j.
The specific steps of the feature processing of the non-local attention mechanism module I and the non-local attention mechanism module II are as follows:
(3-1-1) apply 1 × 1 convolutions to the input feature map X of shape [c, h, w] as linear mappings, obtaining features φ, θ and g with the channel number reduced by half;
(3-1-2) compute attention with the embedded Gaussian function as the inner product of feature φ and the transposed, rearranged feature θ, and normalize to obtain the attention coefficients;
(3-1-3) apply the attention coefficients to feature g, then restore the channel number with a 1 × 1 convolution;
(3-1-4) add the original input feature map X to obtain an output feature carrying global information.
(3-3) sending the second features into a CBL unit and an SPP module for weighting processing, and then extracting third features;
in the step (3), the first characteristic size is 76 × 76, the second characteristic size is 38 × 38, and the third characteristic size is 19 × 19.
10. The human form recognition method based on the non-local attention mechanism according to claim 6, wherein the specific processing procedure of the step (4) comprises the following steps:
(4-1) the third feature extracted in step (3-3) is enlarged by upsampling unit I and spliced with the second feature by tensor splicing unit VI; the splicing result is enlarged again by upsampling and spliced with the first feature extracted in step (3-1) by tensor splicing unit VII, and feature map I is then obtained by reducing the channel number through convolutional layer VII;
(4-2) the third feature extracted in step (3-3) is enlarged by upsampling unit II; its splicing result with the second feature is spliced by tensor splicing unit VIII with feature map I after size reduction by convolutional layer VIII, and feature map II is then obtained by reducing the channel number through convolutional layer IX;
(4-3) the third feature extracted in step (3-3) is spliced by tensor splicing unit IX with the feature map reduced in size by convolutional layer X, and feature map III is then obtained by reducing the channel number through convolutional layer XI;
in step (4), feature map I has size 76 × 76, feature map II has size 38 × 38, and feature map III has size 19 × 19; the convolution sizes of convolutional layer VIII and convolutional layer X are 3 × 3, and the convolution sizes of convolutional layer VII, convolutional layer IX and convolutional layer XI are 1 × 1;
the feature map splicing mode of step (5) is shown in formula (5-1):
76×76×3×6+38×38×3×6+19×19×3×6=22743×6 (5-1)
where 76 × 76 is the size of feature map I, 3 is the number of RGB channels, and 6 is the length of the dimension matrix; 38 × 38 is the size of feature map II and 19 × 19 the size of feature map III; after splicing, 22743 dimension matrices of length 6 are finally formed;
the dimension matrix is the representation form of the prediction result, and a single dimension matrix has the form shown in formula (5-2):
[tx ty tw th n p] (5-2)
where the first four entries are the bounding-box coordinates predicted by the network: tx and ty represent the center coordinates of the humanoid bounding box, and tw and th represent its width and height; the fifth entry n represents the bounding-box confidence; the last entry p represents the humanoid probability;
the loss calculated in step (5) adopts GIoU_loss as the bounding-box loss function, and the GIoU_loss computation is shown in formula (5-3):
LGIoU = 1 - IoU + (Ac - B)/Ac (5-3)
where LGIoU is the GIoU_loss value, Ac represents the minimum circumscribed rectangle area of the two windows, B represents the total area of the two windows, and IoU represents the overlap degree of the two windows, i.e. the ratio of their overlapping area to their total area; all areas are calculated from the bounding-box information contained in the dimension matrices;
the non-maximum suppression method in step (5) determines a window from the humanoid coordinates contained in each dimension matrix while comparing the confidences of the dimension matrices, finally determining the highest-confidence window at every local position; the specific processing procedure comprises the following steps:
(5-1) first, sort the dimension matrices by their confidence and select the window contained in the dimension matrix with the maximum confidence, called the maximum-confidence window;
(5-2) select a first adjacent window next to the selected maximum-confidence window and calculate the overlap degree between the maximum-confidence window and this adjacent window; the overlap degree is the ratio of the overlapping area of the two windows to their total area;
(5-3) compare the overlap degree with a set overlap threshold: if it exceeds the threshold, the two windows overlap too much, meaning they mark the same humanoid target, and the adjacent window is excluded; if it is below the threshold, the overlap is low, meaning they mark different humanoid targets, and the adjacent window is temporarily retained;
(5-4) select a second adjacent window, calculate and compare its overlap degree with the maximum-confidence window, and repeat steps (5-2) and (5-3) until all remaining windows have been compared and every window marking the same humanoid target has been eliminated; the maximum-confidence window is then selected as the first humanoid target, so that only one window is retained per humanoid target;
(5-5) from the windows temporarily retained after the above steps, select the window with the second-highest confidence and repeat steps (5-2), (5-3) and (5-4) until the second humanoid target is selected;
(5-6) repeat step (5-5) until all humanoid targets are found;
the window overlap threshold in step (5) is taken as 0.5.