CN116030507A - Electronic equipment and method for identifying whether face in image wears mask - Google Patents

Electronic equipment and method for identifying whether face in image wears mask

Info

Publication number: CN116030507A
Application number: CN202111231621.7A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: feature, local, global, features, carrying
Inventors: 程云飞, 吴风炎, 张希, 衣佳政
Current Assignee / Original Assignee: Hisense Group Holding Co Ltd
Application filed by Hisense Group Holding Co Ltd
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses electronic equipment and a method for identifying whether a face in an image wears a mask. The method comprises the following steps: performing feature extraction on an image to be processed through a backbone network comprising at least two feature extraction layers, identifying whether a face in the image to be processed wears a mask based on N initial feature maps obtained by the extraction, and outputting an identification result. Each feature extraction layer performs feature extraction on a first global feature entering from its own first entrance to obtain a second global feature, performs feature extraction on a first local feature entering from its own second entrance to obtain a second local feature, performs fusion processing on the second local feature based on the second global feature, determines a third global feature output from its own first exit based on the second global feature, and determines a third local feature output from its own second exit based on the fused second local feature, wherein the first exit is connected with the first entrance of the next feature extraction layer, and the second exit is connected with the second entrance of the next feature extraction layer.

Description

Electronic equipment and method for identifying whether face in image wears mask
Technical Field
The application relates to the technical field of image processing, in particular to electronic equipment and a method for identifying whether a mask is worn on a face in an image.
Background
Wearing a mask is an effective means of blocking the transmission chain of an epidemic against the global epidemic background. For this reason, it is necessary in many places to detect whether passers-by wear masks. At present, detection personnel are arranged at each place; the detection personnel observe whether passers-by wear masks and remind those who are found not wearing a mask. The accuracy of this detection mode depends entirely on the detection personnel, and long working hours cause fatigue, so the detection accuracy is difficult to guarantee; moreover, the labor cost of this detection mode is relatively high, and the infection probability of the detection personnel is easily increased.
Disclosure of Invention
The embodiment of the application provides electronic equipment and a method for identifying whether a mask is worn on a face in an image, which are used for solving the problems of low detection accuracy, high detection cost and the like of manually judging whether the mask is worn in the related art.
In a first aspect, an embodiment of the present application provides an electronic device, including:
the communication unit is used for acquiring an image to be processed;
The processor is used for carrying out feature extraction on the image to be processed through a backbone network to obtain N initial feature maps. The backbone network comprises at least two feature extraction layers, each feature extraction layer is provided with a first inlet, a second inlet, a first outlet and a second outlet, the first outlet is connected with the first inlet of the next feature extraction layer, and the second outlet is connected with the second inlet of the next feature extraction layer. Each feature extraction layer is used for carrying out feature extraction on a first global feature entering from its own first inlet to obtain a second global feature, carrying out feature extraction on a first local feature entering from its own second inlet to obtain a second local feature, carrying out fusion processing on the second local feature based on the second global feature, determining a third global feature based on the second global feature, determining a third local feature based on the fused second local feature, outputting the third global feature from its own first outlet, and outputting the third local feature from its own second outlet; the N initial feature maps are N local features selected from the feature extraction layers, wherein N is an integer;
based on the N initial feature images, identifying whether a face in the image to be processed wears a mask or not;
And the output unit is used for outputting the identification result.
In some embodiments, the processor is specifically configured to convert the second local feature into a global feature when performing a fusion process on the second local feature based on the second global feature; performing fusion processing on the second global features based on the global features obtained through conversion; extracting the features of the second global features after the fusion treatment; converting the global features obtained by extraction into local features; and carrying out fusion processing on the second local feature based on the local feature obtained by conversion.
In some embodiments, when identifying whether the face in the image to be processed wears the mask based on the N initial feature images, the processor is specifically configured to perform fusion processing on the N initial feature images through a feature pyramid network to obtain N fusion feature images; performing convolution operation on each fusion feature map by adopting conventional convolution, converting the fusion feature map into global features, extracting features of the global features obtained by conversion, converting the global features obtained by extraction into local features, and performing fusion processing on convolution operation results and the local features obtained by conversion to obtain a target feature map; and based on the N target feature images, identifying whether the face in the image to be processed wears the mask.
In some embodiments, the processor is specifically configured to perform feature extraction on any global feature according to the following steps:
inputting the global features into a self-attention model for feature dependency analysis; carrying out fusion processing on the global features and the output results of the self-attention model to obtain global reference features; carrying out fusion processing on each feature vector in the global reference features through a multi-layer perceptron; and carrying out fusion processing on the global reference feature and the fusion processed global reference feature to obtain a global feature extraction result.
In some embodiments, the processor is specifically configured to perform feature extraction on the local features according to the following steps:
carrying out channel lifting processing on the local features through conventional convolution, carrying out convolution operation on the local features after the channel lifting processing by utilizing depth separable convolution and cavity convolution, carrying out fusion processing on each operation result, and carrying out channel descending processing on the local features obtained after the fusion processing through conventional convolution;
carrying out channel lifting processing, by conventional convolution, on the local features obtained after the channel descending processing, carrying out convolution operation on the channel-lifted local features by depth separable convolution and cavity convolution respectively, carrying out fusion processing on each operation result to obtain local reference features, inputting the local reference features into an attention model to obtain the weight of each channel in the local reference features, multiplying the data corresponding to each channel in the local reference features by the weight of that channel, and carrying out channel descending processing on the multiplied local reference features by conventional convolution to obtain local feature extraction results.
In some embodiments, the processor is specifically configured to perform feature extraction on the local features according to the following steps:
carrying out channel lifting processing on the local features through conventional convolution, carrying out convolution operation on the local features after channel lifting processing by using depth separable convolution and cavity convolution, carrying out fusion processing on each operation result to obtain local reference features, inputting the local reference features into an attention model to obtain weights of channels in the local reference features, carrying out multiplication processing on data corresponding to the channels in the local reference features by using the weights of each channel, and carrying out channel dropping processing on the local reference features after multiplication processing through conventional convolution;
carrying out channel lifting processing, through phantom convolution, on the local features obtained after the channel dropping processing, carrying out convolution operation on the channel-lifted local features through depth separable convolution to obtain intermediate features, inputting the intermediate features into an attention model to obtain the weight of each channel in the intermediate features, multiplying the data corresponding to each channel in the intermediate features by the weight of that channel, and carrying out channel dropping processing on the multiplied intermediate features through phantom convolution to obtain local feature extraction results.
In some embodiments, in determining a third global feature based on the second global feature, the processor is specifically configured to determine the second global feature as the third global feature; or, based on the second local feature, performing fusion processing on the second global feature, and determining the fused second global feature as the third global feature; or, based on the second local feature, performing fusion processing on the second global feature, and performing feature extraction on the fused second global feature to obtain the third global feature.
In some embodiments, in determining a third local feature based on the fused second local feature, the processor is specifically configured to determine the fused second local feature as the third local feature; or, performing convolution operation on the fused second local feature to obtain the third local feature.
In some embodiments, the image to be processed is an in-vehicle image, and when there is a face in the in-vehicle image that does not wear a mask, the processor is further configured to:
if the in-car image is acquired from the position of the car door when the car door is in an open state, tracking the face of the person not wearing the mask in the in-car image, and triggering a first alarm when the tracking result shows that the corresponding passenger enters the interior of the car;
If the in-vehicle image is acquired from the inside of the vehicle, determining the in-vehicle position corresponding to the face of the mask not worn in the in-vehicle image based on the position information of the face of the mask not worn in the in-vehicle image and the established corresponding relation between the position in the image and the in-vehicle position, and triggering a second alarm based on the determined in-vehicle position.
In a second aspect, an embodiment of the present application provides a method for identifying whether a face in an image wears a mask, including:
feature extraction is carried out on an image to be processed through a backbone network to obtain N initial feature maps, wherein the backbone network comprises at least two feature extraction layers, each feature extraction layer is provided with a first inlet, a second inlet, a first outlet and a second outlet, the first outlet is connected with the first inlet of the next feature extraction layer, and the second outlet is connected with the second inlet of the next feature extraction layer; each feature extraction layer is used for carrying out feature extraction on a first global feature entering from its own first inlet to obtain a second global feature, carrying out feature extraction on a first local feature entering from its own second inlet to obtain a second local feature, carrying out fusion processing on the second local feature based on the second global feature, determining a third global feature based on the second global feature, determining a third local feature based on the fused second local feature, outputting the third global feature from its own first outlet, and outputting the third local feature from its own second outlet; the N initial feature maps are N local features selected from the feature extraction layers, wherein N is an integer;
Based on the N initial feature images, identifying whether a face in the image to be processed wears a mask or not;
and outputting the identification result.
In the embodiment of the application, feature extraction is performed on an image to be processed through a backbone network to obtain N initial feature maps, whether a face in the image to be processed wears a mask is identified based on the N initial feature maps, and an identification result is output. The backbone network comprises at least two feature extraction layers, each feature extraction layer comprises a first inlet, a second inlet, a first outlet and a second outlet, the first outlet is connected with the first inlet of the next feature extraction layer, and the second outlet is connected with the second inlet of the next feature extraction layer. Each feature extraction layer is used for carrying out feature extraction on a first global feature entering from its own first inlet to obtain a second global feature, carrying out feature extraction on a first local feature entering from its own second inlet to obtain a second local feature, carrying out fusion processing on the second local feature based on the second global feature, determining a third global feature based on the second global feature, determining a third local feature based on the fused second local feature, outputting the third global feature from its own first outlet, and outputting the third local feature from its own second outlet; the N initial feature maps are N local features selected from the feature extraction layers, wherein N is an integer. In this way, global feature extraction and local feature extraction are successively performed on the image to be processed through the plurality of feature extraction layers, and in each feature extraction layer the global features extracted by that layer are used to fuse the local features extracted by that layer, so that the richness and accuracy of the local feature expression are improved, and whether the face in the image to be processed wears a mask is identified based on local features with rich and accurate expression, which is beneficial to improving the identification accuracy. In addition, this scheme does not require detection personnel to be arranged, so the labor cost is reduced and the infection probability is lowered.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a schematic diagram of a network structure of a face detection algorithm according to an embodiment of the present application;
fig. 2a is a schematic structural diagram of a local feature extraction BLOCK (BLOCK a) according to an embodiment of the present application;
FIG. 2B is a schematic diagram of a structure of a local feature extraction BLOCK (BLOCK B) according to an embodiment of the present application;
FIG. 2C is a schematic diagram of a structure of a local feature extraction BLOCK (BLOCK C) according to an embodiment of the present application;
fig. 3a is a schematic structural diagram of a global feature extraction block (TransBlock) according to an embodiment of the present application;
fig. 3b is a schematic structural diagram of another TransBlock according to an embodiment of the present disclosure;
fig. 4a is a schematic structural diagram of a feature extraction layer according to an embodiment of the present application;
FIG. 4b is a schematic structural diagram of a feature extraction layer according to an embodiment of the present disclosure;
FIG. 4c is a schematic structural diagram of a feature extraction layer according to an embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of a Backbone according to an embodiment of the present application;
Fig. 6 is a schematic structural diagram of an FPN according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a converged network according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of an SSH according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a Head according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a detection process according to an embodiment of the present disclosure;
FIG. 12 is a schematic diagram of a model migration process according to an embodiment of the present disclosure;
fig. 13 is a flowchart of a method for identifying whether a face in an image wears a mask according to an embodiment of the present application;
fig. 14 is a schematic hardware structure of still another electronic device according to an embodiment of the present application.
Detailed Description
In order to solve the problems of low detection accuracy, high detection cost and the like in the prior art that whether the mask is worn or not is manually judged, the embodiment of the application provides electronic equipment and a method for identifying whether the mask is worn on a face in an image.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and are not intended to limit the present application, and embodiments and features of embodiments of the present application may be combined with each other without conflict.
In order to facilitate understanding of the present application, the present application refers to the technical terms:
the global feature can be expressed in the form of a feature vector sequence, and each feature vector in the feature vector sequence corresponds to one image block in the image to be processed and is used for representing the association relationship between the image block and other image blocks.
A local feature, which may be represented as a feature map, each element in the feature map corresponds to an image block in the image to be processed, the element being used to characterize the image features of the image block.
Conventional convolution refers to convolution other than phantom convolution, hole convolution, and depth separable convolution.
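To make the two feature forms and the conversions between them (the Conv2Trans and Trans2Conv operations referred to later in connection with fig. 4a and fig. 7) concrete, the following is a minimal PyTorch sketch that is not taken from the patent text. The 16×16 average-pooling patch size used to turn a 320×320 map into a 400-token sequence, and the nearest-neighbour stretching back to a map, are illustrative assumptions only.

```python
# Local feature: a B x C x H x W feature map. Global feature: a B x N x D feature-vector
# sequence with one vector per image block. Patch size and pooling choice are assumptions.
import torch
import torch.nn.functional as F

def conv2trans(feature_map: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Local feature map (B, C, H, W) -> global feature sequence (B, N, C)."""
    pooled = F.avg_pool2d(feature_map, kernel_size=block)     # (B, C, H/block, W/block)
    return pooled.flatten(2).transpose(1, 2)                  # (B, N, C), N = (H/block)*(W/block)

def trans2conv(feature_seq: torch.Tensor, out_hw: tuple) -> torch.Tensor:
    """Global feature sequence (B, N, C) -> local feature map (B, C, H, W)."""
    b, n, c = feature_seq.shape
    side = int(n ** 0.5)                                       # assume a square token grid
    grid = feature_seq.transpose(1, 2).reshape(b, c, side, side)
    return F.interpolate(grid, size=out_hw, mode="nearest")   # stretch back to the map resolution

x = torch.randn(1, 384, 320, 320)
seq = conv2trans(x)              # (1, 400, 384), matching the 400 x 384 sequence described below
y = trans2conv(seq, (320, 320))  # back to (1, 384, 320, 320)
```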
It should be noted that the method for identifying whether the face wears the mask in the image provided by the embodiment of the application can be applied to a scene where whether the mask is required to be worn or not, such as whether the mask is worn by an inbound person in a subway station, a railway station, an airport or the like, whether the mask is worn by a passenger is detected on a bus, whether the mask is worn by a passer person is detected at an entrance of a mall, an office building or the like, and the like.
In order to accurately detect whether each face in an image wears a mask, the embodiment of the present application provides a face detection algorithm, referring to fig. 1, including five parts:
A Backbone network (Backbone), configured to perform feature extraction on an image to be detected to obtain N initial feature maps, where N=3 in fig. 1;
the feature pyramid network (Feature Pyramid Networks, FPN) is used for carrying out fusion processing on the 3 initial feature maps with different depths obtained by the Backbone, so as to more effectively detect faces of different scales, and outputs 3 fusion feature maps;
each fusion network is used for carrying out global feature extraction and local feature extraction on the fusion feature map entering the fusion network, and carrying out fusion processing on the extracted local features by utilizing the extracted global features to obtain a target feature map;
and the single-stage headless network (Single Stage Headless, SSH) is used for carrying out convolution operation of different sizes on the target feature images obtained by the fusion network connected with the single-stage headless network, and carrying out fusion processing on each operation result so as to improve the feature expression accuracy of the target feature images.
A Head network (Head) for classifying and regressing a target feature map of SSH output associated with itself, the classification comprising: determining the probability that each image block in the image to be processed corresponds to a mask-worn face, a mask-not-worn face and a non-face respectively, wherein the regression comprises face position information regression and face key point information regression, the face position information regression comprises the position coordinates of a face frame, and the face key point information regression comprises the position coordinates of face key points.
Subsequently, the probabilities, output by the plurality of Heads, that each image block in the image to be processed corresponds to a face not wearing a mask are comprehensively considered to judge whether a face not wearing a mask exists in the image block. When a face not wearing a mask exists in the image, the final position coordinates of the face frame in the image block are determined based on the position coordinates of the face frame in the image block output by the plurality of Heads, and the final position coordinates of the face key points in the image block are determined based on the position coordinates of the face key points in the image block output by the plurality of Heads.
The five parts are described below.
1. Backbone
Fig. 2a is a schematic structural diagram of a local feature extraction BLOCK (BLOCK a) according to an embodiment of the present application, and a feature extraction process of the BLOCK a is as follows:
the input local features (expressed in the form of feature graphs) are subjected to up-channel processing through conventional convolution Conv2d_1x1, the feature graphs after up-channel processing are subjected to depth separable convolution Conv2d_dw operation and cavity convolution Conv2d_da operation respectively, the depth separable convolution operation result and the cavity convolution operation result are subjected to fusion ConCat processing, the feature graphs obtained after fusion processing are subjected to down-channel processing through conventional convolution Conv2d_1x1, and the feature graphs after down-channel processing are output.
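The following is a hedged PyTorch sketch of a BlockA-style block as described above: a 1x1 convolution to lift channels, parallel depthwise and dilated ("cavity") convolutions, concatenation of both results, and a 1x1 convolution to reduce channels. The channel counts, kernel size and dilation rate are illustrative assumptions, not values fixed by this application.

```python
import torch
import torch.nn as nn

class BlockA(nn.Module):
    def __init__(self, in_ch=16, mid_ch=64, out_ch=16, dilation=2):
        super().__init__()
        self.up = nn.Conv2d(in_ch, mid_ch, kernel_size=1)                 # Conv2d_1x1, channel lifting
        self.dw = nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch)  # Conv2d_dw, depthwise
        self.da = nn.Conv2d(mid_ch, mid_ch, 3, padding=dilation,
                            dilation=dilation)                            # Conv2d_da, dilated (cavity)
        self.down = nn.Conv2d(2 * mid_ch, out_ch, kernel_size=1)          # Conv2d_1x1, channel reduction

    def forward(self, x):
        x = self.up(x)
        fused = torch.cat([self.dw(x), self.da(x)], dim=1)                # ConCat fusion of both results
        return self.down(fused)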
Fig. 2B is a schematic structural diagram of another local feature extraction BLOCK (BLOCK B) according to an embodiment of the present application, where a feature extraction process of BLOCK B is as follows:
the method comprises the steps of carrying out channel lifting processing on input local features (represented by a feature map) through conventional convolution Conv2d_1x1, carrying out depth separable convolution operation Conv2d_dw and cavity convolution Conv2d_da operation on the feature map after channel lifting processing, carrying out fusion ConCat processing on a depth separable convolution operation result and a cavity convolution operation result, inputting the feature map obtained after fusion processing into an attention model (corresponding avg_pool-Conv2d-ReLU-Conv2d-Sigmoid structure), calculating the weight of each channel in the feature map, carrying out multiplication processing (corresponding to operations carried out by Scale) on data corresponding to the channel in the feature map through the weight of each channel, carrying out channel dropping processing on the feature map after multiplication processing through conventional convolution Conv2d_1x1, and outputting the feature map after channel dropping processing.
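A hedged sketch of a BlockB-style block follows: like BlockA, but with an avg_pool-Conv2d-ReLU-Conv2d-Sigmoid channel-attention branch whose per-channel weights rescale (Scale) the fused features before the final 1x1 channel-reduction convolution. The channel-reduction ratio of 4 and the other hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class BlockB(nn.Module):
    def __init__(self, in_ch=16, mid_ch=64, out_ch=16, dilation=2):
        super().__init__()
        self.up = nn.Conv2d(in_ch, mid_ch, 1)
        self.dw = nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch)
        self.da = nn.Conv2d(mid_ch, mid_ch, 3, padding=dilation, dilation=dilation)
        fused_ch = 2 * mid_ch
        self.attn = nn.Sequential(                                # avg_pool-Conv2d-ReLU-Conv2d-Sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(fused_ch, fused_ch // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(fused_ch // 4, fused_ch, 1), nn.Sigmoid())
        self.down = nn.Conv2d(fused_ch, out_ch, 1)

    def forward(self, x):
        x = self.up(x)
        fused = torch.cat([self.dw(x), self.da(x)], dim=1)
        fused = fused * self.attn(fused)                          # Scale: weight each channel
        return self.down(fused)
```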
Fig. 2C is a schematic structural diagram of another local feature extraction BLOCK (BLOCK C) according to an embodiment of the present application, where a feature extraction process of the BLOCK C is as follows:
the method comprises the steps of carrying out channel lifting processing on input local features (expressed in the form of feature graphs) through phantom convolution Conv2d_go, carrying out depth separable convolution Conv2d_dw operation on the feature graphs after channel lifting processing, inputting the feature graphs obtained through the depth separable convolution operation into an attention model (corresponding to avg_pool-Conv2d-ReLU-Conv2d-Sigmoid structure), calculating weight of each channel, carrying out multiplication processing (corresponding to operations carried out by Scale) on data corresponding to the channel in the feature graphs through weight of each channel, carrying out channel dropping processing on the feature graphs after the multiplication processing through phantom convolution Conv2d_go, and outputting the feature graphs after the channel dropping processing.
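Below is a hedged sketch of a BlockC-style block. "Phantom convolution" (ghost convolution) is sketched in its common form, a small ordinary convolution plus a cheap depthwise convolution whose outputs are concatenated; this formulation and all channel counts are assumptions rather than the patent's exact operators.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """A common ghost ("phantom") convolution formulation; assumed, not specified by the patent."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        half = out_ch // 2
        self.primary = nn.Conv2d(in_ch, half, 1)
        self.cheap = nn.Conv2d(half, out_ch - half, 3, padding=1, groups=half)

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class BlockC(nn.Module):
    def __init__(self, in_ch=16, mid_ch=64, out_ch=16):
        super().__init__()
        self.up = GhostConv(in_ch, mid_ch)                                 # Conv2d_go, channel lifting
        self.dw = nn.Conv2d(mid_ch, mid_ch, 3, padding=1, groups=mid_ch)   # Conv2d_dw
        self.attn = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(mid_ch, mid_ch // 4, 1), nn.ReLU(inplace=True),
                                  nn.Conv2d(mid_ch // 4, mid_ch, 1), nn.Sigmoid())
        self.down = GhostConv(mid_ch, out_ch)                              # Conv2d_go, channel reduction

    def forward(self, x):
        x = self.dw(self.up(x))
        x = x * self.attn(x)                                               # Scale
        return self.down(x)
```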
Fig. 3a is a schematic structural diagram of a global feature extraction block (TransBlock) according to an embodiment of the present application, where the TransBlock performs global feature extraction by using a self-attention model and a multi-layer perceptron, and a feature extraction process of the TransBlock is as follows:
the method comprises the steps of carrying out normalization processing on input global features (expressed in the form of feature vector sequences), inputting the feature vector sequences after normalization processing into a self-attention model for feature dependency analysis, carrying out fusion processing (such as addition processing on elements at the same positions) on the input feature vector sequences and the feature vector sequences output by the self-attention model to obtain new feature vector sequences, carrying out normalization processing on the new feature vector sequences, carrying out fusion processing on each feature vector in the feature vector sequences after normalization processing through a multi-layer perceptron to improve feature expression accuracy of each feature vector, then carrying out fusion processing (such as addition processing on elements at the same positions) on the new feature vector sequences and the feature vector sequences output by the multi-layer perceptron, and outputting the feature vector sequences after fusion processing.
Fig. 3b is a schematic structural diagram of another TransBlock provided in an embodiment of the present application, where LayerNorm is a normalization layer, MatMul is a matrix multiplication function, Softmax is an activation function, DropOut is a discard layer, Linear is a fully connected layer, and GELU is an activation function. In fig. 3b, the feature vector sequence (denoted as A) output by the first LayerNorm is expanded into 3 feature vector sequences (denoted as A1, A2 and A3). The MatMul-Softmax-Dropout-MatMul-Linear-Dropout structure corresponds to the self-attention model in fig. 3a. In the self-attention model, the first MatMul multiplies the matrices formed by two of the feature vector sequences obtained by expansion, such as A1 and A2 (one of the matrices may be transposed first), so as to obtain the association degree of each feature vector in A with the remaining feature vectors; Softmax maps the values in the association degree matrix obtained by the multiplication into numbers between 0 and 1; the second MatMul multiplies the association degree matrix processed by the Softmax function with the matrix formed by the feature vector sequence not used by the first MatMul, i.e. the matrix formed by A3; Linear performs full connection processing on the feature vector sequence in the matrix obtained by the multiplication; and the two Dropout layers in the self-attention model randomly discard some neurons during training, so as to prevent overfitting and improve the generalization capability of the finally obtained recognition model. The Linear-GELU-Dropout-Linear-Dropout structure corresponds to the multi-layer perceptron in fig. 3a: the first Linear performs full connection processing on the input feature vector sequence, the GELU adds nonlinear factors to the fully connected feature vector sequence so as to improve generalization capability, the second Linear performs full connection processing on the feature vector sequence after the nonlinear factors are added, and the two Dropout layers in the multi-layer perceptron randomly discard some neurons during training, so as to prevent overfitting and improve the generalization capability of the finally obtained recognition model.
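As a concrete illustration, the following hedged PyTorch sketch assembles a TransBlock-style global feature extraction block: LayerNorm, self-attention (PyTorch's MultiheadAttention standing in for the MatMul/Softmax/Dropout/Linear chain of fig. 3b), a residual fusion, then LayerNorm, a Linear-GELU-Dropout-Linear-Dropout perceptron and a second residual fusion. The dimension, head count, expansion ratio and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class TransBlock(nn.Module):
    def __init__(self, dim=384, heads=6, mlp_ratio=4, drop=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=drop, batch_first=True)
        self.proj_drop = nn.Dropout(drop)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_ratio * dim), nn.GELU(), nn.Dropout(drop),
                                 nn.Linear(mlp_ratio * dim, dim), nn.Dropout(drop))

    def forward(self, x):                    # x: (B, N, dim) feature-vector sequence
        h = self.norm1(x)
        h, _ = self.attn(h, h, h)            # Q, K, V all expanded from the normalized sequence
        x = x + self.proj_drop(h)            # fuse the input sequence with the self-attention output
        return x + self.mlp(self.norm2(x))   # fuse with the multi-layer perceptron output

seq = torch.randn(1, 400, 384)
out = TransBlock()(seq)                      # (1, 400, 384)
```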
In order to obtain better feature extraction capability, the Backbone provided in the embodiment of the present application may include a plurality of feature extraction layers, where each feature extraction layer has a first inlet, a second inlet, a first outlet and a second outlet, the first outlet of each feature extraction layer is directly or indirectly connected to the first inlet of the next feature extraction layer, and the second outlet of each feature extraction layer is directly or indirectly connected to the second inlet of the next feature extraction layer. Each feature extraction layer can perform feature extraction on a first global feature entering from its own first inlet to obtain a second global feature, perform feature extraction on a first local feature entering from its own second inlet to obtain a second local feature, perform fusion processing on the second local feature based on the second global feature, determine a third global feature based on the second global feature, determine a third local feature based on the fused second local feature, output the third global feature from its own first outlet, and output the third local feature from its own second outlet.
When the second local features are fused based on the second global features, the second global features can be directly used for fusing the second local features, so that the local features can be corrected by using the global features, the expression accuracy of the local features is improved, the calculated amount is smaller, and the recognition speed is improved. In addition, the second global feature may be fused based on the second local feature, and then feature extraction may be performed on the fused second global feature to obtain a new global feature, and the second local feature may be fused based on the new global feature. Therefore, the method is equivalent to correcting the global features by the local features, correcting the local features by the corrected global features, and enabling feature fusion to be more sufficient, so that the method is beneficial to improving the expression accuracy of the local features and the subsequent recognition accuracy of whether the mask is worn on the face in the image.
In the implementation, when the third global feature is determined based on the second global feature, the second global feature can be directly used as the third global feature, fusion processing can be performed on the second global feature based on the second local feature, the fused second global feature is used as the third global feature, fusion processing can be performed on the second global feature based on the second local feature, and feature extraction is performed on the fused second global feature to obtain the third global feature. In this way, the determination of the third global feature in a flexible and versatile manner may increase the flexibility of a single feature extraction layer.
Similarly, when the third local feature is determined based on the fused second local feature, the fused second local feature may be directly used as the third local feature, or convolution operation may be performed on the fused second local feature to obtain the third local feature. In this way, the determination of the third global feature in a flexible and versatile manner may increase the flexibility of a single feature extraction layer.
Fig. 4a is a schematic structural diagram of a feature extraction layer provided in the embodiment of the present application. The first TransBlock performs feature extraction on the 400×384 feature vector sequence input from the first inlet, and BlockA performs feature extraction on the 320×320×16 feature map input from the second inlet. The first conventional convolution Conv2d_3x3 performs up-channel processing on the 320×320×16 feature map output by BlockA to obtain a 320×320×384 feature map; on the one hand this feature map is converted (corresponding to the Conv2Trans operation) into a 400×384 feature vector sequence, and on the other hand a second Conv2d_3x3 performs a convolution operation on it. The 400×384 feature vector sequence obtained by the conversion and the 400×384 feature vector sequence output by the first TransBlock are fused (for example, elements at the same position are added), and the second TransBlock performs feature extraction on the fused 400×384 feature vector sequence. On the one hand, the 400×384 feature vector sequence obtained after the extraction is taken as the output of the first outlet; on the other hand, it is stretched (corresponding to the Trans2Conv operation) into a 320×320×384 feature map, which is fused (for example, elements at the same position are added) with the 320×320×384 feature map output by the second Conv2d_3x3, and a third convolution Conv2d_3x3 performs a convolution operation on the fused 320×320×384 feature map; the resulting 320×320×16 feature map is taken as the output of the second outlet, or the fused 320×320×384 feature map may be directly taken as the output of the second outlet (shown by the dotted line in fig. 4a). Here 400×384 means that the image to be processed contains 400 image blocks and the feature vector of each image block is 384-dimensional, and 320×320×384 denotes width × height × number of channels.
It should be noted that fig. 4a is only an example, the feature vector sequence entering the first exit may be the output of the first TransBlock in fig. 4a, the feature extraction layer may refer to fig. 4b, the feature vector sequence entering the first exit may be the input of the second TransBlock in fig. 4a, and the feature extraction layer may refer to fig. 4c. In fig. 4b and 4c, the feature map entering the second outlet may be the feature map of 320×320×384 after fusion, or may be the feature map of 320×320×384 after convolution operation.
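As a concrete illustration of how the two branches of one such layer interact, the following hedged PyTorch sketch wires together the TransBlock, BlockA, conv2trans and trans2conv sketches given earlier in the manner of fig. 4a. The channel numbers follow the 320×320×16 / 320×320×384 / 400×384 shapes mentioned above; everything else (kernel sizes, the choice of BlockA for the local branch, the pooling-based conversion) is an assumption, not the patent's exact implementation.

```python
import torch
import torch.nn as nn

class FeatureExtractionLayer(nn.Module):
    """Fig. 4a-style layer; reuses TransBlock, BlockA, conv2trans, trans2conv from the sketches above."""
    def __init__(self, dim=384, local_ch=16):
        super().__init__()
        self.trans1 = TransBlock(dim)                          # global branch, first extraction
        self.local = BlockA(local_ch, 64, local_ch)            # local branch extraction
        self.up = nn.Conv2d(local_ch, dim, 3, padding=1)       # first Conv2d_3x3, channel lifting
        self.mid = nn.Conv2d(dim, dim, 3, padding=1)           # second Conv2d_3x3
        self.trans2 = TransBlock(dim)                          # global branch, second extraction
        self.down = nn.Conv2d(dim, local_ch, 3, padding=1)     # third Conv2d_3x3, channel reduction

    def forward(self, global_seq, local_map):
        g = self.trans1(global_seq)                            # second global feature
        l = self.up(self.local(local_map))                     # second local feature, lifted to dim channels
        g = self.trans2(g + conv2trans(l))                     # fuse local into global, extract again
        fused_map = self.mid(l) + trans2conv(g, l.shape[-2:])  # fuse the new global back into the local map
        return g, self.down(fused_map)                         # third global / third local feature
```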
Fig. 5 is a schematic structural diagram of a Backbone according to an embodiment of the present application, which includes a plurality of the feature extraction layers shown in fig. 4a to 4c. The feature vector sequence entering the first inlet of the first feature extraction layer is obtained by performing conversion processing (corresponding to the operation performed by Conv2Trans (400, 384)) on a first feature map, and the feature map entering the second inlet of the first feature extraction layer is obtained by performing the convolution operation Conv2d_3x3 (320, 320, 16) on the image to be processed. Within the same feature extraction layer, fusion of local features and global features is realized through up-sampling and down-sampling between the Transformer branch (corresponding to the left branch, used for extracting global features) and the convolutional neural network branch (corresponding to the right branch, used for extracting local features), so as to improve the feature extraction capability. Different feature extraction layers are connected through a plurality of feature extraction structures and/or a plurality of convolution layers.
Overall, the Backbone may use multiple TransBlocks for global feature extraction, and may freely combine BlockA, BlockB and BlockC for local feature extraction. For example, in fig. 5, the front feature extraction layers use only BlockA for local feature extraction; the middle feature extraction layers connect BlockA and BlockB in series for local feature extraction (BlockA may be used first for local feature extraction and BlockB then used for feature extraction on the extraction result of BlockA, or BlockB may be used first and BlockA then used for feature extraction on the extraction result of BlockB); and the rear feature extraction layers connect BlockB and BlockC in series for local feature extraction (BlockB may be used first for local feature extraction and BlockC then used for feature extraction on the extraction result of BlockB). In this way, the front part focuses on expanding local features, the middle part focuses on local feature extraction capability, and the rear part focuses on improving feature extraction speed, which is beneficial to balancing the conflict between operation amount and operation time. Also shown in fig. 5 are the selected positions of the 3 initial feature maps: Feature Map1, Feature Map2 and Feature Map3.
In addition, it should be noted that fig. 5 is only an example. In practical applications, the structures of the plurality of feature extraction layers included in the Backbone may differ; that is, the plurality of feature extraction layers of the Backbone may be any combination of fig. 4a, fig. 4b and fig. 4c, and each type of feature extraction layer may appear multiple times in the Backbone.
2. FPN (Feature Pyramid Network)
Considering that the initial feature maps output by the Backbone have higher feature complexity but lower spatial resolution, in order to comprehensively consider features of different spatial scales, the FPN in the embodiment of the present application uses two paths, namely a bottom-up path and a top-down path, to perform feature fusion.
Fig. 6 is a schematic structural diagram of an FPN provided in the embodiment of the present application. Convolution (Conv) and up-sampling (UpSampling) are performed on the 20×20×256 Feature Map3 to obtain a 40×40×128 feature map, which is fused (Concat) with the 40×40×128 Feature Map2 and processed by Conv×5 to obtain a 40×40×64 feature map. Convolution (Conv) and up-sampling (UpSampling) are performed on this feature map to obtain an 80×80×64 feature map, which is fused (Concat) with the 80×80×64 Feature Map1 and processed by Conv×5 to obtain an 80×80×64 feature map that is output as Feature Map11 (i.e., the first fusion feature map). Meanwhile, this feature map is down-sampled (DownSampling) to obtain a 40×40×64 feature map, which is fused with the previously obtained 40×40×64 feature map and processed by Conv×5 to obtain a 40×40×64 feature map that is output as Feature Map22 (i.e., the second fusion feature map). Meanwhile, this feature map is down-sampled (DownSampling) to obtain a 20×20×64 feature map, which is fused (Concat) with Feature Map3 and processed by Conv×5 to obtain the 20×20×64 Feature Map33 (i.e., the third fusion feature map) that is output.
Thus, the method is beneficial to accurately reserving space information and is beneficial to accurately carrying out face frame regression, so that the accuracy of wearing mask detection is improved.
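The following hedged PyTorch sketch mirrors the bidirectional fusion in fig. 6: a top-down path (Conv + UpSampling + Concat) followed by a bottom-up path (DownSampling + Concat), with a fusion convolution after every concatenation. The Conv×5 stack is collapsed here to a single 3x3 convolution, and the pooling used for down-sampling is an assumption; only the 80×80×64 / 40×40×128 / 20×20×256 shapes follow the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleBiFPN(nn.Module):
    def __init__(self):
        super().__init__()
        self.reduce3 = nn.Conv2d(256, 128, 1)                 # Conv before up-sampling Feature Map3
        self.fuse2 = nn.Conv2d(128 + 128, 64, 3, padding=1)   # stands in for the Conv x 5 stack
        self.reduce2 = nn.Conv2d(64, 64, 1)
        self.fuse1 = nn.Conv2d(64 + 64, 64, 3, padding=1)
        self.fuse2b = nn.Conv2d(64 + 64, 64, 3, padding=1)    # bottom-up fusion at 40x40
        self.fuse3b = nn.Conv2d(64 + 256, 64, 3, padding=1)   # bottom-up fusion at 20x20

    def forward(self, f1, f2, f3):                            # 80x80x64, 40x40x128, 20x20x256
        t3 = F.interpolate(self.reduce3(f3), scale_factor=2)  # UpSampling to 40x40
        m2 = self.fuse2(torch.cat([t3, f2], 1))               # 40x40x64
        t2 = F.interpolate(self.reduce2(m2), scale_factor=2)  # UpSampling to 80x80
        out1 = self.fuse1(torch.cat([t2, f1], 1))             # Feature Map11, 80x80x64
        d1 = F.max_pool2d(out1, 2)                            # DownSampling to 40x40
        out2 = self.fuse2b(torch.cat([d1, m2], 1))            # Feature Map22, 40x40x64
        d2 = F.max_pool2d(out2, 2)                            # DownSampling to 20x20
        out3 = self.fuse3b(torch.cat([d2, f3], 1))            # Feature Map33, 20x20x64
        return out1, out2, out3
```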
3. Converged network
Each fusion feature map output through the FPN can better express the face features in a region, and in order to further improve the expression accuracy of the face features, the local features and the global features of the fusion feature map can be extracted by means of a fusion network, and the global features are used for correcting the local features.
Fig. 7 is a schematic structural diagram of a fusion network provided in the embodiment of the present application. Taking Feature Map11 as an example, convolution processing is performed on Feature Map11 by a first convolution Conv and a second convolution Conv; conversion processing (corresponding to the operation performed by Conv2Trans) is performed on the 80×80×384 feature map obtained by the second convolution to obtain a 400×384 feature vector sequence; global feature extraction is then performed on this feature vector sequence by a TransBlock; conversion processing (corresponding to the operation performed by Trans2Conv) is performed on the extracted 400×384 feature vector sequence to obtain an 80×80×384 feature map, which is fused (for example, elements at the same position are added) with the 80×80×384 feature map obtained by the first convolution; and the fused 80×80×384 feature map is processed by a third convolution Conv to obtain the 80×80×64 Feature Map111 (i.e., a target feature map).
A 40×40×64 Feature Map222 is similarly obtained from Feature Map22, and a 20×20×64 Feature Map333 is similarly obtained from Feature Map33.
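A hedged sketch of one fig. 7-style fusion network is given below: two convolutions on the fused feature map, conversion of the second result to a feature-vector sequence, a TransBlock, conversion back, a residual fusion with the first convolution result, and a final convolution. It reuses the TransBlock, conv2trans and trans2conv sketches above; apart from the stated 80×80×64 input and output shapes, the channel counts are assumptions.

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    def __init__(self, in_ch=64, dim=384, out_ch=64):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, dim, 3, padding=1)   # first Conv
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1)     # second Conv
        self.trans = TransBlock(dim)
        self.conv3 = nn.Conv2d(dim, out_ch, 3, padding=1)  # third Conv

    def forward(self, x):                                  # x: fused feature map, e.g. (B, 64, 80, 80)
        a = self.conv1(x)
        b = self.conv2(a)
        seq = self.trans(conv2trans(b, block=4))           # global feature extraction on a 400 x 384 sequence
        corrected = trans2conv(seq, b.shape[-2:])          # back to a local feature map
        return self.conv3(a + corrected)                   # fuse and produce the target feature map
```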
4. SSH (SSH)
The SSH can expand the context information of the pre-detection area, and the accuracy of face detection is improved.
Fig. 8 is a schematic structural diagram of an SSH provided in the embodiment of the present application, taking Feature Map111 as an example, feature Map111 is extracted by using Conv2d convolution of one 3*3 on the first branch, feature Map111 is extracted by using Conv2d convolution of three 3*3 on the second branch, feature Map111 is extracted by using Conv2d convolution of two 3*3 on the third branch, and then Feature Map 1111 of 80×80×64 is obtained by performing fusion Concat processing on Feature maps obtained by the three branches. The Feature Map222 gives a Feature Map2222 of 40×40×64, and the Feature Map333 gives a Feature Map3333 of 20×20×64.
Thus, the effect of the convolution of 5*5 and 7*7 is replaced by the convolution of 3*3, so that the receptive field can be enhanced, and the feature extraction effect can be improved.
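A hedged sketch of the fig. 8-style SSH module follows: three branches with one, three and two 3x3 convolutions respectively, whose outputs are concatenated so that stacked 3x3 kernels stand in for 5x5 and 7x7 receptive fields. The per-branch channel split (to keep 64 output channels) and the inter-convolution activations are assumptions.

```python
import torch
import torch.nn as nn

class SSH(nn.Module):
    def __init__(self, in_ch=64, out_ch=64):
        super().__init__()
        half, quarter = out_ch // 2, out_ch // 4
        self.branch1 = nn.Conv2d(in_ch, half, 3, padding=1)                                   # one 3x3
        self.branch2 = nn.Sequential(nn.Conv2d(in_ch, quarter, 3, padding=1), nn.ReLU(True),  # three 3x3
                                     nn.Conv2d(quarter, quarter, 3, padding=1), nn.ReLU(True),
                                     nn.Conv2d(quarter, quarter, 3, padding=1))
        self.branch3 = nn.Sequential(nn.Conv2d(in_ch, quarter, 3, padding=1), nn.ReLU(True),  # two 3x3
                                     nn.Conv2d(quarter, quarter, 3, padding=1))

    def forward(self, x):
        return torch.cat([self.branch1(x), self.branch2(x), self.branch3(x)], dim=1)  # Concat fusion
```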
5. Head
The recognition algorithm provided by the embodiment of the application is a multitasking convolutional neural network, and can output a face probability value, face frame coordinates and face key point coordinates at the same time. In this proposal, the face with the mask and the face without the mask are regarded as two different categories, and the non-face probability, the face probability with the mask and the face probability without the mask are outputted by performing convolution processing on the Feature Map.
Take setting 2 anchor points in each image block of the image to be processed as an example, that is, classification, face frame positioning and face key point detection are performed with each anchor point as a center. Three classes are used in classification, so the number of channels of the feature map obtained by classification is 6 (3 × 2). When the face frame is positioned, each face frame is represented by a center point, a width and a height; the center point has 2 coordinates, each coordinate corresponds to one channel, and the width and the height each correspond to one channel, so the number of channels of the feature map obtained by face frame positioning is 8 (4 × 2). In face key point detection, each face corresponds to 5 key points, each key point has 2 coordinates, and each coordinate corresponds to one channel, so the number of channels of the feature map obtained by face key point detection is 20 (10 × 2).
Fig. 9 is a schematic structural diagram of a Head provided in the embodiment of the present application. Taking Feature1111 as an example, convolution operations are performed on Feature1111 by three convolutions Conv2d_1x1 to obtain feature maps of 80×80×6, 80×80×8 and 80×80×20. Classification is performed by using the 80×80×6 feature map to obtain, for each image block in the image to be processed, the probability that the image block does not contain a face, the probability that it contains a face not wearing a mask, and the probability that it contains a face wearing a mask; face frame positioning is performed by using the 80×80×8 feature map to obtain the face frame position information contained in each image block in the image to be processed; and face key point recognition is performed by using the 80×80×20 feature map to obtain the face key point information contained in each image block in the image to be processed. Similar processing is also performed for Feature2222 and Feature3333, and will not be described in detail herein.
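A hedged sketch of a fig. 9-style Head is shown below: three 1x1 convolutions on the SSH output producing, for 2 anchors per image block, 6 classification channels (3 classes × 2), 8 face-frame channels (4 × 2) and 20 key-point channels (5 points × 2 coordinates × 2). The 64 input channels follow the 80×80×64 feature map in the description; the rest is illustrative.

```python
import torch
import torch.nn as nn

class Head(nn.Module):
    def __init__(self, in_ch=64, num_anchors=2):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, 3 * num_anchors, 1)    # non-face / face with mask / face without mask
        self.box = nn.Conv2d(in_ch, 4 * num_anchors, 1)    # center x, center y, width, height
        self.pts = nn.Conv2d(in_ch, 10 * num_anchors, 1)   # 5 key points x 2 coordinates

    def forward(self, x):                                  # x: (B, 64, 80, 80) from the SSH module
        return self.cls(x), self.box(x), self.pts(x)       # (B,6,80,80), (B,8,80,80), (B,20,80,80)
```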
The network loss function in the embodiment of the application is a multi-objective loss function, comprising a classification loss, a face detection frame regression loss and a face key point regression loss. The loss function is as follows:

L = L_cls(p_i, p_i*) + λ1·L_box(t_i, t_i*) + λ2·L_pts(l_i, l_i*)

wherein L_cls is the classification loss function, p_i is the predicted probability of a certain class in the i-th image sample, and p_i* is the true value of this class in the labeled i-th image sample; L_box is the face frame regression loss function, t_i = {t_x, t_y, t_w, t_h}_i is the predicted position information of the face frame in the i-th image sample, where (t_x, t_y) represents the center point of the face frame, t_w represents the width of the face frame and t_h represents the height of the face frame, and likewise t_i* = {t_x*, t_y*, t_w*, t_h*}_i is the position information of the face frame in the labeled i-th image sample; L_pts is the face key point regression loss function, l_i = {l_x1, l_y1, ..., l_x5, l_y5}_i is the coordinate information of the five face key points in the predicted i-th image sample, and l_i* = {l_x1*, l_y1*, ..., l_x5*, l_y5*}_i is the coordinate information of the five face key points in the labeled i-th image sample. λ1 and λ2 are weight parameters, set for example to 0.25 and 0.1 respectively.
In order to achieve an end-to-end mask detection function, in the embodiment of the application, two face labeling types, namely a face wearing a mask and a face not wearing a mask, are added in the face detection task. The classification loss L_cls is a cross-entropy loss function over three classes (non-face, face wearing a mask and face not wearing a mask), the detection frame regression loss L_box is a Smooth-L1 loss function, and the face key point regression loss L_pts is also a Smooth-L1 loss function.
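A hedged sketch of this multi-task loss is shown below: cross-entropy over the three classes plus Smooth-L1 losses for the face-frame and key-point regressions, weighted by λ1 = 0.25 and λ2 = 0.1. Restricting the two regression terms to samples whose label is a face is an assumption, as is the flat tensor layout.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_pred, cls_gt, box_pred, box_gt, pts_pred, pts_gt,
                   lambda1: float = 0.25, lambda2: float = 0.1) -> torch.Tensor:
    """cls_pred: (M, 3) logits; cls_gt: (M,) long in {0: non-face, 1: face w/ mask, 2: face w/o mask};
    box_pred/box_gt: (M, 4); pts_pred/pts_gt: (M, 10)."""
    l_cls = F.cross_entropy(cls_pred, cls_gt)
    face = cls_gt > 0                                   # regression supervised only on face samples (assumed)
    l_box = F.smooth_l1_loss(box_pred[face], box_gt[face]) if face.any() else box_pred.sum() * 0
    l_pts = F.smooth_l1_loss(pts_pred[face], pts_gt[face]) if face.any() else pts_pred.sum() * 0
    return l_cls + lambda1 * l_box + lambda2 * l_pts
```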
Having described the recognition algorithm used in the embodiments of the present application, the electronic device provided in the embodiments of the present application for implementing the above-described face recognition algorithm is described next.
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application, including:
a communication unit 1001 for acquiring an image to be processed;
the processor 1002 is configured to perform feature extraction on the image to be processed through a backbone network to obtain N initial feature maps, and to identify, based on the N initial feature maps, whether a face in the image to be processed wears a mask. The backbone network includes at least two feature extraction layers, each feature extraction layer has a first inlet, a second inlet, a first outlet and a second outlet, the first outlet is directly or indirectly connected to the first inlet of the next feature extraction layer, and the second outlet is directly or indirectly connected to the second inlet of the next feature extraction layer. Each feature extraction layer is used for carrying out feature extraction on a first global feature entering from its own first inlet to obtain a second global feature, carrying out feature extraction on a first local feature entering from its own second inlet to obtain a second local feature, carrying out fusion processing on the second local feature based on the second global feature, determining a third global feature based on the second global feature, determining a third local feature based on the fused second local feature, outputting the third global feature from its own first outlet, and outputting the third local feature from its own second outlet; the N initial feature maps are N local features selected from the feature extraction layers, wherein N is an integer;
In specific implementation, when fusing the second local feature based on the second global feature, the processor 1002 may convert the second local feature into a global feature, that is, compress the feature map corresponding to the second local feature into a feature vector sequence; then perform fusion processing on the second global feature based on the global feature obtained by the conversion, perform feature extraction on the fused second global feature, and convert the extracted global feature into a local feature, that is, expand the feature vector sequence corresponding to the extracted global feature into a feature map; and finally perform fusion processing on the second local feature based on the local feature obtained by the conversion. This is equivalent to first fusing the second global feature based on the second local feature, which improves the accuracy of the global feature, and then fusing the second local feature based on the fused global feature, which improves the richness and accuracy of the local feature and makes the feature fusion more complete.
When determining the third global feature based on the second global feature, the processor 1002 may directly determine the second global feature as the third global feature, may perform fusion processing on the second global feature based on the second local feature, determine the fused second global feature as the third global feature, and may perform fusion processing on the second global feature based on the second local feature, and perform feature extraction on the fused second global feature to obtain the third global feature. In this way, the third global feature is determined in a flexible and changeable manner, which may promote flexibility of each feature extraction layer.
Similarly, when determining the third local feature based on the fused second local feature, the processor 1002 may directly determine the fused second local feature as the third local feature, or may perform a convolution operation on the fused second local feature to obtain the third local feature. In this way, the third local feature is determined in a flexible and versatile manner to promote flexibility of each feature extraction layer.
The processor 1002 is further configured to identify whether a face in the image to be processed wears a mask based on the N initial feature maps.
In specific implementation, the processor 1002 may perform fusion processing on N initial feature graphs through the feature pyramid network to obtain N fused feature graphs, then perform convolution operation on each fused feature graph by adopting conventional convolution, convert the fused feature graph into global features, perform feature extraction on the global features obtained by conversion, convert the global features obtained by extraction into local features, perform fusion processing on the convolution operation result and the local features obtained by conversion, obtain a target feature graph, and further identify whether a face in an image to be processed wears a mask based on the N target feature graphs.
In this way, the N initial feature graphs with different depths are fused by using the feature pyramid network, then local feature extraction (corresponding convolution operation) and global feature extraction are respectively carried out on each fused feature graph obtained by fusion, and the local feature extraction result (corresponding convolution operation result) is fused by utilizing the global feature extraction result, so that the feature expression accuracy of the obtained target feature graph is improved, and the recognition probability of whether the mask is worn on the face in the subsequent image to be processed is further improved.
In particular implementations, the processor 1002 may perform feature extraction on any global feature by:
inputting the global features into a pre-trained self-attention model to analyze the feature dependency relationship;
the global feature and the output result of the self-attention model are fused to obtain a global reference feature, wherein the output result of the self-attention model has the same number of rows and columns as the global feature, and the elements at the same positions of the two are added to complete the fusion of the global feature and the output result of the self-attention model;
each feature vector in the global reference feature is fused through a multi-layer perceptron, so that the feature expression of the fused feature vector on the corresponding image block is more accurate;
and carrying out fusion processing on the global reference feature and the fusion-processed global reference feature to obtain a global feature extraction result, wherein the two have the same number of rows and columns, and the elements at the same positions of the global reference feature and the fusion-processed global reference feature are added.
In particular implementations, the processor 1002 may perform feature extraction on the local features by:
Carrying out channel lifting processing on the local features through conventional convolution, carrying out convolution operation on the channel-lifted local features by using depth separable convolution and cavity convolution respectively, carrying out fusion processing on the operation results, and carrying out channel dropping processing on the local features obtained after the fusion processing through conventional convolution;
carrying out channel lifting processing, through conventional convolution, on the local features obtained after the channel dropping processing, carrying out convolution operation on the channel-lifted local features by using depth separable convolution and cavity convolution respectively, carrying out fusion processing on the operation results to obtain local reference features, inputting the local reference features into a pre-trained attention model to obtain the weight of each channel in the local reference features, multiplying the data of each channel in the local reference features by the corresponding channel weight, and carrying out channel dropping processing on the multiplied local reference features through conventional convolution.
In this way, this is equivalent to first extracting local features with BlockA and then performing feature extraction on the output of BlockA with BlockB. Since BlockA has a stronger feature expansion capability and BlockB has a stronger feature extraction capability, this helps extract more comprehensive local features from the expanded features.
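The following Python (PyTorch) code sketches one possible form of the two stages described above (referred to here as BlockA and BlockB). The expansion ratio, kernel sizes, dilation rate and the squeeze-and-excitation style channel-attention model are assumptions for illustration, not values fixed by this application.

import torch
import torch.nn as nn

def depthwise_separable(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),   # depthwise convolution
        nn.Conv2d(c_in, c_out, 1))                           # pointwise convolution

class SqueezeExcite(nn.Module):
    # channel-attention model: one weight per channel, multiplied back in
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class BlockA(nn.Module):
    def __init__(self, channels, expand=4):
        super().__init__()
        mid = channels * expand
        self.up = nn.Conv2d(channels, mid, 1)                          # channel lifting
        self.dw = depthwise_separable(mid, mid)                        # depthwise separable conv
        self.dilated = nn.Conv2d(mid, mid, 3, padding=2, dilation=2)   # cavity (dilated) conv
        self.down = nn.Conv2d(mid, channels, 1)                        # channel dropping

    def forward(self, x):
        x = self.up(x)
        return self.down(self.dw(x) + self.dilated(x))                 # fuse, then reduce

class BlockB(BlockA):
    def __init__(self, channels, expand=4):
        super().__init__(channels, expand)
        self.se = SqueezeExcite(channels * expand)

    def forward(self, x):
        x = self.up(x)
        reference = self.dw(x) + self.dilated(x)        # local reference features
        return self.down(self.se(reference))            # weight the channels, then reduce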
In particular implementations, the processor 1002 may perform feature extraction on the local features by:
carrying out channel lifting processing on the local features through conventional convolution, carrying out convolution operation on the channel-lifted local features by using depth separable convolution and cavity convolution respectively, carrying out fusion processing on the operation results to obtain local reference features, inputting the local reference features into an attention model to obtain the weight of each channel in the local reference features, multiplying the data of each channel in the local reference features by the corresponding channel weight, and carrying out channel dropping processing on the multiplied local reference features through conventional convolution;
carrying out channel lifting processing, through phantom convolution, on the local features obtained after the channel dropping processing, carrying out convolution operation on the channel-lifted local features through depth separable convolution to obtain intermediate features, inputting the intermediate features into a pre-trained attention model to obtain the weight of each channel in the intermediate features, multiplying the data of each channel in the intermediate features by the corresponding channel weight, and carrying out channel dropping processing on the multiplied intermediate features through phantom convolution.
In this way, this is equivalent to first extracting local features with BlockB and then performing feature extraction on the output of BlockB with BlockC. Since BlockB has a stronger feature extraction capability and BlockC has a stronger feature compression capability, this helps obtain a better local feature extraction result in a shorter time.
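A possible form of the phantom (ghost) convolution based stage, referred to here as BlockC, is sketched below in Python (PyTorch). The split between the primary and the cheap "ghost" channels, the expansion ratio and the channel-attention reduction ratio are assumptions for illustration.

import torch
import torch.nn as nn

class GhostConv(nn.Module):
    # phantom (ghost) convolution: a cheap depthwise convolution generates
    # half of the output channels from the primary 1x1 convolution output
    # (assumes an even number of output channels)
    def __init__(self, c_in, c_out):
        super().__init__()
        primary = c_out // 2
        self.primary = nn.Conv2d(c_in, primary, 1)
        self.cheap = nn.Conv2d(primary, c_out - primary, 3, padding=1, groups=primary)

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class BlockC(nn.Module):
    def __init__(self, channels, expand=4, reduction=4):
        super().__init__()
        mid = channels * expand
        self.up = GhostConv(channels, mid)                    # channel lifting by phantom conv
        self.dw = nn.Sequential(                              # depthwise separable convolution
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid),
            nn.Conv2d(mid, mid, 1))
        self.se = nn.Sequential(                              # per-channel attention weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(mid, mid // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid // reduction, mid, 1), nn.Sigmoid())
        self.down = GhostConv(mid, channels)                  # channel dropping by phantom conv

    def forward(self, x):
        inter = self.dw(self.up(x))      # intermediate features
        inter = inter * self.se(inter)   # multiply each channel by its weight
        return self.down(inter)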
An output unit 1003 for outputting the recognition result.
In some embodiments, the image to be processed is an in-vehicle image. When a face not wearing a mask is present in the in-vehicle image, the processor 1002 is further configured to: if the in-vehicle image is acquired from the door position while the door is in an open state, track the face not wearing a mask in the in-vehicle image and trigger a first alarm when the tracking result indicates that the corresponding passenger has entered the vehicle; and if the in-vehicle image is acquired from the interior of the vehicle, determine the in-vehicle position corresponding to the face not wearing a mask based on the position information of that face in the in-vehicle image and the established correspondence between positions in the image and positions in the vehicle, and trigger a second alarm based on the determined in-vehicle position.
The output unit 1003 is further configured to output, when the first alarm or the second alarm is triggered, alarm information and/or the face image of the person not wearing a mask cropped from the in-vehicle image.
In the embodiment of the application, global feature extraction and local feature extraction are performed on the image to be processed successively through the plurality of feature extraction layers, and the extracted local features are fused with the global features extracted by the feature extraction layers, which improves the richness and accuracy of the local features, so that whether the face in the image to be processed wears a mask can be identified based on richer and more accurate local features. In addition, the scheme does not require dedicated inspection personnel, which helps reduce labor cost and the probability of infection.
The following describes the present application in a scenario of detecting mask wearing on a bus, where the electronic device may be an On Board Unit (OBU).
Fig. 11 is a schematic diagram of a detection process provided in an embodiment of the present application, including three parts: recognition model training, model migration, and recognition model use. The recognition model training includes constructing a face detection data set and training a recognition model using the face detection data set; the model migration includes migrating the trained recognition model to an OBU; and the recognition model use includes the OBU detecting whether there are passengers not wearing masks on the bus.
These parts are described separately below.
1. Constructing a face detection data set.
In the related art, face image samples are obtained from open-source face data sets. However, most images in open-source face data sets are color or grayscale images collected under visible light and do not contain face images of people wearing masks. Moreover, the monitoring camera on a bus turns on an infrared lamp at night, and the images captured under this illumination are typical near-infrared images, which cannot be obtained directly from open-source face data sets. Therefore, a mask face detection data set first needs to be constructed.
In specific implementation, bus monitoring videos from provinces and cities across the country can be obtained, and images for the mask face detection data set are selected from these videos. For example, images are selected from four time periods, 5:00-7:00, 11:00-13:00, 17:00-19:00 and 21:00-23:00, so that the selected images cover shooting conditions in various weather such as sunny, cloudy, rainy, snowy and foggy days, and cover all four seasons. In addition, the bus monitoring videos include door monitoring videos and carriage monitoring videos; the OpenCV library can be used to read the video files, capturing one frame of the door monitoring video every 1 s at each bus stop and one frame of the carriage monitoring video every 5 s.
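A minimal Python (OpenCV) sketch of this frame-sampling step is given below. The file names are placeholders, and the frame rate is read from the video with an assumed fallback of 25 fps.

import cv2

def sample_frames(video_path, interval_s, out_prefix, default_fps=25.0):
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or default_fps
    step = max(1, int(interval_s * fps))
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_prefix}_{saved:06d}.jpg", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# door camera: one frame per second; carriage camera: one frame every five seconds
sample_frames("door_camera.mp4", 1, "door")
sample_frames("carriage_camera.mp4", 5, "carriage")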
2. Training recognition models using face detection data sets
In specific implementation, the labeling tool LabelImg can be used to label the selected face data set images. The labeling information includes the face frame and the positions of five face key points, namely the center point of the left eye, the center point of the right eye, the nose tip point, the left mouth corner point and the right mouth corner point. The selected face images are also labeled into two categories according to whether the face wears a mask or not.
In addition, in order to enable the face detection data set to cover the face diversity in the real situation as much as possible, enhancement processing may be performed on the face image in the face detection data set. For example, brightness, contrast, chromaticity, saturation, and noise of the image are adjusted, and the image is flipped and rotated. Therefore, the trained recognition model has better robustness on images acquired in different scenes.
Wherein:
the formulas for adjusting the brightness and contrast of the image are as follows:
g(i,j)=a*f(i,j)+b;
where f(i, j) denotes the pixel in the i-th row and j-th column of the source image, g(i, j) denotes the pixel in the i-th row and j-th column of the adjusted image, a is the gain controlling the contrast of the image (a > 0), and b is the bias controlling the brightness of the image.
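A brief Python (OpenCV) sketch of this adjustment is shown below; the file name and the gain and bias values are examples only.

import cv2

def adjust_brightness_contrast(img, a, b):
    # convertScaleAbs computes a*f(i,j) + b and clips the result to [0, 255]
    return cv2.convertScaleAbs(img, alpha=a, beta=b)

img = cv2.imread("face_sample.jpg")          # placeholder file name
brighter = adjust_brightness_contrast(img, a=1.2, b=20)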
The steps of adjusting the chromaticity and saturation of the image are as follows (a brief sketch is given after the list):
1) The RGB image values are normalized to the range [0, 1].
2) The normalized image is color space converted using OpenCV functions, for example, converting an RGB image into an HLS image.
3) The chrominance and saturation components of the HLS image are adjusted using linear transforms.
4) The adjusted image is converted back into an RGB image.
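These four steps may look roughly as follows in Python (OpenCV); the scaling factors applied to the hue and saturation channels are illustrative values.

import cv2
import numpy as np

def adjust_hue_saturation(bgr, hue_scale=1.05, sat_scale=1.2):
    img = bgr.astype(np.float32) / 255.0                       # 1) normalize to [0, 1]
    hls = cv2.cvtColor(img, cv2.COLOR_BGR2HLS)                 # 2) color space conversion
    hls[..., 0] = np.clip(hls[..., 0] * hue_scale, 0, 360)     # 3) linear transform on hue
    hls[..., 2] = np.clip(hls[..., 2] * sat_scale, 0, 1)       #    and on saturation
    out = cv2.cvtColor(hls, cv2.COLOR_HLS2BGR)                 # 4) convert back to RGB/BGR
    return (out * 255).astype(np.uint8)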
Salt-and-pepper noise consists of alternating white and black bright and dark spots produced by the image sensor, the transmission channel, the decoding process and the like. Gaussian noise is produced by insufficient brightness, the intrinsic noise of circuit components, long-term operation of the image sensor, and excessive sensor temperature during shooting. In the embodiment of the application, salt-and-pepper noise and Gaussian noise are added to the face data set images to increase the variability of the input images and simulate images in a real environment. In addition, some images in the face detection data set can be flipped about the x-axis, and affine transformation can be applied to some images to realize image rotation.
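A short Python (NumPy) sketch of the two noise augmentations is given below; the noise ratio and standard deviation are example values.

import numpy as np

def add_salt_pepper(img, ratio=0.01):
    noisy = img.copy()
    mask = np.random.rand(*img.shape[:2])
    noisy[mask < ratio / 2] = 0            # pepper: dark spots
    noisy[mask > 1 - ratio / 2] = 255      # salt: bright spots
    return noisy

def add_gaussian(img, sigma=10.0):
    noise = np.random.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)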
Finally, the sizes of the images in the face detection data set are adjusted. For example, all images used in training are 640×640×3, which matches the input layer size of the network. To this end, the images in the face detection data set are scaled to different sizes, the face frame labels and the five face key-point labels are adjusted accordingly by scaling the original coordinates to the scaled coordinate positions, and the images are then cropped to fit the 640×640 input size of the network.
Then, the network shown in Fig. 1 is trained using the labeled image samples in the face detection data set to obtain the recognition model.
3. Recognition model Int8 quantization and OBU migration
In the embodiment of the application, the recognition model is converted, based on ONNX, into a model for the neural network acceleration processing hardware unit, and Int8 quantization is performed on the recognition model, which further reduces hardware resource consumption during inference and increases the speed of passenger mask-wearing detection.
Fig. 12 is a schematic diagram of a model migration process provided in the embodiment of the present application. The trained PyTorch recognition model is first converted into an ONNX model, and the ONNX model is then converted into an NNIE model for the neural network acceleration processing hardware unit (NNIE). To reduce hardware resource consumption and increase the mask face detection speed, Int8 quantization can be performed on the NNIE model; after Int8 quantization, the model size is reduced to 30% of the original while the model accuracy remains essentially unchanged. Finally, the Int8-quantized model file is loaded onto the OBU.
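The first migration step (PyTorch to ONNX) may look roughly as follows in Python. The file names are placeholders, build_recognition_model is a hypothetical helper standing in for the network construction, the input shape matches the 640×640×3 training resolution, and the subsequent ONNX-to-NNIE conversion and Int8 quantization are performed with the vendor toolchain and are not shown.

import torch

# placeholders: construct the recognition model and load its trained weights
model = build_recognition_model()                                 # hypothetical helper
model.load_state_dict(torch.load("mask_recognition_weights.pt", map_location="cpu"))
model.eval()

dummy = torch.randn(1, 3, 640, 640)        # one 640x640x3 input image
torch.onnx.export(
    model, dummy, "mask_recognition.onnx",
    input_names=["image"], output_names=["detections"],
    opset_version=11)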
4. The OBU detects whether there are passengers not wearing masks on the bus
In the embodiment of the application, a multi-target tracking algorithm is used to dynamically track passengers who are not wearing masks. The association of non-mask-wearing passengers across consecutive video frames is handled with a Kalman filtering algorithm, association measurement is performed with the Hungarian algorithm, and the appearance features of non-mask-wearing passengers are extracted with a convolutional neural network. The passengers are tracked using both motion features and appearance features, which improves the robustness of the multi-target tracking algorithm in scenarios of target loss and target occlusion.
The process of real-time monitoring the wearing condition of the mask of the passenger entering the carriage by the multi-target tracking algorithm is described in detail below.
In the specific implementation, the acquired video frames are input into the recognition model, which outputs, for the current video frame, the face frames of faces wearing masks and the face frames of faces not wearing masks; the non-mask-wearing face targets are then dynamically tracked based on their face frames. In general, different non-mask-wearing face targets are assigned different identification IDs. Across consecutive video frames, the non-mask-wearing face targets with different IDs are tracked to detect whether a passenger without a mask enters the carriage area.
The multi-target tracking algorithm is specifically as follows:
The association of the non-mask-wearing face targets across consecutive video frames is handled with a Kalman filtering algorithm. After detection of non-mask-wearing faces in the current video frame is finished and before the next video frame is detected, the positions of the non-mask-wearing face frames of the current frame in the next frame are predicted by Kalman filtering. In general, four states are needed to describe a face frame: the abscissa of the frame center, the ordinate of the frame center, the size of the frame, and the aspect ratio. These four states describe the basic information of a face frame but cannot fully describe the motion state of the corresponding face target, so the amount of change of each of the four states is also introduced, namely the change speed of the center abscissa, the change speed of the center ordinate, the change speed of the frame size, and the change speed of the aspect ratio. In the embodiment of the present application, the inter-frame displacement is assumed to vary linearly and uniformly, so the state of each face target combines the above 8 pieces of state information.
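The sketch below shows, in Python (NumPy), one possible constant-velocity Kalman filter over this 8-dimensional state (cx, cy, s, r) plus their change speeds; the matrix layout and dt = 1 frame are assumptions consistent with the description above.

import numpy as np

dt = 1.0
# state vector: [cx, cy, s, r, v_cx, v_cy, v_s, v_r]
F = np.eye(8)
F[:4, 4:] = dt * np.eye(4)                       # positions advance by their change speeds
H = np.hstack([np.eye(4), np.zeros((4, 4))])     # only (cx, cy, s, r) are measured

def predict(x, P, Q):
    # predict the face-frame state in the next frame
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, R):
    # correct the prediction with a detected face frame z = (cx, cy, s, r)
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    x = x + K @ (z - H @ x)
    P = (np.eye(8) - K @ H) @ P
    return x, P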
The matching between the positions predicted by the Kalman filter and the positions detected in the next frame is solved with the Hungarian algorithm: a cost matrix is computed by combining the appearance features and the motion features (the Kalman filter predictions) of the non-mask-wearing face targets, and target matching is then performed with the Hungarian algorithm based on this cost matrix.
The distance between the Kalman filter prediction and the non-mask-wearing face frame detected in the latest video frame is measured with the Mahalanobis distance; this is the motion feature. The non-mask-wearing face frames detected in the video frame are input into a pre-trained neural network, which outputs a feature vector of fixed dimension (for example, 128 dimensions) for each non-mask-wearing face target. The minimum cosine distance between the feature vector of the face frame detected in the current video frame and the feature vectors of the corresponding face target in previous video frames is then computed; this is the appearance feature.
The motion and appearance features may then be weighted using the following formula:
c(i, j) = λ·d⁽¹⁾(i, j) + (1 − λ)·d⁽²⁾(i, j);
where d⁽¹⁾(i, j) is the Mahalanobis distance, d⁽²⁾(i, j) is the cosine distance, and λ is a preset weight coefficient. When λ = 1, multi-target matching and tracking relies only on the motion feature; when λ = 0, it relies only on the appearance feature.
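A brief Python sketch of this weighted cost and the subsequent assignment is given below; d1 (Mahalanobis, motion) and d2 (minimum cosine, appearance) are assumed to be precomputed matrices of shape (number of tracks, number of detections), and the SciPy assignment routine stands in for the Hungarian algorithm.

import numpy as np
from scipy.optimize import linear_sum_assignment

def match_targets(d1, d2, lam=0.5):
    # weighted cost c(i, j) = lam * d1(i, j) + (1 - lam) * d2(i, j)
    cost = lam * d1 + (1 - lam) * d2
    track_idx, det_idx = linear_sum_assignment(cost)   # Hungarian assignment
    return list(zip(track_idx, det_idx))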
When a passenger not wearing a mask is detected inside the carriage, the passenger's coordinate point in the image and a preset carriage seat distribution map can be used to determine the passenger's seat, and a voice prompt and alarm can be issued.
Because the positions of the monitoring cameras at the boarding and alighting doors of the bus carriage are fixed, marker lines representing passengers entering the carriage interior through the doors are preset in the door video frames. When the multi-target tracking algorithm finds that the center point of the non-mask-wearing face frame with a certain ID crosses the marker line, the corresponding passenger has entered the carriage through the door; a voice alarm can then be issued, the non-mask-wearing face region of the passenger can be cropped from the video frame, and the cropped image of the passenger not wearing a mask can be displayed on a screen.
When an indication signal that the boarding and alighting doors have closed is detected, video frames from the camera inside the carriage can be obtained, and whether there is a passenger not wearing a mask in the carriage is determined based on the obtained video frames. If so, the coordinate position of that passenger in the video frame is matched against pre-established carriage seat position information to obtain the passenger's real-time position, a voice alarm is issued to remind the passenger at the corresponding position to wear a mask, and the image of the passenger not wearing a mask is displayed on the screen.
In the embodiment of the application, a face detection data set is formed by utilizing video frames on a bus collected under various conditions, a recognition model for recognizing whether a mask is worn on a face in an image is trained based on the face detection data set, the image obtained from the bus is monitored by utilizing the recognition model, and if a passenger who does not wear the mask exists in the image, an alarm is given and a face image of the passenger who does not wear the mask is displayed. Because the face detection data set approximates to the image in the real scene, the accuracy of the recognition model trained by the face detection data set is high, and the scheme does not need to be configured with detection personnel, so that the method is beneficial to reducing the labor cost and the infection probability.
Next, a method for identifying whether a mask is worn on a face in an image provided in an embodiment of the present application will be described. Fig. 13 is a flowchart of a method for identifying whether a face in an image wears a mask according to an embodiment of the present application, including the following steps.
In step S1301, an image to be processed is acquired.
In step S1302, feature extraction is performed on an image to be processed through a backbone network to obtain N initial feature graphs, where the backbone network includes at least two feature extraction layers, each feature extraction layer has a first inlet, a second inlet, a first outlet, and a second outlet, the first outlet is connected to the first inlet of a next feature extraction layer, and the second outlet is connected to the second inlet of the next feature extraction layer.
In specific implementation, each feature extraction layer is used for extracting features of a first global feature entering from a first inlet of the device to obtain a second global feature, extracting features of a first local feature entering from a second inlet of the device to obtain a second local feature, carrying out fusion processing on the second local feature based on the second global feature, determining a third global feature based on the second global feature, determining the third local feature based on the fused second local feature, outputting the third global feature from a first outlet of the device, outputting the third local feature from a second outlet of the device, and N initial feature graphs are N local features selected from the feature extraction layers, wherein N is an integer.
In some embodiments, the second local feature may be converted into a global feature, the second global feature is fused based on the converted global feature, feature extraction is performed on the fused second global feature, the extracted global feature is converted into a local feature, and the second local feature is fused based on the converted local feature, so that fusion processing of the second local feature based on the second global feature is completed.
In some embodiments, the second global feature may be determined to be a third global feature, the second global feature may be fused based on the second local feature, the fused second global feature may be determined to be a third global feature, the second global feature may be fused based on the second local feature, and feature extraction may be performed on the fused second global feature to obtain the third global feature.
In some embodiments, the fused second local feature may be determined as the third local feature, or a convolution operation may be performed on the fused second local feature to obtain the third local feature.
In step S1303, the N initial feature graphs are fused through the feature pyramid network, to obtain N fused feature graphs.
In step S1304, a convolution operation is performed on each fusion feature map by adopting a conventional convolution, the fusion feature map is converted into global features, feature extraction is performed on the global features obtained by conversion, the global features obtained by extraction are converted into local features, and fusion processing is performed on the convolution operation result and the local features obtained by conversion, so as to obtain a target feature map.
In step S1305, based on the N target feature maps, it is recognized whether the face in the image to be processed wears a mask.
In some embodiments, feature extraction is performed on any global feature according to the following steps:
and inputting the global features into the self-attention model for feature association relation analysis, carrying out fusion processing on the output results of the global features and the self-attention model to obtain global reference features, carrying out fusion processing on each feature vector in the global reference features through a multi-layer perceptron, and carrying out fusion processing on the global reference features and the global reference features after the fusion processing to obtain global feature extraction results.
In some embodiments, feature extraction is performed on any local feature according to any combination of the following:
mode one: and carrying out channel lifting processing on the local features through conventional convolution, carrying out convolution operation on the local features after the channel lifting processing by utilizing depth separable convolution and cavity convolution, carrying out fusion processing on each operation result, and carrying out channel descending processing on the local features obtained after the fusion processing through conventional convolution.
Mode two: carrying out channel lifting processing on the local features through conventional convolution, carrying out convolution operation on the local features after channel lifting processing by using depth separable convolution and cavity convolution, carrying out fusion processing on each operation result to obtain local reference features, inputting the local reference features into an attention model to obtain weights of all channels in the local reference features, carrying out multiplication processing on data corresponding to the channels in the local reference features by using the weights of all the channels, and carrying out channel dropping processing on the local reference features after multiplication processing through conventional convolution.
Mode three: the local features are subjected to channel lifting processing through phantom convolution, convolution operation is carried out on the local features subjected to channel lifting processing through depth separable convolution, intermediate features are obtained, the intermediate features are input into an attention model, weights of all channels in the intermediate features are obtained, the weight of each channel is utilized to multiply data corresponding to the channels in the intermediate features, and channel dropping processing is carried out on the multiplied intermediate features through phantom convolution.
In step S1306, the recognition result is output.
When the image to be processed is an in-vehicle image and a face not wearing a mask is present in the in-vehicle image: if the in-vehicle image is acquired from the door position while the door is in an open state, the face not wearing a mask in the in-vehicle image is tracked, and a first alarm is triggered when the tracking result indicates that the corresponding passenger has entered the vehicle, so that alarm information and/or the face image of the person not wearing a mask cropped from the in-vehicle image are output; if the in-vehicle image is acquired from the interior of the vehicle, the in-vehicle position corresponding to the face not wearing a mask is determined based on the position information of that face in the in-vehicle image and the correspondence between positions in the image and positions in the vehicle, a second alarm is triggered based on the determined in-vehicle position, and alarm information and/or the face image of the person not wearing a mask cropped from the in-vehicle image are then output.
Fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the electronic device includes a transceiver 1401 and physical devices such as a processor 1402, where the processor 1402 may be a central processing unit (Central Processing Unit, CPU), a microprocessor, an application specific integrated circuit, a programmable logic circuit, a large-scale integrated circuit, or a digital processing unit. The transceiver 1401 is used for data transmission and reception between the electronic device and other devices.
The electronic device may further comprise a memory 1403 for storing software instructions to be executed by the processor 1402, as well as other data required by the electronic device, such as identification information of the electronic device, encryption information of the electronic device, and user data. The memory 1403 may be a volatile memory, such as a Random-Access Memory (RAM); the memory 1403 may also be a non-volatile memory, such as a Read-Only Memory (ROM), a flash memory, a Hard Disk Drive (HDD) or a Solid State Drive (SSD), or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1403 may also be a combination of the above memories.
The specific connection medium between the processor 1402, the memory 1403 and the transceiver 1401 is not limited in the embodiments of the present application. In Fig. 14, the memory 1403, the processor 1402 and the transceiver 1401 are connected by a bus 1404, which is shown as a bold line by way of example only; the connection manner between other components is also merely schematically illustrated and is not limiting. The bus may be divided into an address bus, a data bus, a control bus and the like. For ease of illustration, only one thick line is shown in Fig. 14, but this does not mean that there is only one bus or only one type of bus.
The processor 1402 may be dedicated hardware or a processor running software, and when the processor 1402 may run software, the processor 1402 reads the software instructions stored in the memory 1403 and performs the method of recognizing whether the face in the image is wearing the mask as referred to in the foregoing embodiment under the driving of the software instructions.
The embodiment of the application also provides a storage medium, and when the instructions in the storage medium are executed by a processor of the electronic device, the electronic device can execute the method for identifying whether the face in the image is wearing the mask or not.
In some possible embodiments, various aspects of the method for identifying whether a face in an image wears a mask provided in the present application may also be implemented in a form of a program product, where the program product includes program code, where the program code is configured to cause an electronic device to perform the method for identifying whether a face in an image is wearing a mask as referred to in the foregoing embodiments when the program product is run on the electronic device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, a RAM, a ROM, an erasable programmable read-Only Memory (EPROM), flash Memory, optical fiber, compact disc read-Only Memory (Compact Disk Read Only Memory, CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for image processing in embodiments of the present application may take the form of a CD-ROM and include program code that can run on a computing device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, radio Frequency (RF), etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In cases involving remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, such as a local area network (Local Area Network, LAN) or wide area network (Wide Area Network, WAN), or may be connected to an external computing device (e.g., connected over the internet using an internet service provider).
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (10)

1. An electronic device, comprising:
the communication unit is used for acquiring an image to be processed;
The processor is used for carrying out feature extraction on the image to be processed through a backbone network to obtain N initial feature images, the backbone network comprises at least two feature extraction layers, each feature extraction layer is provided with a first inlet, a second inlet, a first outlet and a second outlet, the first outlet is connected with the first inlet of the next feature extraction layer, and the second outlet is connected with the second inlet of the next feature extraction layer; each feature extraction layer is used for carrying out feature extraction on a first global feature entering from a first inlet of the device to obtain a second global feature, carrying out feature extraction on a first local feature entering from a second inlet of the device to obtain a second local feature, carrying out fusion processing on the second local feature based on the second global feature, determining a third global feature based on the second global feature, determining the third local feature based on the fused second local feature, outputting the third global feature from a first outlet of the device, outputting the third local feature from a second outlet of the device, and N initial feature graphs are N local features selected from each feature extraction layer, wherein N is an integer;
based on the N initial feature images, identifying whether a face in the image to be processed wears a mask or not;
And the output unit is used for outputting the identification result.
2. The electronic device of claim 1, wherein the processor is specifically configured to, in performing a fusion process on the second local feature based on the second global feature:
converting the second local feature into a global feature;
performing fusion processing on the second global features based on the global features obtained through conversion;
extracting the features of the second global features after the fusion treatment;
converting the global features obtained by extraction into local features;
and carrying out fusion processing on the second local feature based on the local feature obtained by conversion.
3. The electronic device of claim 1, wherein when identifying whether a face in the image to be processed is wearing a mask based on the N initial feature maps, the processor is specifically configured to:
the N initial feature images are fused through a feature pyramid network to obtain N fused feature images;
performing convolution operation on each fusion feature map by adopting conventional convolution, converting the fusion feature map into global features, extracting features of the global features obtained by conversion, converting the global features obtained by extraction into local features, and performing fusion processing on convolution operation results and the local features obtained by conversion to obtain a target feature map;
And based on the N target feature images, identifying whether the face in the image to be processed wears the mask.
4. An electronic device as claimed in claim 1 or 3, wherein the processor is specifically configured to perform feature extraction on any global feature according to the following steps:
inputting the global features into a self-attention model for feature dependency analysis;
carrying out fusion processing on the global features and the output results of the self-attention model to obtain global reference features;
carrying out fusion processing on each feature vector in the global reference features through a multi-layer perceptron;
and carrying out fusion processing on the global reference feature and the fusion processed global reference feature to obtain a global feature extraction result.
5. The electronic device of claim 1, wherein the processor is specifically configured to perform feature extraction on the local features according to the steps of:
carrying out channel lifting processing on the local features through conventional convolution, carrying out convolution operation on the local features after the channel lifting processing by utilizing depth separable convolution and cavity convolution, carrying out fusion processing on each operation result, and carrying out channel descending processing on the local features obtained after the fusion processing through conventional convolution;
Carrying out channel lifting processing on the local features obtained after the channel dropping processing by conventional convolution, carrying out convolution operation on the local features after channel lifting processing by depth separable convolution and cavity convolution respectively, carrying out fusion processing on each operation result to obtain local reference features, inputting the local reference features into an attention model to obtain weights of all channels in the local reference features, carrying out multiplication processing on data corresponding to the channels in the local reference features by using the weights of all the channels, and carrying out channel dropping processing on the local reference features after multiplication processing by conventional convolution to obtain local feature extraction results.
6. The electronic device of claim 1, wherein the processor is specifically configured to perform feature extraction on the local features according to the steps of:
carrying out channel lifting processing on the local features through conventional convolution, carrying out convolution operation on the local features after channel lifting processing by using depth separable convolution and cavity convolution, carrying out fusion processing on each operation result to obtain local reference features, inputting the local reference features into an attention model to obtain weights of channels in the local reference features, carrying out multiplication processing on data corresponding to the channels in the local reference features by using the weights of each channel, and carrying out channel dropping processing on the local reference features after multiplication processing through conventional convolution;
Carrying out channel lifting processing on the local features obtained after the channel dropping processing through phantom convolution, carrying out convolution operation on the local features after channel lifting processing through depth separable convolution to obtain intermediate features, inputting the intermediate features into an attention model to obtain weights of all channels in the intermediate features, carrying out multiplication processing on data corresponding to the channels in the intermediate features by using the weights of all the channels, and carrying out channel dropping processing on the intermediate features after multiplication processing through phantom convolution to obtain local feature extraction results.
7. The electronic device of claim 1, wherein in determining a third global feature based on the second global feature, the processor is specifically to:
determining the second global feature as the third global feature; or,
performing fusion processing on the second global feature based on the second local feature, and determining the fused second global feature as the third global feature; or,
and carrying out fusion processing on the second global features based on the second local features, and carrying out feature extraction on the fused second global features to obtain the third global features.
8. The electronic device of claim 1, wherein in determining the third local feature based on the fused second local feature, the processor is specifically configured to:
determining the fused second local feature as the third local feature; or,
and carrying out convolution operation on the fused second local features to obtain the third local features.
9. The electronic device of claim 1, wherein the image to be processed is an in-vehicle image, and wherein when there is a face in the in-vehicle image that is not wearing a mask, the processor is further configured to:
if the in-car image is acquired from the position of the car door when the car door is in an open state, tracking the face of the person not wearing the mask in the in-car image, and triggering a first alarm when the tracking result shows that the corresponding passenger enters the interior of the car;
if the in-vehicle image is acquired from the inside of the vehicle, determining the in-vehicle position corresponding to the face of the mask not worn in the in-vehicle image based on the position information of the face of the mask not worn in the in-vehicle image and the established corresponding relation between the position in the image and the in-vehicle position, and triggering a second alarm based on the determined in-vehicle position.
10. An image processing method, comprising:
the method comprises the steps that feature extraction is carried out on an image to be processed through a backbone network to obtain N initial feature images, the backbone network comprises at least two feature extraction layers, each feature extraction layer is provided with a first inlet, a second inlet, a first outlet and a second outlet, the first outlet is connected with the first inlet of the next feature extraction layer, and the second outlet is connected with the second inlet of the next feature extraction layer; each feature extraction layer is used for carrying out feature extraction on a first global feature entering from a first inlet of the device to obtain a second global feature, carrying out feature extraction on a first local feature entering from a second inlet of the device to obtain a second local feature, carrying out fusion processing on the second local feature based on the second global feature, determining a third global feature based on the second global feature, determining the third local feature based on the fused second local feature, outputting the third global feature from a first outlet of the device, outputting the third local feature from a second outlet of the device, and N initial feature graphs are N local features selected from each feature extraction layer, wherein N is an integer;
Based on the N initial feature images, identifying whether a face in the image to be processed wears a mask or not;
and outputting the identification result.
CN202111231621.7A 2021-10-22 2021-10-22 Electronic equipment and method for identifying whether face in image wears mask Pending CN116030507A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111231621.7A CN116030507A (en) 2021-10-22 2021-10-22 Electronic equipment and method for identifying whether face in image wears mask

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111231621.7A CN116030507A (en) 2021-10-22 2021-10-22 Electronic equipment and method for identifying whether face in image wears mask

Publications (1)

Publication Number Publication Date
CN116030507A true CN116030507A (en) 2023-04-28

Family

ID=86074592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111231621.7A Pending CN116030507A (en) 2021-10-22 2021-10-22 Electronic equipment and method for identifying whether face in image wears mask

Country Status (1)

Country Link
CN (1) CN116030507A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116524573A (en) * 2023-05-19 2023-08-01 北京弘治锐龙教育科技有限公司 Abnormal article and mask detection system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination