WO2023221608A1 - Training method, apparatus, device and storage medium for mask recognition model - Google Patents

Training method, apparatus, device and storage medium for mask recognition model

Info

Publication number
WO2023221608A1
WO2023221608A1 (PCT/CN2023/080248, CN2023080248W)
Authority
WO
WIPO (PCT)
Prior art keywords
model
training
picture
network
image
Prior art date
Application number
PCT/CN2023/080248
Other languages
English (en)
French (fr)
Inventor
孟海秀
万业聪
陈录城
施森闽
郑旭东
Original Assignee
卡奥斯工业智能研究院(青岛)有限公司
卡奥斯物联科技股份有限公司
海尔数字科技(青岛)有限公司
Priority date
Filing date
Publication date
Application filed by 卡奥斯工业智能研究院(青岛)有限公司, 卡奥斯物联科技股份有限公司, 海尔数字科技(青岛)有限公司
Publication of WO2023221608A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G06V40/172 Classification, e.g. identification

Definitions

  • This application relates to the field of new generation information technology, and in particular to a training method, device, equipment and storage medium for a mask recognition model.
  • MTCNN Multi-task convolutional neural network
  • ROI Region of interest
  • SVM Support Vector Machine
  • This application provides a training method, device, equipment and storage medium for a mask recognition model, which are used to improve the accuracy of recognizing small targets that are far away, and at the same time to solve the problem of slow running speed and a long recognition cycle caused by a large number of model parameters.
  • In a first aspect, embodiments of the present application provide a training method for a mask recognition model, including:
  • cutting each picture in a first training set into multiple image blocks and labeling the image blocks to obtain a second training set, where the first training set is a first face mask data set, and the label of each image block in the second training set includes the position information of the image block in the picture and the category information of whether the face in the image block is wearing a mask;
  • cutting each picture in a third training set into multiple image blocks and labeling the image blocks to obtain a fourth training set, where the third training set is a second face mask data set, and the label of each image block in the fourth training set includes the position information of the image block in the picture and the category information of whether the face in the image block is wearing a mask;
  • pre-training a first preset model using the second training set to obtain a pre-trained model, where the first preset model includes a backbone network and a remaining part of the network, and the parameters of the backbone network of the first preset model are frozen during the pre-training process;
  • performing formal training using the fourth training set and a second preset model to obtain a mask recognition model, where the second preset model includes a backbone network and a remaining part of the network, the parameters of the backbone network of the second preset model are the parameters of the backbone network of the pre-trained model, the initial parameters of the remaining part of the network of the second preset model are the parameters of the remaining part of the network of the pre-trained model, and the parameters of the backbone network of the second preset model are frozen during the formal training process.
  • In a possible implementation, the first preset model adopts the YOLOV4-tiny network model, and using the second training set to pre-train the first preset model to obtain a pre-trained model includes:
  • the first number of image blocks in the second training set are input into the first preset model for training, and the training results are obtained;
  • parameters of the remaining part of the network of the first preset model are updated.
  • In a possible implementation, the second preset model adopts the YOLOV4-tiny network model, and using the fourth training set and the second preset model for formal training to obtain a mask recognition model includes:
  • the first number of image blocks in the fourth training set are input into the second preset model for training, and the training results are obtained;
  • parameters of the remainder of the network of the second preset model are updated.
  • In a possible implementation, cutting each picture in the first training set and the third training set into multiple image blocks includes:
  • starting from any corner of the picture, sliding an image frame by preset pixels and cutting out the image within the image frame to form an image block, where the preset pixels are smaller than the length and width of the image frame;
  • obtaining the coordinates of the frame of the image block according to the coordinates of the picture frame marked on the picture and the size of the sliding image frame.
  • In a possible implementation, before cutting each picture in the first training set and the third training set into multiple image blocks, the method also includes:
  • performing data enhancement processing on the pictures in the first training set and the third training set, where the data enhancement processing includes one or more of the following processes: randomly adjusting the picture size, randomly adjusting the picture contrast, randomly adjusting the picture hue, randomly adjusting the picture brightness, randomly adding noise to the picture, randomly changing the color model of the picture, and randomly cropping the picture.
  • In a possible implementation, the size of the image block is 416*416 pixels, and the backbone network includes 6 series-connected cross-stage partial (CSP) networks, each of which is used to extract features from the input image;
  • the target CSP networks in the backbone network are connected to the input of the CAT module, the output of the CAT module is connected to the remaining part of the network, the feature maps extracted by the target CSP networks are 26*26 pixels and 13*13 pixels, and the CAT module is used to concatenate the feature maps extracted by the target CSP networks.
  • the proportion of face targets in some pictures in the first face mask data set is greater than a preset threshold, and the proportion of face targets in the remaining pictures is less than or equal to the preset threshold;
  • the proportion of the face targets included in each picture in the second face mask data set is greater than the preset threshold.
  • In a second aspect, an embodiment of the present application provides a mask recognition method, applied to the mask recognition model trained by the method described in the first aspect.
  • the mask recognition method includes:
  • cutting a picture to be recognized into multiple image blocks, and inputting the multiple image blocks into the mask recognition model to obtain a first recognition result of each image block, where the first recognition result is used to indicate whether the face in the image block is wearing a mask;
  • when the same face target in the picture to be recognized exists in different image blocks, the confidence of each image block where the face target is located is calculated based on the position information of each image block where the face target is located in the picture to be recognized and the face target detection frame in each image block, and the first recognition result of the image block with the highest confidence is selected as the recognition result of the face target;
  • calculating the confidence of each image block where the face target is located includes: restoring each image block to the picture to be recognized according to its position information; calculating, for each image block, the ratio of the face target detection frame in the image block to the face target detection frame in the picture to be recognized; and using the calculated ratio as the confidence of the image block.
  • In a third aspect, an embodiment of the present application provides a training device for a mask recognition model, including:
  • a first cutting module, used to cut each picture in the first training set into multiple image blocks and label the image blocks to obtain a second training set, where the first training set is the first face mask data set, and the label of each image block in the second training set includes the position information of the image block in the picture and the category information of whether the face in the image block is wearing a mask;
  • a second cutting module, used to cut each picture in the third training set into multiple image blocks and label the image blocks to obtain a fourth training set, where the third training set is the second face mask data set, and the label of each image block in the fourth training set includes the position information of the image block in the picture and the category information of whether the face in the image block is wearing a mask;
  • a pre-training module, used to pre-train the first preset model using the second training set to obtain a pre-trained model, where the first preset model includes a backbone network and a remaining part of the network, and the parameters of the backbone network of the first preset model are frozen during the pre-training process;
  • a formal training module, used to conduct formal training using the fourth training set and the second preset model to obtain a mask recognition model, where the second preset model includes a backbone network and a remaining part of the network, the parameters of the backbone network of the second preset model are the parameters of the backbone network of the pre-trained model, the initial parameters of the remaining part of the network of the second preset model are the parameters of the remaining part of the network of the pre-trained model, and the parameters of the backbone network of the second preset model are frozen during the formal training process.
  • In a possible implementation, the first preset model uses the YOLOV4-tiny network model, and the pre-training module is specifically used for:
  • the first number of image blocks in the second training set are input into the first preset model for training, and the training results are obtained;
  • parameters of the remaining part of the network of the first preset model are updated.
  • In a possible implementation, the second preset model uses the YOLOV4-tiny network model, and the formal training module is specifically used for:
  • the first number of image blocks in the fourth training set are input into the second preset model for training, and the training results are obtained;
  • parameters of the remainder of the network of the second preset model are updated.
  • first cutting module and the second cutting module are specifically used for:
  • the coordinates of the frame of the image block are obtained according to the coordinates of the picture frame marked on the picture and the size of the sliding image frame.
  • the first cutting module and the second cutting module also include:
  • the enhancement unit is used to perform data enhancement processing on the images in the first training set and the third training set.
  • the data enhancement processing includes one or more of the following processes: randomly adjusting the picture size, randomly adjusting the picture contrast, randomly adjusting the picture hue, randomly adjusting the picture brightness, randomly adding noise to the picture, randomly changing the color model of the picture, and randomly cropping the picture.
  • the size of the image block in the first cutting module or the second cutting module is 416*416 pixels
  • the backbone network in the pre-training module or the formal training module includes 6 series-connected cross-stage partial CSP networks, each of which is used to extract features from the input image;
  • the target CSP network in the backbone network is connected to the input of the CAT module.
  • the output of the CAT module is connected to the remaining network.
  • the feature maps extracted by the target CSP networks are 26*26 pixels and 13*13 pixels, and the CAT module is used to concatenate the feature maps extracted by the target CSP networks.
  • the proportion of face targets in some pictures in the first face mask data set is greater than a preset threshold, and the proportion of face targets in the remaining pictures is less than or equal to the preset threshold;
  • the proportion of the face targets included in each picture in the second face mask data set is greater than the preset threshold.
  • In a fourth aspect, an embodiment of the present application provides a mask recognition device, including:
  • a cutting module used to cut the picture to be recognized into multiple image blocks and determine the position information of each image block in the picture to be recognized, where the picture to be recognized contains at least one face target;
  • the input module is used to input the multiple image blocks into the mask recognition model to obtain the first recognition result of each image block, where the first recognition result is used to indicate whether the face in the image block wears a mask;
  • the calculation module is used to, when the same face target in the picture to be recognized exists in different image blocks, calculate the confidence of each image block where the face target is located based on the position information of each image block where the face target is located in the picture to be recognized and the face target detection frame in each image block, and select the first recognition result of the image block with the highest confidence as the recognition result of the face target;
  • the output module is used to output the recognition results of the face target in the image to be recognized.
  • the calculation module is specifically used for: restoring each image block to the picture to be recognized according to the position information of each image block where the face target is located in the picture to be recognized; calculating, for each image block, the ratio of the face target detection frame in the image block to the face target detection frame in the picture to be recognized; and using the calculated ratio as the confidence of the image block.
  • In a fifth aspect, an embodiment of the present application provides an electronic device for mask recognition, including at least one processor and a memory;
  • the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute the training method of the mask recognition model provided by the first aspect of this application.
  • In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium.
  • computer-executable instructions are stored in the computer-readable storage medium; when the computer-executable instructions are executed by a processor, they are used to implement the training method of the mask recognition model provided in the first aspect of this application.
  • This application provides a training method, device, equipment and storage medium for a mask recognition model. The amount of model parameters is reduced, thereby speeding up the running speed, shortening the recognition cycle and improving the accuracy of recognition of small targets that are far away.
  • Figure 1 is a schematic structural diagram of the YOLOV4-tiny network model of this application.
  • Figure 2 is a schematic structural diagram of the CSP network in the YOLOV4-tiny network model of this application;
  • Figure 3 is a flow chart of a training method for a mask recognition model provided in Embodiment 1 of the present application;
  • Figure 4 is a schematic diagram of image cutting
  • Figure 5 is a schematic diagram of image block position information
  • Figure 6 is a flow chart of a pre-training method for a mask recognition model provided in Embodiment 2 of the present application;
  • Figure 7 is a flow chart of a formal training method for a mask recognition model provided in Embodiment 3 of the present application.
  • Figure 8 is a flow chart of a mask identification method provided in Embodiment 4 of the present application.
  • Figure 9 is a schematic structural diagram of a training device for a mask recognition model provided in Embodiment 5 of the present application.
  • Figure 10 is a schematic structural diagram of a mask identification device provided in Embodiment 6 of the present application.
  • Figure 11 is a schematic structural diagram of a mask identification device provided in Embodiment 7 of the present application.
  • the embodiment of the present application provides a training method and a method of using a mask recognition model.
  • the mask recognition model is used to identify whether a face is wearing a mask.
  • the mask recognition model can adopt the YOLOV4-tiny network model.
  • Figure 1 is a schematic structural diagram of the YOLOV4-tiny network model of the present application.
  • the YOLOV4-tiny network model mainly consists of the backbone network 107, the Concat connection module 108 and the remaining network 109.
  • the backbone network 107 consists of six cross-stage partial networks (CSP networks) connected in series, shown as 101 to 106 in the figure; each CSP network is used to extract features of the input image blocks.
  • the feature maps extracted by the CSP networks have different sizes; along the series order, the size of the feature maps extracted by the CSP networks gradually decreases.
  • the last two stages of CSP networks are connected to the input ends of the Concat connection module 108, and the target feature maps extracted by these two CSP networks enter the remaining part network 109 through the output end of the Concat connection module 108.
  • the remaining part network 109 is a convolution layer that performs convolution processing on the input target feature map.
  • the feature maps extracted in sequence by the six series-connected cross-stage partial CSP networks of the backbone network are 208*208 pixels, 104*104 pixels, 52*52 pixels, 26*26 pixels and 13*13 pixels; the feature maps of 26*26 pixels and 13*13 pixels are input into the remaining part network through the Concat connection module, and after the remaining part network processes them, the recognition results of the image blocks are finally output.
  • This application uses the YOLOV4-tiny network model to identify whether the face target in the image to be recognized is wearing a mask.
  • This network model improves the running speed of the model by using multiple CSP networks with a skip connection structure.
  • FIG. 2 is a schematic structural diagram of the CSP network in the YOLOV4-tiny network model of this application.
  • the CSP network consists of an input module 201, an output module 206, convolutional layers 202, 203, 204, 205, and CAT modules 25 and 26.
  • the convolution kernel size of the convolution layer 202, the convolution layer 203 and the convolution layer 204 is 3×3;
  • the convolution kernel size of the convolution layer 205 is 1×1.
  • the function of CAT modules 25 and 26 is to connect two arrays without changing the characteristics of the arrays.
  • the CSP network contains two skip connection structures, shown as 207 and 208 in the figure. Their function is to divide the input feature map into two paths along the channel dimension; after the input feature map passes through the skip connection structure, convolution calculation only needs to be performed on the feature map of one path, thereby reducing the calculation amount of the model and speeding up its running speed.
  • the feature map output by the convolution layer 202 is divided into path 21 and path 22 on the channel when passing through the skip connection structure 207.
  • the feature map output through path 21 is directly concatenated with the feature map output by the convolution layer 205 in the CAT module 26, and the result enters the output module 206.
  • the feature map output through path 22 enters the convolution layer 203.
  • the output feature map is divided into path 23 and path 24 along the channel when passing through the skip connection structure 208; the feature map output through path 24 enters the convolution layer 204, and the feature map output through path 23 and the feature map output by the convolution layer 204 are concatenated in the CAT module 25 and enter the convolution layer 205.
  • for example, the input module 201 inputs a feature map of 64@104*104 pixels, which enters the first convolution layer 202, where convolution calculation is performed on the feature map through the convolution function Conv3×3, and the feature map is then output.
  • when the output feature map passes through the skip connection structure 207, it is divided in two along the channel dimension so that the number of channels is halved. One part, a feature map of 32@104*104 pixels, enters the CAT module 26 through path 21; the other part, a feature map of 32@104*104 pixels, enters the convolution layer 203 through path 22, where convolution calculation is performed through the convolution function Conv3×3 and the resulting feature map is output.
  • when this feature map passes through the skip connection structure 208, it is again divided in two along the channel dimension so that the number of channels is halved. One part, a feature map of 16@104*104 pixels, enters the CAT module 25 through path 23; the other part, a feature map of 16@104*104 pixels, enters the convolution layer 204 through path 24, where convolution calculation is performed through the convolution function Conv3×3 and the resulting feature map is output.
  • the output feature map enters the CAT module 25 and is combined with the feature map input through the path 23.
  • the combined feature map is then input to the convolution layer 205 and subjected to convolution calculation through the convolution function Conv1×1.
  • the feature map is input into the CAT module 26, and is combined with the feature map input through the path 21 in the CAT module 26, and then the feature map is output through the output module 206.
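  • The data flow of this CSP block can be summarized in a short, PyTorch-style sketch. This is a minimal illustration written against the 64@104*104 example above; the class and attribute names are ours, and the channel split and layer sizes of the actual YOLOV4-tiny implementation may differ in detail.

```python
import torch
import torch.nn as nn

class CSPBlock(nn.Module):
    """Sketch of the CSP block of Figure 2: a 3x3 convolution, two channel
    splits (skip connection structures 207 and 208), two concatenations
    (CAT modules 25 and 26) and a final 1x1 convolution."""

    def __init__(self, channels=64):
        super().__init__()
        self.conv202 = nn.Conv2d(channels, channels, 3, padding=1)            # Conv3x3
        self.conv203 = nn.Conv2d(channels // 2, channels // 2, 3, padding=1)  # Conv3x3
        self.conv204 = nn.Conv2d(channels // 4, channels // 4, 3, padding=1)  # Conv3x3
        self.conv205 = nn.Conv2d(channels // 2, channels // 2, 1)             # Conv1x1

    def forward(self, x):
        x = self.conv202(x)
        path21, path22 = x.chunk(2, dim=1)     # skip connection structure 207
        y = self.conv203(path22)
        path23, path24 = y.chunk(2, dim=1)     # skip connection structure 208
        z = self.conv204(path24)
        z = torch.cat([path23, z], dim=1)      # CAT module 25
        z = self.conv205(z)
        return torch.cat([path21, z], dim=1)   # CAT module 26

feat = torch.randn(1, 64, 104, 104)            # the 64@104*104 example above
print(CSPBlock(64)(feat).shape)                # torch.Size([1, 64, 104, 104])
```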
  • Figure 3 is a flow chart of a training method for a mask recognition model provided in Embodiment 1 of the present application. As shown in Figure 3, the training method of the mask recognition model includes the following steps.
  • S301 Cut each picture in the first training set into multiple image blocks, and label the image blocks to obtain a second training set.
  • the first training set is the first face mask data set.
  • the proportion of the face targets in some pictures in the first face mask data set is greater than the preset threshold, and the proportion of the face targets in the remaining pictures is less than or equal to the preset threshold.
  • the first face mask data set can be the face mask data set (FaceMask_CelebA), or one or more of the real masked face recognition data set, the simulated masked face recognition data set and the real masked face verification data set in the masked face data set (Real-World Masked Face Dataset, RMFD for short).
  • the data set contains a large number of pictures and labels of the pictures.
  • the labels are category information indicating whether the face is wearing a mask. For example, when the face in the picture wears a mask, its category information is 1, and when the face does not wear a mask, its category information is 0. The category information can also be expressed in other forms, which is not limited in this embodiment.
  • Each picture in the first training set is cut into multiple image blocks.
  • the size of the image blocks may be the same or different, which is not limited in this embodiment.
  • the label of each cut image block includes the position information of the image block in the picture and the category information of whether the face in the image block wears a mask.
  • the current recognition method of mask wearing has poor detection effect on small targets that are far away.
  • this application therefore adopts the method of first enlarging and then detecting, that is, cutting each picture in the training set into multiple image blocks and then detecting the image blocks.
  • Figure 4 is a schematic diagram of image cutting.
  • an image frame 40 of a certain size is slid over the picture in sequence by preset pixels, and the area within each image frame is cut out to obtain an image block, where the preset pixel size is smaller than the length and the width of the image frame.
  • the size of the image frame 40 may be 416*416, and the preset pixel size may be 300.
  • for example, an image frame of a fixed size is slid sequentially from left to right by the preset pixels, and the image within the image frame is cut out each time to form multiple image blocks.
  • when the image frame slides to the rightmost end, the image frame is slid down one step according to the sliding step size and then slid from left to right or from right to left according to the preset pixels; the cutting of the image blocks is completed in the above way.
  • the sliding step size is the size of multiple pixels, and the sliding step size and the preset pixels may be the same or different.
  • alternatively, when the image frame slides to the bottom, the image frame is slid one step to the right according to the sliding step size and then slid from top to bottom or from bottom to top according to the preset pixels; the cutting of the image blocks is completed in the above way.
  • the sliding step size is the size of multiple pixels, and the sliding step size and the preset pixels may be the same or different.
  • the labels of the image blocks in the second training set include the position information of the image block in the picture and the category information of whether the face in the image block wears a mask.
  • the category information of an image block is the same as the category information of the picture before cutting; that is, if the category information of the face in the picture before cutting is wearing a mask, then the category information of the face in each image block after cutting is also wearing a mask. After the cutting is completed, some image blocks may contain no face, so the category information of an image block without a face is not wearing a mask.
  • the position information of the image block can be the position information of the cutting point of the image block in the picture, which can be obtained according to the coordinates of the picture before cutting.
  • Figure 5 is a schematic diagram of image block locations. As shown in Figure 5, point A (0, 0) of the picture is defined as the coordinate origin, and point B and point C are the position information of the first image block and the second image block in the picture respectively. It can be understood that the position of the image block here is not the physical coordinate of the image block, but the pixel position of the image block. For example, when the size of the cutting box is 416*416 and the sliding pixel is 300, the position information of the first image block is B (416, 416), and the position information of the second image block is C (716, 416) . By analogy, by performing a translation operation based on the coordinates of the image, the size of the image frame during cutting, and the sliding pixels, the position information of each image frame in the image can be obtained.
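  • As a concrete illustration of the cutting and position bookkeeping above, here is a minimal sliding-window sketch in Python. The 416*416 window and 300-pixel stride follow the example; the function and variable names are ours, and edge handling (blocks that would run past the picture border) is deliberately omitted.

```python
import numpy as np

def cut_into_blocks(image, block_size=416, stride=300):
    """Slide a block_size x block_size window over the image with the given
    stride and return (block, position) pairs. The position is recorded as
    the bottom-right corner of the window, so the first two blocks of a row
    get positions (416, 416) and (716, 416) as in the Figure 5 example.
    Blocks that would run past the right or bottom border are skipped."""
    h, w = image.shape[:2]
    blocks = []
    for top in range(0, h - block_size + 1, stride):
        for left in range(0, w - block_size + 1, stride):
            block = image[top:top + block_size, left:left + block_size]
            position = (left + block_size, top + block_size)
            blocks.append((block, position))
    return blocks

frame = np.zeros((1080, 1920, 3), dtype=np.uint8)   # a synthetic picture
blocks = cut_into_blocks(frame)
print(len(blocks), blocks[0][1], blocks[1][1])      # 18 (416, 416) (716, 416)
```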
  • optionally, before the cutting, data enhancement processing is performed on the pictures in the first training set. The data enhancement processing method can be one or more of the following processing methods: randomly adjusting the picture size, randomly adjusting the picture contrast, randomly adjusting the picture hue, randomly adjusting the picture brightness, randomly adding noise to the picture, randomly changing the color model of the picture, and randomly cropping the picture.
  • among them, randomly adjusting the brightness and contrast of a picture changes its quality, so that the pictures reflect the differences in imaging quality caused by environmental factors such as air quality in real shooting scenes; randomly cropping a picture changes the position of the face target in the picture, so that the pictures reflect the depth-of-field changes of the foreground and background caused by changes in the position of the face target in a real scene. It can be seen that, by enhancing the pictures in the data set, the pictures in the final training set cover the various conditions that may affect imaging quality under normal viewing conditions.
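  • A minimal augmentation pipeline along these lines could be sketched with torchvision as below; the specific transforms and parameter ranges are illustrative assumptions, not values given in this application.

```python
import random
import numpy as np
import torchvision.transforms as T
from PIL import Image

# Illustrative augmentation pipeline; the parameter ranges are assumptions.
augment = T.Compose([
    T.RandomResizedCrop(512, scale=(0.6, 1.0)),           # random size / random crop
    T.ColorJitter(brightness=0.4, contrast=0.4, hue=0.1),  # brightness, contrast, hue
    T.RandomGrayscale(p=0.1),                              # random color-model change
])

def add_noise(img, sigma=8.0):
    """Randomly add Gaussian noise to a PIL image."""
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

img = Image.open("example.jpg")        # hypothetical training picture
img = augment(img)
if random.random() < 0.5:
    img = add_noise(img)
```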
  • S302 Cut each picture in the third training set into multiple image blocks, and label the image blocks to obtain a fourth training set.
  • the third training set is the second face mask data set. Specifically, the proportion of the face targets included in each picture of the second face mask data set is greater than the preset threshold.
  • the pictures in this data set are obtained by staff through daily shooting. Compared with the pictures in the first face mask data set, the pictures in this data set are closer to pictures taken under a surveillance camera, that is, the face targets in the pictures are closer to small targets. By training on the images in this non-public face training set, this application can improve the accuracy of small target detection.
  • the data set contains a large number of pictures and labels of the pictures.
  • the labels are category information indicating whether the face is wearing a mask. For example, when the face in the picture wears a mask, its category information is 1, and when the face does not wear a mask, its category information is 0. The category information can also be expressed in other forms, which is not limited in this embodiment.
  • the cut image blocks are labeled to obtain the fourth training set.
  • the labels of the image blocks in the fourth training set include the position information of the image block in the picture and the category information of whether the face in the image block wears a mask.
  • the category information of an image block is the same as the category information of the picture before cutting; that is, if the category information of the face in the picture before cutting is wearing a mask, then the category information of the face in each image block after cutting is also wearing a mask. After the cutting is completed, some image blocks may contain no face, so the category information of an image block without a face is not wearing a mask.
  • the method of cutting the pictures in the third training set and the method of labeling the image block position information are the same as in step S301, and will not be described again here.
  • similar to step S301, before cutting each picture in the third training set into multiple image blocks, data enhancement processing needs to be performed on each picture in the third training set.
  • the enhancement processing method and its effect are the same as in step S301 and will not be repeated here.
  • S303 Use the second training set to pre-train the first preset model to obtain a pre-trained model.
  • the first preset model uses the YOLOV4-tiny network model.
  • the YOLOV4-tiny network model can adopt the structure shown in Figure 1. It can be understood that the network structure of the YOLOV4-tiny network model can be varied; for example, the number of CSP networks may be greater or smaller than in the network model shown in Figure 1.
  • a certain number of image blocks in the second training set are input into the first preset model each time for model pre-training. For example, the number of image blocks input each time can be 16, and accordingly, 16 output results will be obtained.
  • each output result is used to indicate whether the face in the corresponding image block is wearing a mask.
  • the YOLO loss value is obtained based on the output results of the model training and the labels of the image blocks. Backpropagation is performed through the YOLO loss value, and the model parameters are iteratively updated using the gradient descent method until the end of the iteration to obtain the pre-trained model.
  • the first preset model includes a backbone network and the remaining part of the network, where the parameters of the backbone network of the first preset model are frozen during the pre-training process.
  • freezing the parameters of the backbone network means that, during the iterative training process, the parameters of the backbone network are not updated and remain unchanged, and only the parameters of the remaining part of the network are updated. Freezing the parameters of the backbone network can speed up the training process and reduce training time.
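  • As a concrete illustration, freezing the backbone and handing only the remaining part of the network to the optimizer could look like the following PyTorch sketch; TinyYoloLike is a stand-in we define here, not the actual YOLOV4-tiny implementation.

```python
import torch
import torch.nn as nn

class TinyYoloLike(nn.Module):
    """Stand-in for a YOLOV4-tiny-style model: a backbone plus a remaining part."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU())
        self.remaining = nn.Sequential(nn.Conv2d(32, 18, 1))   # detection-head stand-in

    def forward(self, x):
        return self.remaining(self.backbone(x))

model = TinyYoloLike()

# Freeze the backbone: its parameters are not updated during pre-training.
for p in model.backbone.parameters():
    p.requires_grad = False

# Only the remaining part of the network is handed to the optimizer.
optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9
)
```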
  • S304 Use the fourth training set and the second preset model for formal training to obtain a mask recognition model.
  • the second preset model uses the YOLOV4-tiny network model, and the YOLOV4-tiny network model can adopt the structure shown in Figure 1.
  • a certain number of image blocks in the fourth training set are input into the second preset model each time for formal training of the model.
  • For example, the number of image blocks input each time can be 16, and accordingly, 16 output results will be obtained.
  • the output result is used to indicate whether the face in the image block is wearing a mask.
  • the YOLO loss value is obtained based on the output results of the model training and the labels of the image blocks. Back propagation is performed through the YOLO loss value, and the model parameters are iteratively updated using the gradient descent method until the mask recognition model is obtained at the end of the iteration.
  • the second preset model includes a backbone network and a remaining network.
  • the parameters of the backbone network of the second preset model are the parameters of the backbone network of the pre-trained model.
  • the initial parameters of the remaining part of the network of the second preset model are the parameters of the remaining part of the network of the pre-trained model.
  • through pre-training, the parameters of the remaining part of the network can be set to a better state, which speeds up the convergence of the model in formal training.
  • during formal training, the parameters of the backbone network are frozen, and only the parameters of the remaining part of the network are updated.
  • In this embodiment, the second training set and the fourth training set are obtained respectively, the image blocks in the second training set are input into the first preset model for pre-training to obtain a pre-trained model, and on the basis of the pre-trained model, the image blocks in the fourth training set are input into the second preset model for formal training to obtain the mask recognition model.
  • This method improves the accuracy of recognition of small targets that are far away by cutting the picture into multiple image blocks and training and identifying small image blocks.
  • Figure 6 is a flow chart of a pre-training method for a mask recognition model provided in Embodiment 2 of the present application. This embodiment is a detailed description of step S303 in Embodiment 1. As shown in Figure 6, the pre-training method of the mask recognition model includes the following steps.
  • S601 Load the parameters trained on the ImageNet data set to the backbone network of the first preset model.
  • for example, the values of parameters A, B and C can be obtained, and these parameter values are filled into the corresponding locations of the backbone network.
  • after each parameter of step S601 is filled into the corresponding position of the backbone network in turn, the backbone network of the first preset model is frozen; freezing the backbone network means that the parameters of the backbone network part will not be changed during pre-training.
  • the first number may be 16 image patches.
  • the first number of image blocks is input each time until all the image blocks in the second training set have been input into the YOLOV4-tiny network model.
  • the model is trained based on the first number of image blocks input each time, and each training result is obtained and output.
  • S604 Determine the YOLO loss value based on the label of the input image block and the training result of the input image block.
  • the label of the input image block is the category information of whether the face is wearing a mask.
  • the category information of wearing a mask is 1, and the category information of not wearing a mask is 0.
  • the training result of an image block is a real number between 0 and 1, for example 0.12 or 0.9. When the value is close to 1, it means that the face target in the image block is wearing a mask; when the value is close to 0, it means that the face target in the image block is not wearing a mask.
  • the difference between the label of the input image patch and the training result of the input image patch is used as the YOLO loss value.
  • S605 Perform backpropagation based on the YOLO loss value to obtain updated parameters of the remaining network of the first preset model.
  • backpropagation based on the YOLO loss value means finding the gradient of the YOLO loss value with respect to each parameter of the model, and then using the gradient to update the parameters of the remaining part of the network of the first preset model through the gradient descent method.
  • gradient descent is an iterative method; simply put, it is a method for finding the minimum of an objective function. That is, when solving for the minimum value of the loss function, the gradient descent method can be used to iterate step by step to obtain the minimized loss function and the corresponding model parameter values.
  • S606 Use the update parameters to update the parameters of the remaining network of the first preset model.
  • if the iteration condition is satisfied, step S608 is executed; otherwise, the process returns to step S603.
  • the iteration condition is, for example, completing a preset number of iterative trainings, which may be 100 or 120, etc., wherein one iterative training refers to training all the image blocks in the second training set at one time.
  • each parameter of the model output after the pre-training is completed is sequentially filled into the corresponding position of the remaining part of the network of the first preset network model, and is used for the formal training of the second preset network model.
  • the parameters trained on the ImageNet data set are loaded onto the backbone network of the first preset network model, and the backbone network is frozen.
  • the first preset network model is pre-trained using the data of the second training set, the YOLO loss value is obtained based on the difference between the label of the input image block and the training result of the input image block, and the YOLO loss value is back-propagated to obtain the gradient of the YOLO loss value with respect to the model parameters; the gradient is then used to update the model parameters through the gradient descent method, and the pre-trained model is obtained through iteration.
  • this method can set the parameters of the first preset network model to a better state, thereby accelerating the convergence of the second preset network model in formal training.
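  • The iteration of steps S603 to S606 can be compressed into the following sketch. It reuses the TinyYoloLike model and optimizer from the freezing sketch above, a DataLoader stands in for the second training set, and a binary cross-entropy loss stands in for the actual YOLO loss.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical second training set: 416*416 image blocks with 0/1 mask labels.
blocks = torch.randn(32, 3, 416, 416)
labels = torch.randint(0, 2, (32,)).float()
loader = DataLoader(TensorDataset(blocks, labels), batch_size=16)  # first number = 16

num_epochs = 100                                 # preset number of iterations
for epoch in range(num_epochs):
    for xb, yb in loader:                        # S603: input 16 image blocks at a time
        scores = model(xb).mean(dim=(1, 2, 3))   # stand-in for the training result
        loss = F.binary_cross_entropy_with_logits(scores, yb)  # stand-in YOLO loss (S604)
        optimizer.zero_grad()
        loss.backward()                          # S605: backpropagation
        optimizer.step()                         # S606: gradient-descent update
```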
  • Figure 7 is a flow chart of a formal training method for a mask recognition model provided in Embodiment 3 of the present application. This embodiment is a detailed description of step S304 in Embodiment 1. As shown in Figure 7, the formal training method of this mask recognition model includes the following steps.
  • the parameters of the backbone network of the pre-trained model obtained in Embodiment 2 are sequentially filled into the corresponding positions of the backbone network of the second preset model.
  • freezing the backbone network of the second preset model ensures that the parameters of the backbone network are not changed when training the remaining part of the network of the second preset model.
  • S703 Load parameters of the remaining network of the pre-trained model onto the remaining network of the second preset model.
  • the first number of image blocks in the fourth training set is input into the second preset network model.
  • the specific first number and method of inputting the image blocks are the same as step S603 in the second embodiment, and will not be described again here.
  • S705 Determine the YOLO loss value based on the label of the input image block and the training result of the input image block.
  • the label information of the input image block is the same as that of the image block in step S604 of the second embodiment, and will not be described again here.
  • the difference between the label of the input image block and the training result of the input image block is used as the YOLO loss value.
  • S706 Perform backpropagation based on the YOLO loss value to obtain updated parameters of the remaining network of the second preset model.
  • Back propagation based on the YOLO loss value is to find the gradient of the YOLO loss value to each parameter of the model, and then iterate through the gradient descent method and use the gradient to update the model parameters.
  • in addition, the relevant parameters of the training phase need to be set according to the specific training indicator requirements. These parameters include the learning rate, the number of iterations and the decay strategy. By manually adjusting these parameters, the training of the model can be accelerated and the trained model parameters can be made more optimal, which further improves the accuracy of model recognition.
  • S707 Use the update parameters to update the parameters of the remaining network of the second preset model.
  • if the iteration condition is satisfied, step S709 is executed; otherwise, the process returns to step S704.
  • the iteration condition is, for example, that the YOLO loss value is reduced to the preset YOLO loss value, and the YOLO loss value no longer changes significantly during multiple consecutive training processes, where the change in the YOLO loss value can be judged by the variance.
  • the YOLO loss value continues to decrease during training; when the minimum YOLO loss value is obtained, the iteration is complete and training stops.
  • the backbone network in the second preset model is frozen, where the parameters of the backbone network are the same as those in the pre-trained model.
  • the YOLOV4-tiny network model is formally trained using the data of the fourth training set, and the YOLO loss value is obtained based on the difference between the label of the input image block and the training result of the input image block.
  • the YOLO loss value is back-propagated to obtain the gradient of the YOLO loss value with respect to the model parameters, iteration is then performed through the gradient descent method, and the model parameters are updated using the gradient.
  • the YOLO loss value continues to decrease during training; when it reaches its minimum, the training is stopped to obtain the mask recognition model.
  • This method can improve the recognition accuracy of target detection objects through formal training of the YOLOV4-tiny network model.
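  • Initializing the second preset model from the pre-trained model and freezing its backbone (the first steps of Figure 7) could be sketched as below, again with the TinyYoloLike stand-in from the earlier sketch rather than the real YOLOV4-tiny network.

```python
import torch

pretrained = TinyYoloLike()     # stands in for the pre-trained model of Embodiment 2
second = TinyYoloLike()         # the "second preset model" for formal training

# Backbone and remaining-part parameters are copied from the pre-trained model.
second.load_state_dict(pretrained.state_dict())

# The backbone is frozen; formal training only updates the remaining part.
for p in second.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in second.parameters() if p.requires_grad], lr=1e-3, momentum=0.9
)
```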
  • Figure 8 is a flow chart of a mask recognition method provided in Embodiment 4 of the present application.
  • the method of this embodiment uses the mask recognition model trained in Embodiments 1, 2 and 3.
  • As shown in Figure 8, the mask recognition method includes the following steps.
  • the picture to be recognized contains at least one face target.
  • the image to be recognized is an image obtained from a surveillance video.
  • the surveillance video obtained in real time is converted into frame-by-frame images through a frame-splitting method (a minimal frame-splitting sketch is given below).
  • An image obtained from a surveillance video contains at least one face target. For example, it can be one face target or three face targets.
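  • A minimal frame-splitting sketch with OpenCV is shown below; the video source is a hypothetical example, and each returned frame becomes one picture to be recognized.

```python
import cv2

def frames_from_stream(source):
    """Yield the surveillance video frame by frame as individual images."""
    cap = cv2.VideoCapture(source)
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            yield frame
    finally:
        cap.release()

for frame in frames_from_stream("rtsp://camera.example/stream"):   # hypothetical source
    pass  # cut the frame into image blocks and run the mask recognition model
```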
  • the image to be recognized is cut into multiple image blocks of the same size.
  • the specific cutting method is the same as the cutting method in Embodiment 1, which will not be described again here.
  • the method for determining the position information of each image block in the picture to be recognized is the same as that in Embodiment 1, and will not be described again here.
  • the number of image blocks is the number of image blocks obtained by cutting the image to be recognized.
  • the first recognition result of each image block obtained through the mask recognition model is the category information of whether the face in the image block wears a mask.
  • the first recognition result is a real number between 0 and 1; when the value is close to 1, it means the face target in the image block is wearing a mask, and when the value is close to 0, it means the face target in the image block is not wearing a mask.
  • when the same face target exists in different image blocks, the confidence of each image block where the face target is located is calculated based on the position information of each image block where the face target is located in the picture to be recognized and the face target detection frame in each image block.
  • Specifically, based on the position information of each image block where the face target is located in the picture to be recognized, each image block is restored to the picture to be recognized; for each image block, the ratio of the face target detection frame in the image block to the face target detection frame in the picture to be recognized is calculated, and the calculated ratio is used as the confidence of that image block.
  • the confidence of each image block having the same face target is calculated, and the first recognition result of the image block with the highest confidence is used as the recognition result of the face target.
  • the recognition result of a face target in the picture to be recognized is the first recognition result of the image block corresponding to that face target; by using the first recognition result of the image block where each face target is located, the recognition results of each face target can be obtained separately.
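  • Interpreting the ratio above as an area ratio, the confidence and the selection of the best image block could be computed as in the following sketch; boxes are assumed to be (x1, y1, x2, y2) tuples in full-picture coordinates, and the concrete numbers are illustrative.

```python
def box_area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def block_confidence(face_box, block_box):
    """Ratio of the part of the face detection frame visible inside the block
    to the full face detection frame in the picture to be recognized."""
    ix1, iy1 = max(face_box[0], block_box[0]), max(face_box[1], block_box[1])
    ix2, iy2 = min(face_box[2], block_box[2]), min(face_box[3], block_box[3])
    return box_area((ix1, iy1, ix2, iy2)) / box_area(face_box)

face = (380, 100, 520, 260)                                   # face frame in the picture
results = {(0, 0, 416, 416): 0.91, (300, 0, 716, 416): 0.12}  # block box -> first result
best = max(results, key=lambda b: block_confidence(face, b))
print(results[best])   # recognition result of this face target (0.12: no mask)
```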
  • In this embodiment, the image obtained from the surveillance video is used as the picture to be recognized, the picture to be recognized is cut into multiple image blocks of the same size, the image blocks are then input into the mask recognition model to obtain the first recognition result of each image block, and the recognition result of whether the face target in the picture to be recognized wears a mask is obtained according to the first recognition results of the image blocks.
  • the mask recognition model of this application is used to identify and check whether the target object is wearing a mask, and the accuracy of the recognition model is further verified based on the recognition results.
  • Figure 9 is a schematic structural diagram of a training device for a mask recognition model provided in Embodiment 5 of the present application.
  • the training device 90 of the mask recognition model includes: a first cutting module 901, a second cutting module 902, a pre-training module 903, and a formal training module 904.
  • the first cutting module 901 is used to cut each picture in the first training set into multiple image blocks, and label the image blocks to obtain the second training set.
  • the first training set is the first face mask data set, and the labels of the image blocks in the second training set include the position information of the image block in the picture and the category information of whether the face in the image block wears a mask;
  • the second cutting module 902 is used to cut each picture in the third training set into multiple image blocks, and label the image blocks to obtain a fourth training set.
  • the third training set is the second face mask data set.
  • the label of the image block in the fourth training set includes the position information of the image block in the picture and the category information of whether the face in the image block is wearing a mask;
  • the pre-training module 903 is used to pre-train the first pre-set model using the second training set to obtain the pre-training model.
  • the first preset model includes a backbone network and a remaining part of the network, and the parameters of the backbone network of the first preset model are frozen during the pre-training process;
  • Formal training module 904 is used to perform formal training using the fourth training set and a second preset model to obtain a mask recognition model.
  • the second preset model includes a backbone network and a remaining part of the network, the parameters of the backbone network of the second preset model are the parameters of the backbone network of the pre-trained model, the initial parameters of the remaining part of the network of the second preset model are the parameters of the remaining part of the network of the pre-trained model, and the parameters of the backbone network of the second preset model are frozen during the formal training process.
  • In a possible implementation, the first preset model adopts the YOLOV4-tiny network model, and the pre-training module 903 is specifically used for:
  • the first number of image blocks in the second training set are input into the first preset model for training, and the training results are obtained;
  • parameters of the remaining part of the network of the first preset model are updated.
  • In a possible implementation, the second preset model uses the YOLOV4-tiny network model, and the formal training module 904 is specifically used for:
  • the first number of image blocks in the fourth training set are input into the second preset model for training, and the training results are obtained;
  • parameters of the remainder of the network of the second preset model are updated.
  • first cutting module 901 and the second cutting module 902 are specifically used for:
  • the coordinates of the frame of the image block are obtained according to the coordinates of the picture frame marked on the picture and the size of the sliding image frame.
  • first cutting module 901 and the second cutting module 902 also include:
  • the enhancement unit is used to perform data enhancement processing on the pictures in the first training set and the third training set, where the data enhancement processing includes one or more of the following processes: randomly adjusting the picture size, randomly adjusting the picture contrast, randomly adjusting the picture hue, randomly adjusting the picture brightness, randomly adding noise to the picture, randomly changing the color model of the picture, and randomly cropping the picture.
  • the size of the image block in the first cutting module 901 or the second cutting module 902 is 416*416 pixels
  • the backbone network in the pre-training module or the formal training module includes 6 series-connected cross-stage partial CSP networks, each of which is used to extract features from the input image;
  • the target CSP network in the backbone network is connected to the input of the CAT module.
  • the output of the CAT module is connected to the remaining network.
  • the feature maps extracted by the target CSP networks are 26*26 pixels and 13*13 pixels, and the CAT module is used to concatenate the feature maps extracted by the target CSP networks.
  • the proportion of face targets in some pictures in the first face mask data set is greater than a preset threshold, and the proportion of face targets in the remaining pictures is less than or equal to the preset threshold;
  • the proportion of the face targets included in each picture in the second face mask data set is greater than the preset threshold.
  • the device provided in this embodiment can be used to perform the method steps of the above-mentioned Embodiment 1, 2 or 3.
  • the specific implementation methods and technical effects are similar and will not be described again here.
  • Figure 10 is a schematic structural diagram of a mask identification device provided in Embodiment 6 of the present application. As shown in Figure 10, the mask recognition device 10 includes: a cutting module 110, an input module 120, a calculation module 130, and an output module 140.
  • the cutting module 110 is used to cut the picture to be recognized into multiple image blocks, and determine the position information of each image block in the picture to be recognized, where the picture to be recognized contains at least one face target;
  • the input module 120 is used to input multiple image blocks into the mask recognition model to obtain a first recognition result for each image block.
  • the first recognition result is used to indicate whether the face in the image block wears a mask;
  • the calculation module 130 is used to, when the same face target in the picture to be recognized exists in different image blocks, calculate the confidence of each image block where the face target is located according to the position information of each image block where the face target is located in the picture to be recognized and the face target detection frame in each image block, and select the first recognition result of the image block with the highest confidence as the recognition result of the face target;
  • the output module 140 is used to output the recognition result of the face target in the picture to be recognized.
  • the device provided in this embodiment can be used to perform the method steps of the fourth embodiment.
  • the specific implementation methods and technical effects are similar and will not be described again here.
  • Figure 11 is a schematic structural diagram of an electronic device 11 for mask recognition provided in Embodiment 7 of the present application; the electronic device includes at least one processor 111 and a memory 112;
  • the memory 112 stores instructions that can be executed by at least one processor 111 , and the instructions are executed by at least one processor 111 so that at least one processor 111 can execute the training method of the mask recognition model as described above.
  • Embodiment 8 of the present application provides a computer-readable storage medium.
  • Computer-executable instructions are stored in the computer-readable storage medium.
  • when the computer-executable instructions are executed by a processor, they are used to implement the method steps in the above method embodiments; the specific implementation methods and technical effects are similar and will not be described again here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

This application provides a training method, apparatus, device and storage medium for a mask recognition model. The method includes: cutting each picture in a first training set into multiple image blocks of the same size and labeling the image blocks to obtain a second training set; cutting each picture in a third training set into multiple image blocks of the same size and labeling the image blocks to obtain a fourth training set; pre-training a first preset model using the second training set to obtain a pre-trained model; and performing formal training using the fourth training set and a second preset model to obtain a mask recognition model. The first preset model and the second preset model may adopt the YOLOV4-tiny network model, which includes a backbone network and a remaining part of the network; the parameters of the backbone network are frozen during both pre-training and formal training. The method improves the running speed of the model and the accuracy of recognizing small targets that are far away.

Description

Training method, apparatus, device and storage medium for mask recognition model
This application claims priority to the Chinese patent application No. 202210549746.2, entitled "Training method, apparatus, device and storage medium for mask recognition model", filed with the China National Intellectual Property Administration on May 20, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of new-generation information technology, and in particular to a training method, apparatus, device and storage medium for a mask recognition model.
Background Art
With the development of computer vision technology, detecting faces in images and recognizing whether a mask is worn by means of computer vision has very important research significance and application value.
At present, masks are mostly recognized with neural networks, and many security monitoring systems implement mask recognition through system upgrades. For example, a multi-task cascaded convolutional neural network (Multi-task convolutional neural network, MTCNN for short) is used as the network model for mask-wearing recognition, regions of interest (Region of interest, ROI for short) are marked on spectral images to obtain coordinate and category information, and a support vector machine (Support Vector Machine, SVM for short) classifier is trained, so as to classify and judge whether a mask is worn.
Existing recognition methods have a large number of model parameters, which leads to slow running speed and a long recognition cycle, and they cannot accurately recognize small targets that are far away.
Summary of the Invention
This application provides a training method, apparatus, device and storage medium for a mask recognition model, which are used to improve the accuracy of recognizing small targets that are far away, and at the same time to solve the problem of slow running speed and a long recognition cycle caused by a large number of model parameters.
第一方面,本申请实施例提供一种口罩识别模型的训练方法,包括:
将第一训练集中的各图片分别切割成多个图像块,并对图像块进行标签标注,得到第二训练集,该第一训练集为第一人脸口罩数据集,第二训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息;
将第三训练集中的各图片分别切割成多个图像块,并对图像块进行标签标注,得到第四训练集,该第三训练集为第二人脸口罩数据集,第四训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息;
使用第二训练集对第一预设模型进行预训练,得到预训练模型,该第一预设模型包括主干网络和剩余部分网络,在预训练过程中第一预设模型的主干网络的参数被冻结;
使用第四训练集和第二预设模型进行正式训练，得到口罩识别模型，该第二预设模型包括主干网络和剩余部分网络，第二预设模型的主干网络的参数为预训练模型的主干网络的参数，第二预设模型的剩余部分网络的初始参数为预训练模型的剩余部分网络的参数，在正式训练过程中第二预设模型的主干网络的参数被冻结。
一种可能的实现方式中,第一预设模型采用YOLOV4-tiny网络模型,使用第二训练集对第一预设模型进行预训练,得到预训练模型,包括:
将ImageNet数据集上训练得到的参数加载到第一预设模型的主干网络上;
加载完成之后,冻结第一预设模型的主干网络;
使用如下迭代过程更新第一预设模型的剩余部分网络的参数,直至迭代条件满足,则将训练得到的模型作为预训练模型:
每次将第二训练集中的第一数量的图像块输入第一预设模型进行训练,得到训练结果;
根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
根据YOLO损失值进行反向传播,得到第一预设模型的剩余部分网络的更新参数;
使用更新参数更新第一预设模型的剩余部分网络的参数。
一种可能的实现方式中,第二预设模型采用YOLOV4-tiny网络模型,使用第四训练集和第二预设模型进行正式训练,得到口罩识别模型,包括:
将预训练模型的主干网络的参数加载到第二预设模型的主干网络上;
加载完成之后,冻结第二预设模型的主干网络;
将预训练模型的剩余部分网络的参数加载到第二预设模型的剩余部分网络上;
使用如下迭代过程更新第二预设模型的剩余部分网络的参数，直至迭代条件满足，则将训练得到的模型作为口罩识别模型：
每次将第四训练集中的第一数量的图像块输入第二预设模型进行训练,得到训练结果;
根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
根据YOLO损失值进行反向传播，得到第二预设模型的剩余部分网络的更新参数；
使用更新参数更新第二预设模型的剩余部分网络的参数。
一种可能的实现方式中,将第一训练集和第三训练集中的各图片分别切割成多个图像块,包括:
从图片的任意一个角开始,以预设像素滑动图像框,将图像框内的图像切割下来形成图像块,其中,预设像素的大小小于图像框的长度和宽度;
根据图片标有的图片框的坐标和滑动图像框的大小得到图像块的框的坐标。
一种可能的实现方式中,将第一训练集和第三训练集中的各图片分别切割成多个图像块之前,还包括:
对第一训练集和第三训练集中的图片进行数据增强处理,该数据增强处理包括以下处理中的一种或者多种:随机调整图片大小、随机调整图片对比度、随机调整图片色调、随机调整图片亮度、随机为图片添加噪声、随机改变图片色彩模型、随机裁剪图片。
一种可能的实现方式中,图像块的大小为416*416像素,主干网络包括6个串联的跨阶段部分CSP网络,每个CSP网络用于对输入的图像进行特征提取;
主干网络中的目标CSP网络与CAT模块的输入连接,CAT模块的输出端与剩余部分网络连接,目标CSP网络提取的特征图为26*26像素以及13*13像素,CAT模块用于连接目标CSP网络提取的特征图。
一种可能的实现方式中,第一人脸口罩数据集中部分图片的人脸目标的占比大于预设阈值,剩余部分图片的人脸目标的占比小于等于预设阈值;
第二人脸口罩数据集中的各图片包括的人脸目标的占比大于预设阈值。
第二方面,本申请实施例提供的一种口罩识别方法,应用于第一方面所述方法训练得到的口罩识别模型,该口罩识别方法包括:
将待识别图片切割成多个图像块,并确定各图像块在待识别图片中的位置信息,该待识别图片中包含至少一个人脸目标;
将多个图像块输入口罩识别模型，得到每个图像块的第一识别结果，该第一识别结果用于表示图像块中的人脸是否佩戴口罩；
当待识别图片中的同一人脸目标存在于不同的图像块时,根据人脸目标所在的各图像块在待识别图片中的位置信息,以及各图像块中的人脸目标检测框,计算人脸目标所在的各图像块的置信度,并选取置信度最大的图像块的第一识别结果作为人脸目标的识别结果;
输出待识别图片中人脸目标的识别结果。
一种可能的实现方式中,根据人脸目标所在的各图像块在待识别图片中的位置信息,以及各图像块中的人脸目标检测框,计算人脸目标所在的各图像块的置信度,包括:
根据人脸目标所在的各图像块在待识别图片中的位置信息,将各图像块恢复到待识别图片中;
对于每个图像块,计算图像块中人脸目标检测框与待识别图片中人脸目标检测框的比值;
将计算得到的比值作为图像块的置信度。
第三方面,本申请实施例提供的一种口罩识别模型的训练装置,包括:
第一切割模块,用于将第一训练集中的各图片分别切割成多个图像块,并对图像块进行标签标注,得到第二训练集,该第一训练集为第一人脸口罩数据集,第二训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息;
第二切割模块,用于将第三训练集中的各图片分别切割成多个图像块,并对图像块进行标签标注,得到第四训练集,该第三训练集为第二人脸口罩数据集,第四训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息;
预训练模块,用于使用第二训练集对第一预设模型进行预训练,得到预训练模型,该第一预设模型包括主干网络和剩余部分网络,在预训练过程中第一预设模型的主干网络的参数被冻结;
正式训练模块，用于使用第四训练集和第二预设模型进行正式训练，得到口罩识别模型，该第二预设模型包括主干网络和剩余部分网络，第二预设模型的主干网络的参数为预训练模型的主干网络的参数，第二预设模型的剩余部分网络的初始参数为预训练模型的剩余部分网络的参数，在正式训练过程中第二预设模型的主干网络的参数被冻结。
一种可能的实现方式中，第一预设模型采用YOLOV4-tiny网络模型，预训练模块，具体用于：
将ImageNet数据集上训练得到的参数加载到第一预设模型的主干网络上;
加载完成之后,冻结第一预设模型的主干网络;
使用如下迭代过程更新第一预设模型的剩余部分网络的参数,直至迭代条件满足,则将训练得到的模型作为预训练模型:
每次将第二训练集中的第一数量的图像块输入第一预设模型进行训练,得到训练结果;
根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
根据YOLO损失值进行反向传播,得到第一预设模型的剩余部分网络的更新参数;
使用更新参数更新第一预设模型的剩余部分网络的参数。
一种可能的实现方式中,第二预设模型采用YOLOV4-tiny网络模型,正式训练模块,具体用于:
将预训练模型的主干网络的参数加载到第二预设模型的主干网络上;
加载完成之后,冻结第二预设模型的主干网络;
将预训练模型的剩余部分网络的参数加载到第二预设模型的剩余部分网络上;
使用如下迭代过程更新第二预设模型的剩余部分网络的参数,直至迭代条件满足,则将训练得到的模型作为口罩识别模型:
每次将第四训练集中的第一数量的图像块输入第二预设模型进行训练,得到训练结果;
根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
根据YOLO损失值进行反向传播，得到第二预设模型的剩余部分网络的更新参数；
使用更新参数更新第二预设模型的剩余部分网络的参数。
一种可能的实现方式中,第一切割模块和第二切割模块,具体用于:
从图片的任意一个角开始,以预设像素滑动图像框,将图像框内的图像切割下来形成图像块,其中,预设像素的大小小于图像框的长度和宽度;
根据图片标有的图片框坐标和滑动图像框的大小得到图像块的框的坐标。
一种可能的实现方式中,第一切割模块和第二切割模块,还包括:
增强单元,用于对第一训练集和第三训练集中的图片进行数据增强处理,该数据增强处理包括以下处理中的一种或者多种:随机调整图片大小、随机调整图片对比度、随机调整图片色调、随机调整图片亮度、随机为图片添加噪声、随机改变图片色彩模型、随机裁剪图片。
一种可能的实现方式中,第一切割模块或第二切割模块中图像块的大小为416*416像素,预训练模块或正式训练模块中主干网络包括6个串联的跨阶段部分CSP网络,每个CSP网络用于对输入的图像进行特征提取;
主干网络中的目标CSP网络与CAT模块的输入连接,该CAT模块的输出端与剩余部分网络连接,目标CSP网络提取的特征图为26*26像素以及13*13像素,CAT模块用于连接目标CSP网络提取的特征图。
一种可能的实现方式中,第一人脸口罩数据集中部分图片的人脸目标的占比大于预设阈值,剩余部分图片的人脸目标的占比小于等于预设阈值;
第二人脸口罩数据集中的各图片包括的人脸目标的占比大于预设阈值。
第四方面,一种口罩识别装置,该口罩识别装置包括:
切割模块,用于将待识别图片切割成多个图像块,并确定各图像块在待识别图片中的位置信息,该待识别图片中包含至少一个人脸目标;
输入模块,用于将多个图像块输入口罩识别模型,得到每个图像块的第一识别结果,该第一识别结果用于表示图像块中的人脸是否佩戴口罩;
计算模块,当待识别图片中的同一人脸目标存在于不同的图像块时,根据人脸目标所在的各图像块在待识别图片中的位置信息,以及各图像块中的人脸目标检测框,计算人脸目标所在的各图像块的置信度,并选取置信度最大的图像块的第一识别结果作为人脸目标的识别结果;
输出模块,用于输出待识别图片中人脸目标的识别结果。
一种可能的实现方式中,计算模块,具体用于:
根据人脸目标所在的各图像块在待识别图片中的位置信息,将各图像块恢复到待识别图片中;
对于每个图像块,计算图像块中人脸目标检测框与待识别图片中人脸目标检测框的比值;
将计算得到的比值作为图像块的置信度。
第五方面,一种口罩识别的电子设备,包括:
至少一个处理器;
以及与所述至少一个处理器通信连接的存储器;
其中,存储器存储有可被至少一个处理器执行的指令,指令被至少一个处理器执行,以使至少一个处理器能够执行本申请第一方面提供的口罩识别模型的训练方法。
第六方面,一种计算机可读存储介质,计算机可读存储介质中存储有计算机执行指令,计算机执行指令被处理器执行时用于实现本申请第一方面提供的口罩模型的训练方法。
本申请提供的一种口罩识别模型的训练方法、装置、设备及存储介质,通过对模型结构的设计及对训练集图像块的小目标改进处理,减小了模型参数量,从而加快了运行速度,缩短了识别周期,同时提高了对于距离较远的小目标识别的准确性。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,示出了符合本申请的实施例,并与说明书一起用于解释本申请的原理。
图1为本申请YOLOV4-tiny网络模型的结构示意图;
图2为本申请YOLOV4-tiny网络模型中CSP网络的结构示意图;
图3为本申请实施例一提供的一种口罩识别模型的训练方法的流程图;
图4为图片切割的一种示意图;
图5为图像块位置信息的一种示意图;
图6为本申请实施例二提供的一种口罩识别模型的预训练方法的流程图;
图7为本申请实施例三提供的一种口罩识别模型的正式训练方法的流程图；
图8为本申请实施例四提供的一种口罩识别方法的流程图;
图9为本申请实施例五提供的一种口罩识别模型的训练装置的结构示意图;
图10为本申请实施例六提供的一种口罩识别装置的结构示意图;
图11为本申请实施例七提供的一种口罩识别设备的结构示意图。
通过上述附图,已示出本申请明确的实施例,后文中将有更详细的描述。这些附图和文字描述并不是为了通过任何方式限制本申请构思的范围,而是通过参考特定实施例为本领域技术人员说明本申请的概念。
具体实施方式
这里将详细地对示例性实施例进行说明，其示例表示在附图中。下面的描述涉及附图时，除非另有表示，不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反，它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。
本申请实施例提供一种口罩识别模型的训练方法和使用方法,该口罩识别模型用于识别人脸是否佩戴口罩,该口罩识别模型可以采用YOLOV4-tiny网络模型,图1为本申请YOLOV4-tiny网络模型的结构示意图。如图1所示,YOLOV4-tiny网络模型主要由主干网络107、Concat连接模块108和剩余部分网络109组成。其中,主干网络107为6个串联的跨阶段部分网络(Cross Stage Partial network,简称CSP网络),如图中101至106所示,每个CSP网络用于对输入的图像块进行特征提取,各CSP网络提取的特征图的大小不同,按照串联顺序,CSP网络提取的特征图逐渐减小,图1所示结构中,最后两级CSP网络与Concat连接模块108的输入端连接,将这两级CSP网络提取到的目标特征图经Concat连接模块108的输出端进入剩余部分网络109,剩余部分网络109为一个卷积层,对输入的目标特征图进行卷积处理。
示例性的,当输入416*416像素的图像块时,经过主干网络串联的6个跨阶段部分CSP网络依次提取得到的特征图为208*208像素、104*104像素、52*52像素、26*26像素和13*13像素,26*26像素和13*13像素的特征图通过Concat连接模块输入到剩余网络中,经过剩余网络对图像块的处理,最终输出图像块的识别结果。
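为了更直观地说明上述数据流，下面给出一个示意性的Python（PyTorch）代码草图：主干网络由6个串联的CSP网络组成，取最后两级（26*26与13*13）的特征图经Concat连接模块后送入剩余部分网络。其中各CSP阶段的内部实现、通道数、下采样位置以及先将13*13特征图上采样到26*26再拼接的做法均为示例假设，并非本申请的正式实现。

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def csp_stage(c_in, c_out, stride):
    # 占位的"CSP网络"：此处用一个卷积层代替，真实内部结构见图2，仅作数据流示意
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, stride, padding=1), nn.LeakyReLU(0.1))

class TinyBackboneSketch(nn.Module):
    def __init__(self):
        super().__init__()
        chs = [3, 32, 64, 128, 256, 512, 512]     # 通道数为示例假设
        strides = [1, 2, 2, 2, 2, 2]              # 下采样位置为示例假设
        self.stages = nn.ModuleList(
            csp_stage(chs[i], chs[i + 1], strides[i]) for i in range(6))  # 6个串联的CSP网络

    def forward(self, x):                         # x: [N, 3, 416, 416]
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        f26, f13 = feats[-2], feats[-1]           # 目标CSP网络的特征图：26*26与13*13
        f13_up = F.interpolate(f13, scale_factor=2, mode="nearest")   # 假设先上采样到26*26
        return torch.cat([f26, f13_up], dim=1)    # Concat连接后送入剩余部分网络（卷积层）
```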
本申请采用YOLOV4-tiny网络模型对待识别图片中的人脸目标是否佩戴口罩进行识别，该网络模型通过使用多个具有跳跃连接结构的CSP网络，提高了模型的运行速度。
下面结合图2对YOLOV4-tiny网络模型中的CSP网络结构进行详细说明。
图2为本申请YOLOV4-tiny网络模型中CSP网络的结构示意图。如图2所示,CSP网络由输入模块201,输出模块206,卷积层202、203、204、205,CAT模块25、26组成。其中,卷积层202、卷积层203和卷积层204的卷积核大小均为3×3,卷积层205的卷积核大小为1×1。CAT模块25、26的作用为将两个数组在不改变数组特征的前提下相连接。CSP网络中含有两个跳跃连接结构,如图中207和208所示,其作用为将输入的特征图在通道上一分为二,输入的特征图通过跳跃连接结构后只需对一个通道上的特征图进行卷积计算,从而减少了模型的计算量,加快了计算机的运行速度。
具体的,如图2所示,经卷积层202输出的特征图经过跳跃连接结构207时在通道上分为路径21和路径22,经路径21输出的特征图直接与卷积层205输出的特征图在CAT模块26中连接并进入输出模块206。经路径22输出的特征图进入卷积层203,经卷积层203进行卷积计算后输出的特征图经过跳跃连接结构208时在通道上分为路径23和路径24,经路径24输出的特征图进入卷积层204,经路径23输出的特征图与经卷积层204输出的特征图在CAT模块25中连接并进入卷积层205。
示例性的，输入模块201输入64@104*104像素的特征图，进入第一个卷积层202，经卷积函数Conv3×3对特征图进行卷积计算处理后输出特征图，输出的特征图经过跳跃连接结构207时，在通道上将特征图一分为二使得特征图的通道减半。然后将其中一部分32@104*104像素的特征图经路径21进入CAT模块26，另一部分32@104*104像素的特征图经路径22进入卷积层203中，经卷积函数Conv3×3对特征图进行卷积计算处理后输出特征图，输出的特征图经过跳跃连接结构208时，在通道上将特征图一分为二使得特征图的通道减半。然后将其中一部分16@104*104像素的特征图经路径23进入CAT模块25，另一部分16@104*104像素的特征图经路径24进入卷积层204中，经卷积函数Conv3×3对特征图进行卷积计算处理后输出特征图，输出的特征图进入CAT模块25与经路径23输入的特征图相结合，然后将特征图输入卷积层205，经卷积函数Conv1×1对特征图进行卷积计算处理后将特征图输入CAT模块26，与经路径21输入的特征图在CAT模块26中相结合，然后将特征图经输出模块206输出。
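按照图2所示的处理流程，下面给出一个示意性的Python（PyTorch）代码草图，用于说明CSP网络中两次通道对半切分与两次CAT连接的先后关系；其中各卷积层的输出通道数、类名与函数名均为示例假设，并非本申请的正式实现。

```python
import torch
import torch.nn as nn

class CSPBlockSketch(nn.Module):
    """按图2流程编写的CSP网络示意代码，通道数为示例假设。"""
    def __init__(self, channels=64):
        super().__init__()
        half, quarter = channels // 2, channels // 4
        self.conv202 = nn.Conv2d(channels, channels, 3, padding=1)   # Conv3×3
        self.conv203 = nn.Conv2d(half, half, 3, padding=1)           # Conv3×3
        self.conv204 = nn.Conv2d(quarter, quarter, 3, padding=1)     # Conv3×3
        self.conv205 = nn.Conv2d(half, half, 1)                      # Conv1×1

    def forward(self, x):                     # x: 64@104*104
        x = self.conv202(x)
        p21, p22 = torch.chunk(x, 2, dim=1)   # 跳跃连接结构207：通道一分为二，各32@104*104
        y = self.conv203(p22)
        p23, p24 = torch.chunk(y, 2, dim=1)   # 跳跃连接结构208：通道一分为二，各16@104*104
        z = self.conv204(p24)
        cat25 = torch.cat([p23, z], dim=1)    # CAT模块25：与路径23的特征图相结合
        w = self.conv205(cat25)
        cat26 = torch.cat([p21, w], dim=1)    # CAT模块26：与路径21的特征图相结合
        return cat26                          # 经输出模块206输出
```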
下面以具体地实施例对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。下面将结合附图,对发明的实施例进行描述。
需要说明的是,下面具体实施例中用到的网络模型采用图1所示的YOLOV4-tiny网络模型。
在使用YOLOV4-tiny网络模型识别待测图片中人脸目标是否佩戴口罩前,需要对YOLOV4-tiny网络模型中的参数进行训练得到具有最优识别效率的YOLOV4-tiny网络模型。下面,结合图3对YOLOV4-tiny网络模型的训练进行详细说明。
图3为本申请实施例一提供的一种口罩识别模型的训练方法的流程图。如图3所示，该口罩识别模型的训练方法包括以下步骤。
S301,将第一训练集中的各图片分别切割成多个图像块,并对该图像块进行标签标注,得到第二训练集。
第一训练集为第一人脸口罩数据集，具体的，第一人脸口罩数据集中部分图片的人脸目标的占比大于预设阈值，剩余部分图片的人脸目标的占比小于等于预设阈值，示例性的，该数据集可以为口罩人脸数据集（FaceMask_CelebA），也可以为口罩遮挡人脸数据集（Real-World Masked Face Dataset，简称RMFD）中真实口罩人脸识别数据集、模拟口罩人脸识别数据集和真实口罩人脸验证数据集中的一种或几种。
具体的,该数据集包含大量图片以及图片的标签,该标签为人脸是否佩戴口罩的类别信息。示例性的,当图片中的人脸佩戴口罩时,其类别信息为1,人脸没有佩戴口罩时,其类别信息为0,当然,类别信息还可以通过其他形式表示,本实施例不对此进行限制。
将第一训练集中的各图片切割成多个图像块,图像块的大小可以相同,也可以不同,本实施例不对此进行限定。并对切割得到的图像块进行标签标注,具体的,该标签包括图像块在各图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息。
目前口罩佩戴的识别方法对于距离较远的小目标检测效果较差,为了提高对距离较远的小目标的检测效果,本申请采用先放大再检测的方式,即将训练集中的各图片切割成多个图像块,然后对图像块进行检测。
图4为图片切割的一种示意图。如图4所示,使用一定大小的图像框40按预设像素依次在图片上滑动,并将每个图像框内的区域切割下来得到图像块,其中,预设像素的大小小于一定大小图像框的长度和宽度。示例性的,图像框40的大小可以为416*416,预设像素大小可以为300。
如图4所示,一种可能的实施方式中,从图片的左上角开始,使用固定大小的图像框从左向右按预设像素依次滑动,并将图像框内的图像切割下来形成多个图像块。在图像框滑动到最右端时,将图像框按滑动步长向下滑动一个步长,然后,按预设像素从左向右依次滑动或者从右向左依次滑动,按照上述方式完成图像块的切割。其中,滑动步长为多个像素的大小,滑动步长和预设像素可以相同,也可以不同。
另一种可能的实施方式中，从图片的左上角开始，使用大小一定的框在图片上从上向下按预设像素的大小依次滑动，并将图像框内的图像切割下来形成多个图像块。在图像框滑动到最下端时，将图像框按滑动步长向右滑动一个步长，然后，按预设像素从上向下依次滑动或者从下向上依次滑动，按照上述方式完成图像块的切割。其中，滑动步长为多个像素的大小，滑动步长和预设像素可以相同，也可以不同。
将切割后的图像块进行标签标注得到第二训练集,第二训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息,图像块的类别信息与切割之前图片的类别信息相同,即切割之前图片中人脸的类别信息为佩戴口罩,那么切割后各个图像块中人脸的类别信息也为佩戴口罩,其中,在切割完成之后,有部分图像块中可能没有人脸,那么没有人脸的图像块的类别信息为没有佩戴口罩。
图像块的位置信息可以为图像块在图片中切割点的位置信息,可以根据切割之前图片的坐标得到。图5为图像块位置的一种示意图。如图5所示,定义图片的点A(0,0)为坐标原点,点B和点C分别为第一个图像块和第二个图像块在图片中的位置信息。可以理解,这里图像块的位置并不是图像块的物理坐标,而是图像块的像素位置。示例性的,当切割框的大小为416*416,滑动像素为300,则得到第一个图像块的位置信息为B(416,416),第二个图像块的位置信息为C(716,416)。依次类推,根据图片的坐标和切割时图像框的大小以及滑动像素进行平移操作,可以得到每个图像框在图片中的位置信息。
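下面给出一个按上述滑动切割方式实现的Python示意函数：以416*416的图像框、300像素的滑动步长在图片上滑动，同时记录每个图像块的左上角偏移以及与图5示例一致的滑动后角点坐标。函数名cut_into_blocks、返回值的组织形式以及对图片边缘不足一个图像框部分的处理均为示例假设。

```python
import numpy as np

def cut_into_blocks(img, win=416, stride=300):
    # 从图片左上角开始，以预设像素滑动固定大小的图像框，将框内图像切割为图像块
    h, w = img.shape[:2]
    blocks = []
    for y0 in range(0, max(h - win, 0) + 1, stride):
        for x0 in range(0, max(w - win, 0) + 1, stride):
            block = img[y0:y0 + win, x0:x0 + win]
            corner = (x0 + win, y0 + win)   # 与图5一致：第一块为(416,416)，第二块为(716,416)
            blocks.append({"block": block, "offset": (x0, y0), "corner": corner})
    return blocks

# 用法示例（图片尺寸为假设值）
blocks = cut_into_blocks(np.zeros((1080, 1920, 3), dtype=np.uint8))
```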
可选的,将第一训练集中的各图片分别切割成多个图像块之前,需要对第一训练集的各图片进行数据增强处理。具体的,数据增强处理的方式可以为以下处理方式中的一种或几种:随机调整图片大小、随机调整图片对比度、随机调整图片色调、随机调整图片亮度、随机为图片添加噪声、随机改变图片色彩模型、随机裁剪图片。
示例性的,随机调整图片亮度和对比度,可以改变图片的质量,使图片的质量与真实拍摄场景下因空气质量等环境因素造成的成像质量不一的情况相契合;随机裁剪图片,可以改变图片中人脸目标的位置,使图片中人脸目标的位置与真实场景下人脸目标位置变化导致的前景和后景的景深变化相契合;由此可见,通过对数据集图片进行增强处理,使最终训练集中的图片包含了常规取景状态下可能存在的各种影响成像质量的问题。
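下面给出一个示意性的数据增强流水线（Python，基于torchvision），覆盖上述随机裁剪、调整大小、亮度、对比度、色调、色彩模型以及加噪等处理；各概率与幅度均为示例假设，并非本申请的正式参数。

```python
import torch
from torchvision import transforms

# 示意性的数据增强流水线，输入为PIL格式的训练集图片
augment = transforms.Compose([
    transforms.RandomResizedCrop(416, scale=(0.6, 1.0)),            # 随机裁剪并调整图片大小
    transforms.ColorJitter(brightness=0.4, contrast=0.4, hue=0.1),  # 随机调整亮度、对比度、色调
    transforms.RandomGrayscale(p=0.1),                              # 随机改变图片色彩模型（示例：转灰度）
    transforms.ToTensor(),
    transforms.Lambda(lambda t: (t + 0.02 * torch.randn_like(t)).clamp(0, 1)),  # 随机为图片添加噪声
])
```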
S302,将第三训练集中的各图片分别切割成多个图像块,并对图像块进行标签标注,得到第四训练集。
第三训练集为第二人脸口罩数据集，具体的，第二人脸口罩数据集中的各图片包括的人脸目标的占比大于预设阈值，该数据集中的图片是工作人员通过日常拍摄得到的，与第一人脸口罩数据集中的图片相比，该数据集中的图片更接近于监控摄像头下拍摄的图片，即图片中人脸目标更接近于小目标。本申请通过对该非公开的人脸训练集中的图片进行训练，可以提升对小目标检测的准确性。
具体的,该数据集包含大量图片以及图片的标签,该标签为人脸是否佩戴口罩的类别信息。示例性的,当图片中的人脸佩戴口罩时,其类别信息为1,人脸没有佩戴口罩时,其类别信息为0,当然,类别信息还可以通过其他形式表示,本实施例不对此进行限制。
将切割后的图像块进行标签标注得到第四训练集,第四训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息,图像块的类别信息与切割之前图片的类别信息相同,即切割之前图片中人脸的类别信息为佩戴口罩,那么切割后各个图像块中人脸的类别信息也为佩戴口罩,其中,在切割完成之后,有部分图像块中可能没有人脸,那么没有人脸的图像块的类别信息为没有佩戴口罩。
具体的,对第三训练集中图片的切割方法以及标注图像块位置信息的方法与步骤S301相同,这里不再赘述。
可选的,将第三训练集中的各图片分别切割成多个图像块之前,需要对第三训练集的各图片进行数据增强处理,其增强处理方式及其效果与步骤S301中相同,这里不再赘述。
S303,使用第二训练集对第一预设模型进行预训练,得到预训练模型。
第一预设模型采用YOLOV4-tiny网络模型,YOLOV4-tiny网络模型可以采用如图1所示的结构,可以理解,YOLOV4-tiny网络模型的网络结构可以变换,例如,CSP网络的数量大于或小于图1所示网络模型。将第二训练集中的图像块每次按一定数量输入第一预设模型中进行模型预训练,示例性的,每次输入图像块的数量可以为16张,相应的,会得到16个图像块的输出结果,该输出结果用于表示图像块中的人脸是否佩戴口罩。根据模型训练的输出结果与图像块的标签得到YOLO损失值,通过YOLO损失值进行反向传播,利用梯度下降法进行迭代不断更新模型参数,直到迭代结束得到预训练模型。
第一预设模型包括主干网络和剩余部分网络，其中，在预训练过程中第一预设模型的主干网络的参数被冻结，主干网络的参数被冻结是指主干网络的参数在迭代训练过程中不更新，保持不变，只更新剩余部分网络的参数，通过冻结主干网络的参数，可以加快训练过程，减少训练时间。
S304,使用第四训练集和第二预设模型进行正式训练,得到口罩识别模型。
第二预设模型采用YOLOV4-tiny网络模型，YOLOV4-tiny网络模型可以采用图1所示的结构。将第四训练集中的图像块每次按一定数量输入第二预设模型中进行正式训练，示例性的，每次输入图像块的数量可以为16张，相应的，会得到16个图像块的输出结果，该输出结果用于表示图像块中的人脸是否佩戴口罩。根据模型训练的输出结果与图像块的标签得到YOLO损失值，通过YOLO损失值进行反向传播，利用梯度下降法进行迭代不断更新模型参数，直到迭代结束得到口罩识别模型。
第二预设模型包括主干网络和剩余部分网络,该第二预设模型的主干网络的参数为预训练模型的主干网络的参数,第二预设模型的剩余部分网络的初始参数为预训练模型的剩余部分网络的参数,通过预训练将剩余部分网络的参数置为一个较好的状态,可以加快正式训练中模型的收敛。
其中,在进行正式训练的过程中与预训练过程相同,主干网络的参数被冻结,只更新剩余部分网络的参数。
本实施例中,通过对第一训练集和第三训练集中的图片进行切割处理,分别得到第二训练集和第四训练集,并通过将第二训练集中的图像块输入第一预设模型进行预训练得到预训练模型,在预训练模型的基础上将第四训练集中图像块输入第二预设模型进行正式训练,得到口罩识别模型。该方法通过将图片切割为多个图像块,对小的图像块进行训练和识别,提高了对于距离较远的小目标识别的准确率。
图6为本申请实施例二提供的一种口罩识别模型的预训练方法的流程图。该实施例是对实施例一中步骤S303的详细说明。如图6所示,该口罩识别模型的预训练方法包括以下步骤。
S601,将ImageNet数据集上训练得到的参数加载到第一预设模型的主干网络上。
将ImageNet数据集中的图像块输入到YOLOV4-tiny网络模型的主干网络上,并在主干网络上对模型的参数进行训练,最后将训练好的各个参数依次填入主干网络的对应位置。
示例性的,主干网络的模型为n=Ax+By+Cz时,将ImageNet数据集输入主干网络并进行训练,可以得到参数A、B、C的值并将该参数值填入主干网络的对应位置。
S602,加载完成之后,冻结第一预设模型的主干网络。
根据步骤S601，将各个参数依次填入主干网络的对应位置后，冻结主干网络，即使得在进行预训练时主干网络部分的参数不会被更改。
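以下为对应步骤S601与S602的一个Python（PyTorch）示意片段：把在ImageNet数据集上训练得到的参数加载到主干网络，然后冻结主干网络；其中model、model.backbone以及参数文件名均为示例假设。

```python
import torch

# 将ImageNet上训练得到的主干网络参数加载到第一预设模型的主干网络上
state = torch.load("imagenet_backbone.pth", map_location="cpu")
model.backbone.load_state_dict(state)

# 加载完成之后，冻结第一预设模型的主干网络
for p in model.backbone.parameters():
    p.requires_grad = False   # 预训练过程中主干网络的参数不再更新
```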
S603,每次将第二训练集中的第一数量的图像块输入第一预设模型进行训练,得到训练结果。
将第二训练集中的第一数量的图像块输入YOLOV4-tiny网络模型,示例性的,第一数量可以为16张图像块。每次输入第一数量的图像块,直到将第二训练集中的图像块全部输入YOLOV4-tiny网络模型中,根据每次输入的第一数量的图像块对模型进行训练得到每次的训练结果,并将训练结果输出。
S604,根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值。
输入的图像块的标签为人脸是否佩戴口罩的类别信息,示例性的,即佩戴口罩的类别信息为1,没有佩戴口罩的类别信息为0。图像块的训练结果为0-1之间的实数,示例性的,可以为0.12,也可以为0.9,数值接近于1时,表示该图像块的人脸目标佩戴口罩,数值接近于0时,表示该图像块的人脸目标没有佩戴口罩。将输入图像块的标签和输入图像块的训练结果的差值作为YOLO损失值。
S605,根据YOLO损失值进行反向传播,得到第一预设模型的剩余部分网络的更新参数。
根据YOLO损失值进行反向传播即求YOLO损失值对模型各参数的梯度,之后通过梯度下降法利用该梯度更新第一预设模型中剩余部分网络模型的参数。梯度下降是迭代法的一种,简单来说是一种寻找目标函数最小化的方法,即在求解损失函数的最小值时,可以通过梯度下降法来一步步的迭代求解,得到最小化的损失函数和模型参数值。
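步骤S603至S606的一次迭代可以用如下Python（PyTorch）示意代码概括：每次输入第一数量（如16张）的图像块，由标签与训练结果确定YOLO损失值，反向传播求梯度后，仅用梯度下降法更新剩余部分网络的参数；其中yolo_loss、model.rest、train_loader等名称均为示例假设。

```python
import torch

# 仅对剩余部分网络的参数做梯度下降（主干网络已冻结），学习率为示例假设
optimizer = torch.optim.SGD(model.rest.parameters(), lr=1e-3)

for blocks, labels in train_loader:    # 每次取第一数量（如16张）图像块及其标签
    preds = model(blocks)              # 训练结果：表示图像块中人脸是否佩戴口罩
    loss = yolo_loss(preds, labels)    # 根据标签与训练结果确定YOLO损失值（示例假设的函数）
    optimizer.zero_grad()
    loss.backward()                    # 反向传播：求YOLO损失值对各参数的梯度
    optimizer.step()                   # 使用更新参数更新剩余部分网络的参数
```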
S606,使用更新参数更新第一预设模型的剩余部分网络的参数。
S607,判断迭代条件是否满足。
每次更新完成后,判断迭代条件是否满足,当迭代条件满足时,执行步骤S608,当迭代条件不满足时,返回执行步骤S603。
该迭代条件例如为完成预设次数的迭代训练,该预设次数可以为100或120等,其中,一次迭代训练是指将第二训练集中的全部图像块进行一次训练。
S608、将训练得到的模型作为预训练模型。
通过步骤S603-S608迭代更新第一预设模型的剩余部分网络的参数，直至迭代条件满足，则将训练得到的模型作为预训练模型。
将预训练结束后输出的模型各个参数依次填入第一预设网络模型的剩余部分网络的对应位置,用于对第二预设网络模型进行正式训练。
本实施例中,将ImageNet数据集上训练得到的参数加载到第一预设网络模型的主干网络上,并将主干网络冻结。利用第二训练集的数据对第一预设网络模型进行预训练,并根据输入图像块的标签和输入图像块的训练结果的差值得到YOLO损失值,将YOLO损失值进行反向传播得到YOLO损失值对模型参数的梯度,之后通过梯度下降法利用该梯度更新模型参数,通过迭代得到预训练模型。该方法通过对第一预设网络模型进行预训练,可以使第一预设网络模型的参数置为一个较好的状态,从而加速正式训练中第二预设网络模型的收敛。
图7为本申请实施例三提供的一种口罩识别模型的正式训练方法的流程图。该实施例是对实施例一中步骤S304的详细说明。如图7所示,该口罩识别模型的正式训练方法包括以下步骤。
S701,将预训练模型中主干网络的参数加载到第二预设模型的主干网络上。
将实施例二中得到的预训练模型中主干网络上的参数依次填入第二预设模型中主干网络对应的位置。
S702,加载完成之后,冻结第二预设模型的主干网络。
冻结第二预设模型中的主干网络，即在对第二预设模型剩余部分网络进行训练时，保证其主干网络的参数不被更改。
S703,将预训练模型的剩余部分网络的参数加载到第二预设模型的剩余部分网络上。
将实施例二中得到的预训练模型的剩余部分网络的各个参数依次填入第二预设模型中剩余部分网络的对应位置上,在预训练得到的参数的基础上对第二预设模型进行正式训练,有利于加速正式训练中第二预设网络模型的收敛,从而提高训练的效率。
S704,每次将第四训练集中的第一数量的图像块输入第二预设模型进行训练,得到训练结果。
将第四训练集中的第一数量的图像块输入第二预设网络模型,具体的输入图像块的第一数量及方法与实施例二中步骤S603相同,这里不再赘述。
S705,根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值。
输入图像块的标签信息与实施例二中步骤S604中图像块的信息相同，这里不再赘述。将输入图像块的标签和输入图像块的训练结果的差值作为YOLO损失值。
S706,根据YOLO损失值进行反向传播,得到第二预设模型的剩余部分网络的更新参数。
根据YOLO损失值进行反向传播即求YOLO损失值对模型各参数的梯度，之后通过梯度下降法进行迭代，利用该梯度更新模型参数。在进行模型正式训练的过程中，需要根据具体的训练指标要求设定训练阶段的相关参数，这些参数包括学习率、迭代次数和衰减策略等，通过人为调节这些相关参数，可以加速模型的训练，同时使训练得到的模型参数更优，进一步可以提高模型识别的准确率。
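关于训练阶段相关参数（学习率、迭代次数、衰减策略）的设定，下面给出一个Python（PyTorch）示意片段；其中的数值与train_one_epoch函数均为示例假设。

```python
import torch

# 正式训练阶段相关参数的示意设置，数值均为示例假设
optimizer = torch.optim.SGD(model.rest.parameters(), lr=1e-2, momentum=0.9)          # 学习率
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)      # 衰减策略
num_epochs = 120                                                                      # 迭代次数

for epoch in range(num_epochs):
    train_one_epoch(model, optimizer)   # train_one_epoch为示例假设的单轮训练函数
    scheduler.step()                    # 按衰减策略调整学习率
```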
S707,使用更新参数更新第二预设模型的剩余部分网络的参数。
S708,判断迭代条件是否满足。
每次更新完成后,判断迭代条件是否满足,当迭代条件满足时,执行步骤S709,当迭代条件不满足时,返回执行步骤S704。
该迭代条件例如为YOLO损失值减小到预设的YOLO损失值,且YOLO损失值在连续多次训练过程中不再明显变化,其中,YOLO损失值的变化可以通过方差判断。具体的,利用梯度下降法进行迭代训练时,训练期间YOLO损失值不断减小,当YOLO损失值减小到一个较低水平且不再明显减小即得到最小YOLO损失值,此时迭代完成,停止训练。
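上述迭代条件（YOLO损失值降到预设水平且在连续多次训练中通过方差判断不再明显变化）可以用如下Python示意函数表达；其中的各阈值均为示例假设。

```python
import statistics

def loss_plateaued(recent_losses, target=0.5, var_eps=1e-4):
    # 判断迭代条件：损失已降到预设水平，且最近若干次训练的损失方差足够小（不再明显变化）
    if len(recent_losses) < 5:
        return False
    small_enough = recent_losses[-1] <= target
    stable = statistics.variance(recent_losses[-5:]) < var_eps
    return small_enough and stable
```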
S709,将正式训练中得到的各个模型参数依次填入第二预设模型中剩余部分网络的对应位置,得到口罩识别模型。
本实施例中,在预训练模型的基础上,将第二预设模型中的主干网络冻结,其中主干网络的参数与预训练模型中的相同。利用第四训练集的数据对YOLOV4-tiny网络模型进行正式训练,并根据输入图像块的标签和输入图像块的训练结果的差值得到YOLO损失值,将YOLO损失值进行反向传播得到YOLO损失值对模型参数的梯度,之后通过梯度下降法进行迭代,利用该梯度更新模型参数。在训练过程中,YOLO损失值不断减小,当YOLO损失值减小到一个较低水平且不再明显减小即得到最小YOLO损失值时,停止训练得到口罩识别模型。该方法通过对YOLOV4-tiny网络模型的正式训练,可以提高对目标检测对象的识别准确率。
图8为本申请实施例四提供的一种口罩识别方法的流程图，本实施例的方法使用实施例一、实施例二、实施例三中训练得到的口罩识别模型，如图8所示，该口罩识别方法包括以下步骤。
S801,将待识别图片切割成多个图像块,并确定各图像块在该待识别图片中的位置信息,该待识别图片中包含至少一个人脸目标。
待识别图片为从监控视频中获取的图像,具体的,通过分帧的方法将实时获取的监控视频转换为逐帧图像。从监控视频中获取的一个图像中包含至少一个人脸目标,示例性的,可以为1个人脸目标,也可以为3个人脸目标。
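从监控视频分帧得到逐帧待识别图片的过程，可以用如下基于OpenCV的Python示意代码表示；其中视频流地址为示例假设，cut_into_blocks为前文给出的示意切割函数。

```python
import cv2

# 通过分帧把实时监控视频转换为逐帧图像作为待识别图片（视频流地址为示例假设）
cap = cv2.VideoCapture("rtsp://camera/stream")
while True:
    ok, frame = cap.read()            # 逐帧读取
    if not ok:
        break
    blocks = cut_into_blocks(frame)   # 按前文方式切割成多个图像块，再送入口罩识别模型
cap.release()
```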
将待识别图片切割成多个大小相同的图像块,其具体的切割方法与实施例一中的切割方法相同,这里不再赘述。
确定各图像块在待识别图片中的位置信息,其方法与实施例一中的方法相同,这里不再赘述。
S802,将多个图像块输入口罩识别模型,得到每个图像块的第一识别结果,该第一识别结果用于表示图像块中的人脸是否佩戴口罩。
将多个图像块输入口罩识别模型,该图像块的个数为根据待识别图片切割得到的图像块的个数。
通过口罩识别模型得到的各个图像块的第一识别结果为图像块中人脸是否佩戴口罩的类别信息,其第一识别结果为0-1之间的实数,当第一识别结果值接近于1时,表示该图像块中的人脸目标有佩戴口罩,当第一识别结果值接近于0时,表示该图像块中的人脸目标没有佩戴口罩。
S803,当待识别图片中的同一人脸目标存在于不同的图像块时,计算人脸目标所在的各图像块的置信度,并选取置信度最大的图像块的第一识别结果作为该人脸目标的识别结果。
当待识别图片中的同一人脸目标存在于不同的图像块时,根据该人脸目标所在的各图像块在待识别图片中的位置信息,以及各图像块中的人脸目标检测框计算各图像块的置信度,具体的,根据人脸目标所在的各图像块在待识别图片中的位置信息,将各图像块恢复到待识别图片中,对于每个图像块,计算图像块中人脸目标检测框与待识别图片中人脸目标检测框的比值,并将计算得到的比值作为每个图像块的置信度。
通过计算具有同一人脸目标的各图像块的置信度,并将置信度最大的图像块的第一识别结果作为该人脸目标的识别结果。
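上述置信度的计算与选取可以用如下Python示意函数表达：先按位置信息把图像块中的人脸检测框恢复到待识别图片坐标系，再以其与待识别图片中人脸检测框的面积比值作为该图像块的置信度；检测框的表示方式以及将“比值”理解为面积比均为示例假设。

```python
def block_confidence(block_box, block_offset, full_box):
    # block_box: 图像块坐标系下的人脸目标检测框(x1, y1, x2, y2)
    # block_offset: 该图像块在待识别图片中的左上角偏移(x0, y0)
    # full_box: 待识别图片坐标系下的人脸目标检测框(x1, y1, x2, y2)
    ox, oy = block_offset
    x1, y1, x2, y2 = block_box
    restored = (x1 + ox, y1 + oy, x2 + ox, y2 + oy)   # 将检测框恢复到待识别图片中
    area = lambda b: max(b[2] - b[0], 0) * max(b[3] - b[1], 0)
    return area(restored) / max(area(full_box), 1e-6)

# 选取置信度最大的图像块的第一识别结果作为该人脸目标的识别结果（candidates为示例假设的结构）
# best = max(candidates, key=lambda c: block_confidence(c["box"], c["offset"], full_box))
```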
S804,输出待识别图片中人脸目标的识别结果。
待识别图片中人脸目标的识别结果即人脸目标对应的图像块的第一识别结果，当待识别图片中有多个人脸目标时，根据人脸目标所在图像块的第一识别结果可以分别得到每个人脸目标的识别结果。
本实施例中，将从监控视频中获取的图像作为待识别图片，并将该待识别图片切割成多个大小相同的图像块，然后将图像块输入到口罩识别模型中得到每个图像块的第一识别结果，根据图像块的第一识别结果得到待识别图片中人脸目标是否佩戴口罩的识别结果。其中，当待识别图片中的同一人脸目标存在于不同的图像块时，计算每个图像块的置信度并选取置信度最大的图像块的第一识别结果用于表示图像块中的人脸是否佩戴口罩。该方法使用本申请的口罩识别模型对目标对象是否佩戴口罩进行识别检查，根据识别结果进一步验证了识别模型的准确性。
图9为本申请实施例五提供的一种口罩识别模型的训练装置的结构示意图。如图9所示,该口罩识别模型的训练装置90包括:第一切割模块901,第二切割模块902,预训练模块903,正式训练模块904。
其中,第一切割模块901,用于将第一训练集中的各图片分别切割成多个图像块,并对图像块进行标签标注,得到第二训练集,第一训练集为第一人脸口罩数据集,第二训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息;
第二切割模块902,用于将第三训练集中的各图片分别切割成多个图像块,并对图像块进行标签标注,得到第四训练集,第三训练集为第二人脸口罩数据集,第四训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息;
预训练模块903,用于使用第二训练集对第一预设模型进行预训练,得到预训练模型,该第一预设模型包括主干网络和剩余部分网络,在预训练过程中第一预设模型的主干网络的参数被冻结;
正式训练模块904，用于使用第四训练集和第二预设模型进行正式训练，得到口罩识别模型，该第二预设模型包括主干网络和剩余部分网络，第二预设模型的主干网络的参数为预训练模型的主干网络的参数，第二预设模型的剩余部分网络的初始参数为预训练模型的剩余部分网络的参数，在正式训练过程中第二预设模型的主干网络的参数被冻结。
一种可能的实现方式中,第一预设模型采用YOLOV4-tiny网络模型,预训练模块903,具体用于:
将ImageNet数据集上训练得到的参数加载到第一预设模型的主干网络上;
加载完成之后,冻结第一预设模型的主干网络;
使用如下迭代过程更新第一预设模型的剩余部分网络的参数,直至迭代条件满足,则将训练得到的模型作为预训练模型:
每次将第二训练集中的第一数量的图像块输入第一预设模型进行训练,得到训练结果;
根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
根据YOLO损失值进行反向传播,得到第一预设模型的剩余部分网络的更新参数;
使用更新参数更新第一预设模型的剩余部分网络的参数。
一种可能的实现方式中,第二预设模型采用YOLOV4-tiny网络模型,正式训练模块904,具体用于:
将预训练模型的主干网络的参数加载到第二预设模型的主干网络上;
加载完成之后,冻结第二预设模型的主干网络;
将预训练模型的剩余部分网络的参数加载到第二预设模型的剩余部分网络上;
使用如下迭代过程更新第二预设模型的剩余部分网络的参数,直至迭代条件满足,则将训练得到的模型作为口罩识别模型:
每次将第四训练集中的第一数量的图像块输入第二预设模型进行训练,得到训练结果;
根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
根据YOLO损失值进行反向传播，得到第二预设模型的剩余部分网络的更新参数；
使用更新参数更新第二预设模型的剩余部分网络的参数。
一种可能的实现方式中,第一切割模块901和第二切割模块902,具体用于:
从图片的任意一个角开始,以预设像素滑动图像框,将图像框内的图像切割下来形成图像块,其中,预设像素的大小小于图像框的长度和宽度;
根据图片标有的图片框坐标和滑动图像框的大小得到图像块的框的坐标。
一种可能的实现方式中,第一切割模块901和第二切割模块902,还包括:
增强单元，用于对第一训练集和第三训练集中的图片进行数据增强处理，该数据增强处理包括以下处理中的一种或者多种：随机调整图片大小、随机调整图片对比度、随机调整图片色调、随机调整图片亮度、随机为图片添加噪声、随机改变图片色彩模型、随机裁剪图片。
一种可能的实现方式中,第一切割模块901或第二切割模块902中图像块的大小为416*416像素,预训练模块或正式训练模块中主干网络包括6个串联的跨阶段部分CSP网络,每个CSP网络用于对输入的图像进行特征提取;
主干网络中的目标CSP网络与CAT模块的输入连接,该CAT模块的输出端与剩余部分网络连接,目标CSP网络提取的特征图为26*26像素以及13*13像素,CAT模块用于连接目标CSP网络提取的特征图。
一种可能的实现方式中,第一人脸口罩数据集中部分图片的人脸目标的占比大于预设阈值,剩余部分图片的人脸目标的占比小于等于预设阈值;
第二人脸口罩数据集中的各图片包括的人脸目标的占比大于预设阈值。
本实施例提供的装置可用于执行上述实施例一、实施例二或实施例三的方法步骤,具体实现方式和技术效果类似,这里不再赘述。
图10为本申请实施例六提供的一种口罩识别装置的结构示意图。如图10所示,该口罩识别装置10包括:切割模块110,输入模块120,计算模块130,输出模块140。
其中,切割模块110,用于将待识别图片切割成多个图像块,并确定各图像块在待识别图片中的位置信息,该待识别图片中包含至少一个人脸目标;
输入模块120,用于将多个图像块输入口罩识别模型,得到每个图像块的第一识别结果,该第一识别结果用于表示图像块中的人脸是否佩戴口罩;
计算模块130,当待识别图片中的同一人脸目标存在于不同的图像块时,根据人脸目标所在的各图像块在待识别图片中的位置信息,以及各图像块中的人脸目标检测框,计算人脸目标所在的各图像块的置信度,并选取置信度最大的图像块的第一识别结果作为人脸目标的识别结果;
输出模块140,用于输出待识别图片中人脸目标的识别结果。
本实施例提供的装置可用于执行上述实施例四的方法步骤,具体实现方式和技术效果类似,这里不再赘述。
图11为本申请实施例七提供的一种口罩识别的电子设备11,包括:
至少一个处理器111;以及
与至少一个处理器111通信连接的存储器112;其中,
存储器112存储有可被至少一个处理器111执行的指令,指令被至少一个处理器111执行,以使至少一个处理器111能够执行如上所述的口罩识别模型的训练方法。
处理器111的具体实现过程可参见上述方法实施例,具体实现方式和技术效果类似,这里不再赘述。
本申请实施例八提供一种计算机可读存储介质,计算机可读存储介质中存储有计算机执行指令,计算机执行指令被处理器执行时用于实现如上所述方法实施例中的方法步骤,具体实现方式和技术效果类似,这里不再赘述。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本申请的真正范围和精神由下面的权利要求书指出。
应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求书来限制。

Claims (16)

  1. 一种口罩识别模型的训练方法,其特征在于,包括:
    将第一训练集中的各图片分别切割成多个图像块,并对所述图像块进行标签标注,得到第二训练集,所述第一训练集为第一人脸口罩数据集,所述第二训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息;
    将第三训练集中的各图片分别切割成多个图像块,并对所述图像块进行标签标注,得到第四训练集,所述第三训练集为第二人脸口罩数据集,所述第四训练集中的图像块的标签包括图像块在图片中的位置信息和图像块中人脸是否佩戴口罩的类别信息;
    使用所述第二训练集对第一预设模型进行预训练,得到预训练模型,所述第一预设模型包括主干网络和剩余部分网络,在预训练过程中所述第一预设模型的主干网络的参数被冻结;
    使用所述第四训练集和第二预设模型进行正式训练，得到口罩识别模型，所述第二预设模型包括主干网络和剩余部分网络，所述第二预设模型的主干网络的参数为所述预训练模型的主干网络的参数，所述第二预设模型的剩余部分网络的初始参数为所述预训练模型的剩余部分网络的参数，在正式训练过程中所述第二预设模型的主干网络的参数被冻结。
  2. 根据权利要求1所述的方法,其特征在于,所述第一预设模型采用YOLOV4-tiny网络模型,所述使用所述第二训练集对第一预设模型进行预训练,得到预训练模型,包括:
    将ImageNet数据集上训练得到的参数加载到所述第一预设模型的主干网络上;
    加载完成之后,冻结所述第一预设模型的主干网络;
    使用如下迭代过程更新所述第一预设模型的剩余部分网络的参数,直至迭代条件满足,则将训练得到的模型作为所述预训练模型:
    每次将所述第二训练集中的第一数量的图像块输入所述第一预设模型进行训练,得到训练结果;
    根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
    根据所述YOLO损失值进行反向传播,得到所述第一预设模型的剩余部分网络的更新参数;
    使用所述更新参数更新所述第一预设模型的剩余部分网络的参数。
  3. 根据权利要求2所述的方法,其特征在于,所述第二预设模型采用YOLOV4-tiny网络模型,所述使用所述第四训练集和第二预设模型进行正式训练,得到口罩识别模型,包括:
    将所述预训练模型的主干网络的参数加载到所述第二预设模型的主干网络上;
    加载完成之后,冻结所述第二预设模型的主干网络;
    将所述预训练模型的剩余部分网络的参数加载到所述第二预设模型的剩余部分网络上;
    使用如下迭代过程更新所述第二预设模型的剩余部分网络的参数,直至迭代条件满足,则将训练得到的模型作为所述口罩识别模型:
    每次将所述第四训练集中的第一数量的图像块输入所述第二预设模型进行训练,得到训练结果;
    根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
    根据所述YOLO损失值进行反向传播，得到所述第二预设模型的剩余部分网络的更新参数；
    使用所述更新参数更新所述第二预设模型的剩余部分网络的参数。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,将所述第一训练集和所述第三训练集中的各图片分别切割成多个图像块,包括:
    从所述图片的任意一个角开始,以预设像素滑动图像框,将所述图像框内的图像切割下来形成所述图像块,其中,所述预设像素的大小小于所述图像框的长度和宽度;
    根据所述图片标有的图片框的坐标和所述滑动图像框的大小得到所述图像块的框的坐标。
  5. 根据权利要求4所述的方法,其特征在于,将所述第一训练集和所述第三训练集中的各图片分别切割成多个图像块之前,还包括:
    对所述第一训练集和所述第三训练集中的图片进行数据增强处理,所述数据增强处理包括以下处理中的一种或者多种:随机调整图片大小、随机调整图片对比度、随机调整图片色调、随机调整图片亮度、随机为图片添加噪声、随机改变图片色彩模型、随机裁剪图片。
  6. 根据权利要求2或3任一项所述的方法，其特征在于，所述图像块的大小为416*416像素，所述主干网络包括6个串联的跨阶段部分CSP网络，每个所述CSP网络用于对输入的图像进行特征提取；
    所述主干网络中的目标CSP网络与CAT模块的输入连接,所述CAT模块的输出端与所述剩余部分网络连接,所述目标CSP网络提取的特征图为26*26像素以及13*13像素,所述CAT模块用于连接所述目标CSP网络提取的特征图。
  7. 根据权利要求1-3任一项所述的方法,其特征在于,所述第一人脸口罩数据集中部分图片的人脸目标的占比大于预设阈值,剩余部分图片的人脸目标的占比小于等于所述预设阈值;
    所述第二人脸口罩数据集中的各图片包括的人脸目标的占比大于所述预设阈值。
  8. 一种口罩识别方法,其特征在于,应用于权利要求1-7任一项所述方法训练得到的口罩识别模型,所述方法包括:
    将待识别图片切割成多个图像块,并确定各所述图像块在所述待识别图片中的位置信息,所述待识别图片中包含至少一个人脸目标;
    将所述多个图像块输入所述口罩识别模型,得到每个图像块的第一识别结果,所述第一识别结果用于表示所述图像块中的人脸是否佩戴口罩;
    当所述待识别图片中的同一人脸目标存在于不同的所述图像块时,根据所述人脸目标所在的各所述图像块在所述待识别图片中的位置信息,以及各所述图像块中的人脸目标检测框,计算所述人脸目标所在的各所述图像块的置信度,并选取置信度最大的图像块的第一识别结果作为所述人脸目标的识别结果;
    输出所述待识别图片中人脸目标的识别结果。
  9. 根据权利要求8所述的方法,其特征在于,所述根据所述人脸目标所在的各所述图像块在所述待识别图片中的位置信息,以及各所述图像块中的人脸目标检测框,计算所述人脸目标所在的各所述图像块的置信度,包括:
    根据所述人脸目标所在的各所述图像块在所述待识别图片中的位置信息,将各所述图像块恢复到所述待识别图片中;
    对于每个图像块,计算所述图像块中人脸目标检测框与所述待识别图片中人脸目标检测框的比值;
    将计算得到的比值作为所述图像块的置信度。
  10. 一种口罩识别模型的训练装置,其特征在于,包括:
    第一切割模块，用于将第一训练集中的各图片分别切割成多个图像块，并对所述图像块进行标签标注，得到第二训练集，所述第一训练集为第一人脸口罩数据集，所述图片标有图片框的坐标和人脸是否佩戴口罩的类别信息，所述第二训练集中的图像块的标签用于表示图像块的框的坐标和人脸是否佩戴口罩的类别信息；
    第二切割模块,用于将第三训练集中的各图片分别切割成多个图像块,并对所述图像块进行标签标注,得到第四训练集,所述第三训练集为第二人脸口罩数据集,所述图片标有图片框的坐标和人脸是否佩戴口罩的类别信息,所述第四训练集中的图像块的标签用于表示图像块的框的坐标和人脸是否佩戴口罩的类别信息;
    预训练模块,用于使用所述第二训练集对第一预设模型进行预训练,得到预训练模型,所述第一预设模型包括主干网络和剩余部分网络,在预训练过程中所述第一预设模型的主干网络的参数被冻结;
    正式训练模块，用于使用所述第四训练集和第二预设模型进行正式训练，得到口罩识别模型，所述第二预设模型包括主干网络和剩余部分网络，所述第二预设模型的主干网络的参数为所述预训练模型的主干网络的参数，所述第二预设模型的剩余部分网络的初始参数为所述预训练模型的剩余部分网络的参数，在正式训练过程中所述第二预设模型的主干网络的参数被冻结。
  11. 根据权利要求10所述的装置,其特征在于,所述第一预设模型采用YOLOV4-tiny网络模型,所述预训练模块,具体用于:
    将ImageNet数据集上训练得到的参数加载到所述第一预设模型的主干网络上;
    加载完成之后,冻结所述第一预设模型的主干网络;
    使用如下迭代过程更新所述第一预设模型的剩余部分网络的参数,直至迭代条件满足,则将训练得到的模型作为所述预训练模型:
    每次将所述第二训练集中的第一数量的图像块输入所述第一预设模型进行训练,得到训练结果;
    根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
    根据所述YOLO损失值进行反向传播,得到所述第一预设模型的剩余部分网络的更新参数;
    使用所述更新参数更新所述第一预设模型的剩余部分网络的参数。
  12. 根据权利要求11所述的装置,其特征在于,所述第二预设模型采用YOLOV4-tiny网络模型,所述正式训练模块,具体用于:
    将所述预训练模型的主干网络的参数加载到所述第二预设模型的主干网络上;
    加载完成之后,冻结所述第二预设模型的主干网络;
    将所述预训练模型的剩余部分网络的参数加载到所述第二预设模型的剩余部分网络上;
    使用如下迭代过程更新所述第二预设模型的剩余部分网络的参数,直至迭代条件满足,则将训练得到的模型作为所述口罩识别模型:
    每次将所述第四训练集中的第一数量的图像块输入所述第二预设模型进行训练,得到训练结果;
    根据输入的图像块的标签以及输入的图像块的训练结果,确定YOLO损失值;
    根据所述YOLO损失值进行反向传播，得到所述第二预设模型的剩余部分网络的更新参数；
    使用所述更新参数更新所述第二预设模型的剩余部分网络的参数。
  13. 根据权利要求10所述的装置,其特征在于,所述第一切割模块和所述第二切割模块,具体用于:
    从所述图片的任意一角开始,以预设像素滑动图像框,将所述图像框内的图像切割下来形成所述图像块,其中,所述预设像素的大小小于所述图像框的长度和宽度;
    根据所述图片标有的图片框的坐标和所述滑动图像框的大小得到所述图像块的框的坐标。
  14. 一种口罩识别装置,其特征在于,应用于权利要求1-7任一项所述方法训练得到的口罩识别模型,所述装置包括:
    切割模块,用于将待识别图片切割成多个图像块,并确定各所述图像块在所述待识别图片中的位置信息,所述待识别图片中包含至少一个人脸目标;
    输入模块,用于将所述多个图像块输入所述口罩识别模型,得到每个图像块的第一识别结果,所述第一识别结果用于表示所述图像块中的人脸是否佩戴口罩;
    计算模块,用于当所述待识别图片中的同一人脸目标存在于不同的所述图像块时,根据所述人脸目标所在的各所述图像块在所述待识别图片中的位置信息,以及各所述图像块中的人脸目标检测框,计算所述人脸目标所在的各所述图像块的置信度,并选取置信度最大的图像块的第一识别结果作为所述人脸目标的识别结果;
    输出模块,用于输出所述待识别图片中人脸目标的识别结果。
  15. 一种口罩识别的电子设备,其特征在于,包括:
    至少一个处理器;以及
    与所述至少一个处理器通信连接的存储器;其中,
    所述存储器存储有可被所述至少一个处理器执行的指令,所述指令被所述至少一个处理器执行,以使所述至少一个处理器能够执行权利要求1至9中任一项所述的方法。
  16. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有计算机执行指令,所述计算机执行指令被处理器执行时用于实现如权利要求1至9任一项所述的方法。
PCT/CN2023/080248 2022-05-20 2023-03-08 口罩识别模型的训练方法、装置、设备及存储介质 WO2023221608A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210549746.2 2022-05-20
CN202210549746.2A CN114898434A (zh) 2022-05-20 2022-05-20 口罩识别模型的训练方法、装置、设备及存储介质

Publications (1)

Publication Number Publication Date
WO2023221608A1 true WO2023221608A1 (zh) 2023-11-23

Family

ID=82724758

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/080248 WO2023221608A1 (zh) 2022-05-20 2023-03-08 口罩识别模型的训练方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN114898434A (zh)
WO (1) WO2023221608A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117830304A (zh) * 2024-03-04 2024-04-05 浙江华是科技股份有限公司 一种水雾船舶检测方法、系统及计算机存储介质
CN117830304B (zh) * 2024-03-04 2024-05-24 浙江华是科技股份有限公司 一种水雾船舶检测方法、系统及计算机存储介质

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114898434A (zh) * 2022-05-20 2022-08-12 卡奥斯工业智能研究院(青岛)有限公司 口罩识别模型的训练方法、装置、设备及存储介质
CN116052094B (zh) * 2023-03-07 2023-06-09 浙江华是科技股份有限公司 一种船舶检测方法、系统及计算机存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113989179A (zh) * 2021-09-08 2022-01-28 湖南工业大学 基于目标检测算法的列车轮对踏面缺陷检测方法及系统
CN114283469A (zh) * 2021-12-14 2022-04-05 贵州大学 一种基于改进YOLOv4-tiny的轻量型目标检测方法及系统
CN114283462A (zh) * 2021-11-08 2022-04-05 上海应用技术大学 口罩佩戴检测方法及系统
CN114898434A (zh) * 2022-05-20 2022-08-12 卡奥斯工业智能研究院(青岛)有限公司 口罩识别模型的训练方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN114898434A (zh) 2022-08-12

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23806568

Country of ref document: EP

Kind code of ref document: A1