CN113158733B - Image filtering method and device, electronic equipment and storage medium - Google Patents

Image filtering method and device, electronic equipment and storage medium

Info

Publication number
CN113158733B
Authority
CN
China
Prior art keywords
mask
human body
human
image
filtered
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011642690.2A
Other languages
Chinese (zh)
Other versions
CN113158733A (en)
Inventor
陈浩彬
刘凯鉴
余世杰
陈大鹏
赵瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN202011642690.2A
Publication of CN113158733A
Application granted
Publication of CN113158733B
Legal status: Active

Links

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 - Detection; Localisation; Normalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application disclose an image filtering method and apparatus, an electronic device, and a storage medium. The method includes: performing human body detection on an image to be filtered to obtain N human body boxes and N human body mask maps, where the N human body boxes correspond one-to-one to the N human body mask maps and N is an integer greater than or equal to 1; and filtering the image to be filtered according to the N human body boxes and the N human body mask maps.

Description

Image filtering method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image recognition technologies, and in particular, to an image filtering method and apparatus, an electronic device, and a storage medium.
Background
Pedestrian re-identification, face and human body recognition, and face and human body clustering are mostly implemented with deep learning techniques, and deep learning depends heavily on the quality of the training sample set.
At present, when constructing a training sample set, human body and face detection is typically performed on a captured high-resolution image, an image containing a human body is then cropped from the high-resolution image and annotated, and the annotated image is finally used as a training sample.
Owing to factors such as shooting angle and lighting, the cropped image may contain more than one human body, so the annotation precision is low and the quality of the constructed training samples is poor.
Disclosure of Invention
The embodiments of the present application provide an image filtering method and apparatus, an electronic device, and a storage medium, which filter images before training samples are constructed and thereby improve the quality of the training samples.
In a first aspect, an embodiment of the present application provides an image filtering method, including:
performing human body detection on an image to be filtered to obtain N human body boxes and N human body mask maps, where the N human body boxes correspond one-to-one to the N human body mask maps and N is an integer greater than or equal to 1;
and filtering the image to be filtered according to the N human body boxes and the N human body mask maps.
In some possible implementations, filtering the image to be filtered according to the N human body boxes and the N human body mask maps includes:
acquiring the confidence of each of the N human body boxes;
performing non-maximum suppression on the N human body boxes according to the confidence of each human body box and the human body mask map corresponding to each human body box, to obtain K human body boxes, where K is an integer less than or equal to N;
and filtering the image to be filtered according to the K human body boxes and the K human body mask maps corresponding to the K human body boxes.
In some possible embodiments, performing non-maximum suppression on the N human body boxes according to the confidence of each human body box and the human body mask map corresponding to each human body box, to obtain K human body boxes, includes:
sorting the N human body boxes in descending order of confidence to obtain a first set;
moving a first human body box out of the first set to obtain a second set, where the first human body box is the human body box with the highest confidence among the N human body boxes;
determining a first intersection-over-union (IoU) between a first human body mask map and a second human body mask map, where the first human body mask map is the mask map corresponding to the first human body box, the second human body mask map is the mask map corresponding to a second human body box, and the second human body box is any human body box in the second set;
deleting the second human body box from the second set to obtain a third set if the first IoU is greater than a first threshold, or keeping the second human body box in the second set and taking the second set as the third set if the first IoU is less than or equal to the first threshold;
and taking the third set as a new first set and performing non-maximum suppression on the human body boxes in the new first set; when the number of elements in the new first set is zero, the human body boxes moved out during the non-maximum suppression of the N human body boxes are taken as the K human body boxes.
In some possible implementations, determining the first IoU between the first human body mask map and the second human body mask map includes:
summing the pixel values of all pixels in the intersection region between the first human body mask map and the second human body mask map to obtain a first mask area;
summing the pixel values of all pixels in the first human body mask map to obtain a second mask area;
summing the pixel values of all pixels in the second human body mask map to obtain a third mask area;
and determining the first IoU between the first human body mask map and the second human body mask map according to the first mask area, the second mask area, and the third mask area.
In some possible implementations, determining the first IoU between the first human body mask map and the second human body mask map according to the first mask area, the second mask area, and the third mask area includes:
determining a first ratio of the first mask area to the second mask area;
determining a second ratio of the first mask area to the third mask area;
and taking the larger of the first ratio and the second ratio as the first IoU.
In some possible implementations, filtering the image to be filtered according to the K human body boxes and the K human body mask maps corresponding to the K human body boxes includes:
summing the pixel values of all pixels in each of the K human body mask maps to obtain the mask area corresponding to each of the K human body mask maps;
determining a third ratio of the mask area of a third human body mask map to the mask area of a fourth human body mask map, where the third human body mask map is the human body mask map with the largest mask area among the K human body mask maps and the fourth human body mask map is the human body mask map with the second-largest mask area among the K human body mask maps;
and retaining the image to be filtered when the third ratio is greater than or equal to a second threshold, or filtering out the image to be filtered when the third ratio is less than the second threshold.
In some possible embodiments, when the image to be filtered is retained, the method further includes:
performing face detection on the image to be filtered to obtain M face mask maps, where M is an integer greater than or equal to 0;
and constructing training samples according to the third human body mask map and the M face mask maps.
In some possible implementations, constructing training samples according to the third human body mask map and the M face mask maps includes:
determining a second IoU between each of the M face mask maps and the third human body mask map;
and constructing training samples according to the second IoU between each face mask map and the third human body mask map.
In some possible implementations, constructing training samples according to the second IoU between each face mask map and the third human body mask map includes:
discarding the image to be filtered when the number of face mask maps among the M face mask maps whose second IoU is greater than a third threshold is greater than or equal to two;
taking the image to be filtered as a first training sample when exactly one of the M face mask maps has a second IoU greater than the third threshold and that face mask map is located within a preset region of the third human body mask map;
taking the image to be filtered as a second training sample when none of the M face mask maps has a second IoU greater than the third threshold;
where the first training sample contains a face associated with the human body corresponding to the third human body mask map, and the second training sample contains no face associated with the human body corresponding to the third human body mask map.
In some possible implementations, determining the second IoU between each of the M face mask maps and the third human body mask map includes:
determining the intersection region between each of the M face mask maps and the third human body mask map, and summing the pixel values of the pixels in the intersection region to obtain a fourth mask area corresponding to each face mask map;
summing the pixel values of the pixels in each face mask map to obtain a fifth mask area corresponding to each face mask map;
and taking the ratio of the fourth mask area corresponding to each face mask map to the fifth mask area corresponding to that face mask map as the second IoU between that face mask map and the third human body mask map.
In some possible embodiments, the method further includes:
training a first neural network using the first training sample; and/or training a second neural network using the second training sample.
In some possible embodiments, performing human body detection on the image to be filtered to obtain the N human body boxes and the N human body mask maps includes:
inputting the image to be filtered into a trained neural network for human body detection, to obtain the N human body boxes and the N human body mask maps.
In some possible embodiments, before the image to be filtered is input into the trained neural network to obtain the N human body boxes and the N human body mask maps, the method further includes:
obtaining a third training sample and a fourth training sample, where the annotation precision of the third training sample for the human body is lower than that of the fourth training sample;
adjusting network parameters of a third neural network using the third training sample to obtain an adjusted third neural network;
and adjusting the network parameters of the adjusted third neural network using the fourth training sample to obtain the trained neural network.
In a second aspect, an embodiment of the present application provides an image filtering apparatus, including:
a detection unit configured to perform human body detection on an image to be filtered to obtain N human body boxes and N human body mask maps, where the N human body boxes correspond one-to-one to the N human body mask maps and N is an integer greater than or equal to 1;
and a filtering unit configured to filter the image to be filtered according to the N human body boxes and the N human body mask maps.
In some possible implementations, in filtering the image to be filtered according to the N human body boxes and the N human body mask maps, the filtering unit is specifically configured to:
acquire the confidence of each of the N human body boxes;
perform non-maximum suppression on the N human body boxes according to the confidence of each human body box and the human body mask map corresponding to each human body box, to obtain K human body boxes, where K is an integer less than or equal to N;
and filter the image to be filtered according to the K human body boxes and the K human body mask maps corresponding to the K human body boxes.
In some possible implementations, in performing non-maximum suppression on the N human body boxes according to the confidence of each human body box and the human body mask map corresponding to each human body box to obtain K human body boxes, the filtering unit is specifically configured to:
sort the N human body boxes in descending order of confidence to obtain a first set;
move a first human body box out of the first set to obtain a second set, where the first human body box is the human body box with the highest confidence among the N human body boxes;
determine a first IoU between a first human body mask map and a second human body mask map, where the first human body mask map is the mask map corresponding to the first human body box, the second human body mask map is the mask map corresponding to a second human body box, and the second human body box is any human body box in the second set;
delete the second human body box from the second set to obtain a third set if the first IoU is greater than a first threshold, or keep the second human body box in the second set and take the second set as the third set if the first IoU is less than or equal to the first threshold;
and take the third set as a new first set and perform non-maximum suppression on the human body boxes in the new first set; when the number of elements in the new first set is zero, the human body boxes moved out during the non-maximum suppression of the N human body boxes are taken as the K human body boxes.
In some possible embodiments, in determining the first IoU between the first human body mask map and the second human body mask map, the filtering unit is specifically configured to:
sum the pixel values of all pixels in the intersection region between the first human body mask map and the second human body mask map to obtain a first mask area;
sum the pixel values of all pixels in the first human body mask map to obtain a second mask area;
sum the pixel values of all pixels in the second human body mask map to obtain a third mask area;
and determine the first IoU between the first human body mask map and the second human body mask map according to the first mask area, the second mask area, and the third mask area.
In some possible implementations, in determining the first IoU between the first human body mask map and the second human body mask map according to the first mask area, the second mask area, and the third mask area, the filtering unit is specifically configured to:
determine a first ratio of the first mask area to the second mask area;
determine a second ratio of the first mask area to the third mask area;
and take the larger of the first ratio and the second ratio as the first IoU.
In some possible implementations, in filtering the image to be filtered according to the K human body boxes and the K human body mask maps corresponding to the K human body boxes, the filtering unit is specifically configured to:
sum the pixel values of all pixels in each of the K human body mask maps to obtain the mask area corresponding to each of the K human body mask maps;
determine a third ratio of the mask area of a third human body mask map to the mask area of a fourth human body mask map, where the third human body mask map is the human body mask map with the largest mask area among the K human body mask maps and the fourth human body mask map is the human body mask map with the second-largest mask area among the K human body mask maps;
and retain the image to be filtered when the third ratio is greater than or equal to a second threshold, or filter out the image to be filtered when the third ratio is less than the second threshold.
In some possible embodiments, when the image to be filtered is retained, the apparatus further includes a sample construction unit, where the sample construction unit is configured to:
perform face detection on the image to be filtered to obtain M face mask maps, where M is an integer greater than or equal to 0;
and construct training samples according to the third human body mask map and the M face mask maps.
In some possible implementations, in constructing training samples according to the third human body mask map and the M face mask maps, the sample construction unit is specifically configured to:
determine a second IoU between each of the M face mask maps and the third human body mask map;
and construct training samples according to the second IoU between each face mask map and the third human body mask map.
In some possible implementations, in constructing training samples according to the second IoU between each face mask map and the third human body mask map, the sample construction unit is specifically configured to:
discard the image to be filtered when the number of face mask maps among the M face mask maps whose second IoU is greater than a third threshold is greater than or equal to two;
take the image to be filtered as a first training sample when exactly one of the M face mask maps has a second IoU greater than the third threshold and that face mask map is located within a preset region of the third human body mask map;
take the image to be filtered as a second training sample when none of the M face mask maps has a second IoU greater than the third threshold;
where the first training sample contains a face associated with the human body corresponding to the third human body mask map, and the second training sample contains no face associated with the human body corresponding to the third human body mask map.
In some possible implementations, in determining the second IoU between each of the M face mask maps and the third human body mask map, the sample construction unit is specifically configured to:
determine the intersection region between each of the M face mask maps and the third human body mask map, and sum the pixel values of the pixels in the intersection region to obtain a fourth mask area corresponding to each face mask map;
sum the pixel values of the pixels in each face mask map to obtain a fifth mask area corresponding to each face mask map;
and take the ratio of the fourth mask area corresponding to each face mask map to the fifth mask area corresponding to that face mask map as the second IoU between that face mask map and the third human body mask map.
In some possible embodiments, the apparatus further includes a first training unit, where the first training unit is configured to:
train a first neural network using the first training sample; and/or train a second neural network using the second training sample.
In some possible embodiments, in performing human body detection on the image to be filtered to obtain N human body boxes and N human body mask maps, the detection unit is specifically configured to:
input the image to be filtered into a trained neural network for human body detection, to obtain the N human body boxes and the N human body mask maps.
In some possible implementations, before the image to be filtered is input into the trained neural network to obtain the N human body boxes and the N human body mask maps, the apparatus further includes a second training unit, where the second training unit is configured to:
obtain a third training sample and a fourth training sample, where the annotation precision of the third training sample for the human body is lower than that of the fourth training sample;
adjust network parameters of a third neural network using the third training sample to obtain an adjusted third neural network;
and adjust the network parameters of the adjusted third neural network using the fourth training sample to obtain the trained neural network.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor connected to a memory, where the memory is configured to store a computer program and the processor is configured to execute the computer program stored in the memory, causing the electronic device to perform the method according to the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, where the computer program causes a computer to perform the method according to the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product including a non-transitory computer-readable storage medium storing a computer program, where the computer program is operable to cause a computer to perform the method according to the first aspect.
Implementing the embodiments of the present application has the following beneficial effects:
It can be seen that, in the embodiments of the present application, human body detection is first performed on the image to be filtered to obtain N human body boxes and N human body mask maps, and the image to be filtered is then filtered according to the N human body boxes and the N human body mask maps. Lower-quality images can thus be filtered out and higher-quality images retained, so that training samples are constructed from the higher-quality images and the constructed training samples are of higher quality.
Drawings
To describe the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of an image filtering method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of an image filtering method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of an image to be filtered according to an embodiment of the present application;
Fig. 4 is a schematic diagram of another image to be filtered according to an embodiment of the present application;
Fig. 5 is a schematic diagram of another image to be filtered according to an embodiment of the present application;
Fig. 6 is a schematic flowchart of neural network training according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a third training sample according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a fourth training sample according to an embodiment of the present application;
Fig. 9 is a functional block diagram of an image filtering apparatus according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. Obviously, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without inventive effort shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth" and the like in the description and in the claims of this application and in the drawings, are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor are they independent or alternative embodiments mutually exclusive with other embodiments. A person skilled in the art understands, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
To better understand the image filtering method of the embodiments of the present application, the scenario of the image filtering method is briefly described first. The image to be filtered is filtered by the image filtering method to obtain a filtering result, so that higher-quality images are screened out from the images to be filtered; this facilitates the subsequent use of the higher-quality images to construct training samples and improves the quality of the training samples.
Referring to fig. 1, fig. 1 is a schematic diagram of an application scenario of an image filtering method according to an embodiment of the present application. As shown in fig. 1, images containing human bodies are screened from a large number of original images by a human body detector, and the screened images to be filtered are transmitted to an image filtering apparatus. The image filtering apparatus then performs human body detection on the image to be filtered to obtain N human body boxes and N human body mask maps, filters the image to be filtered according to the N human body boxes and the N human body mask maps to obtain a filtering result, and constructs training samples according to the filtering result. It can be seen that, before the training samples are constructed, the image to be filtered undergoes quality filtering to obtain high-quality images, so as to improve the quality of the training samples.
It should be noted that the image filtering method of the present application may be applied to an image filtering apparatus. For example, the image filtering apparatus may be an electronic device or a server. The electronic device may include a smartphone, a tablet computer, a palmtop computer, a notebook computer, a mobile internet device, a vehicle-mounted terminal, a computer device, a wearable device, or the like. The server may be an independent physical server, a server cluster or distributed system, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, big data, and artificial intelligence platforms. Therefore, the form of the image filtering apparatus is not limited in the present application.
Referring to fig. 2, fig. 2 is a schematic flowchart of an image filtering method according to an embodiment of the present application. The method includes the following steps:
201: Perform human body detection on the image to be filtered to obtain N human body boxes and N human body mask maps, where the N human body boxes correspond one-to-one to the N human body mask maps and N is an integer greater than or equal to 1.
As shown in fig. 1, the image to be filtered is an image containing a human body that is screened from a large number of original images by a human body detector, which may be a general-purpose human body detector.
For example, object detection (i.e., human body detection) may be performed on the image to be filtered to obtain the N human body boxes, and human body segmentation may be performed within each human body box to obtain the N human body mask maps. The object detection and human body segmentation of the image to be filtered may be implemented by a trained neural network; that is, the image to be filtered is input into the trained neural network to obtain the N human body boxes and the N human body mask maps. The training process of this neural network is described in detail later and is not elaborated here.
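As an illustrative sketch only (not part of the patented method), this detection step could be prototyped with an off-the-shelf instance segmentation model. The snippet below assumes torchvision's pretrained Mask R-CNN and COCO's person label; the score threshold is an arbitrary choice.

import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Hypothetical stand-in for the trained detection/segmentation network.
model = maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_bodies(image_path, score_thresh=0.5, person_label=1):
    """Return N human body boxes, their confidences, and one-hot mask maps."""
    img = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        out = model([img])[0]
    keep = (out["labels"] == person_label) & (out["scores"] > score_thresh)
    boxes = out["boxes"][keep]                    # N x 4 human body boxes
    scores = out["scores"][keep]                  # confidence of each box
    masks = (out["masks"][keep, 0] > 0.5).byte()  # N one-hot (0/1) mask maps
    return boxes, scores, masks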
Illustratively, the human body mask map of each human body box represents the region occupied by the human body within that box. In addition, the human body mask maps referred to in this application are one-hot encoded mask maps: a pixel with value 1 in a human body mask map indicates that the pixel belongs to a human body, and a pixel with value 0 indicates that it does not. Similarly, the face mask maps referred to later are obtained by one-hot encoding in the same way as the human body mask maps, and the details are not repeated.
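Because the mask maps are one-hot, a mask area reduces to a pixel sum. A minimal numpy illustration with toy data (not from the patent):

import numpy as np

# A toy 4x4 one-hot human body mask map: 1 = human body pixel, 0 = background.
mask = np.array([[0, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 1, 1, 1],
                 [0, 0, 1, 1]], dtype=np.uint8)

# Summing pixel values counts the human body pixels, i.e. the mask area.
mask_area = int(mask.sum())  # -> 8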
202: and filtering the image to be filtered according to the N human body frames and the N human body mask patterns.
For example, according to the confidence level of the N human body frames and the N human body mask patterns, non-maximum suppression can be comprehensively performed on the N human body frames, so as to filter the image to be filtered, and thus a training sample with relatively high quality to be filtered image construction can be screened out.
It can be seen that, in the embodiment of the present application, human body detection is performed on an image to be filtered to obtain N human body frames and N human body mask patterns, and the image to be filtered is filtered according to the N human body frames and the N human body mask patterns, so that an image with lower quality can be filtered, for example, no main human body in the image is filtered, and an image with higher quality is retained, so that a training sample is constructed by using the image with higher quality, and the constructed training sample has higher quality.
The process of filtering an image to be filtered in the present application is described below with reference to the accompanying drawings.
The confidence of each of the N human body boxes is obtained during the object detection of the human bodies and is not elaborated here. Non-maximum suppression is then performed on the N human body boxes according to the confidence of each human body box and the human body mask map corresponding to each human body box, to obtain K human body boxes, where K is an integer less than or equal to N. Finally, the image to be filtered is filtered according to the K human body boxes and the K mask maps corresponding to them.
Specifically, the N human body boxes are first sorted in descending order of confidence to obtain a first set. The first human body box is then moved out of the first set to obtain a second set; that is, the remaining human body boxes other than the first human body box form the second set, where the first human body box is the box with the highest confidence among the N human body boxes. A first IoU between a first human body mask map and a second human body mask map is determined, where the first human body mask map corresponds to the first human body box, the second human body mask map corresponds to a second human body box, and the second human body box is any box in the second set. If the first IoU is greater than a first threshold, the second human body box is deleted from the second set and the remaining boxes form a third set; if the first IoU is less than or equal to the first threshold, the second human body box is kept in the second set, and the second set at this time is taken as the third set. The third set obtained after the second human body box is processed (kept or deleted) is then taken as a new first set, and the procedure applied to the first set is applied to the new first set until all elements of the new first set have been removed; that is, when the new first set contains zero elements, the human body boxes moved out during the non-maximum suppression of the N human body boxes are taken as the K human body boxes. A code sketch of this loop follows the discussion after formula (1) below.
For example, the intersection region between the first human body mask map and the second human body mask map may be determined first, and the pixel values of all pixels in that intersection region are summed to obtain a first mask area. It should be understood that, because the human body mask maps are one-hot encoded, the first mask area is essentially the number of pixels in which the human body of the first box overlaps the human body of the second box, i.e., the overlapping area of the two human bodies. Then, the pixel values of all pixels in the first human body mask map are summed to obtain a second mask area, which can be understood as the area of the human body corresponding to the first human body box, and the pixel values of all pixels in the second human body mask map are summed to obtain a third mask area, which can be understood as the area of the human body corresponding to the second human body box. Finally, the first IoU between the first human body mask map and the second human body mask map is determined based on the first mask area, the second mask area, and the third mask area.
Further, a first ratio of the first mask area to the second mask area and a second ratio of the first mask area to the third mask area are determined, and the larger of the two is taken as the first IoU. Thus, the first IoU can be expressed by formula (1):

IoU_1 = max(I / I_A, I / I_B)    (1)

where IoU_1 is the first IoU, I is the first mask area, I_A is the second mask area, I_B is the third mask area, / denotes division, and max denotes taking the maximum.
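A minimal numpy sketch of formula (1), assuming one-hot mask maps as above (an illustration, not the patent's code):

import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """First intersection ratio between two one-hot mask maps, per formula (1)."""
    inter = int(np.logical_and(mask_a, mask_b).sum())  # first mask area I
    area_a = int(mask_a.sum())                         # second mask area I_A
    area_b = int(mask_b.sum())                         # third mask area I_B
    if area_a == 0 or area_b == 0:
        return 0.0
    return max(inter / area_a, inter / area_b)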
It should be understood that conventional non-maximum suppression relies mainly on the confidence and overlap of the boxes themselves. If the threshold is set too small, the box of an adjacent person may have an overlap with the current box greater than the threshold during suppression and be deleted by mistake; if the threshold is set too large, the overlap between different boxes of the same person may fall below the threshold, so duplicate boxes cannot be deleted and the same human body may retain several boxes. In the present method, non-maximum suppression combines the confidence of the boxes with their human body mask maps (which truly reflect the position and area of the human body within each box), so the overlap measured between two boxes is the overlap between the two human bodies inside them rather than between the boxes themselves. Consequently, if two boxes cover the same human body, their overlap is very large regardless of the threshold setting, and if the boxes belong to adjacent people, their overlap is very small (even zero), again regardless of the threshold. The non-maximum suppression in the present method therefore makes the K human body boxes more accurate and complete, improving the accuracy of box screening.
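For concreteness, a sketch of the mask-guided non-maximum suppression loop described above; it reuses the mask_iou sketch from formula (1), and the list-based data layout and threshold value are assumptions:

def mask_nms(boxes, scores, masks, first_thresh=0.5):
    """Keep one box per person, judging overlap by body masks rather than boxes."""
    # Sort indices by descending confidence (the "first set").
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        top = order.pop(0)  # move the highest-confidence box out of the set
        kept.append(top)
        # Delete remaining boxes whose body masks overlap the top mask by more
        # than the first threshold; boxes of adjacent people survive this test.
        order = [i for i in order
                 if mask_iou(masks[top], masks[i]) <= first_thresh]
    # The boxes moved out during suppression are the K human body boxes.
    return kept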
For example, after the K human body boxes are obtained, the pixel values in each of the K human body mask maps are summed to obtain the mask area corresponding to each of the K human body mask maps. A third ratio of the mask area of a third human body mask map to the mask area of a fourth human body mask map is then determined, where the third human body mask map has the largest mask area among the K human body mask maps and the fourth human body mask map has the second-largest. If the third ratio is greater than or equal to a second threshold, the image to be filtered is retained; if the third ratio is less than the second threshold, the image to be filtered is filtered out.
For example, as shown in fig. 3, when the third ratio is greater than or equal to the second threshold, the area occupied in the image by the human body of the third human body mask map is much larger than that of the fourth human body mask map; in other words, the image to be filtered contains a main human body, i.e., a human body with relatively distinct features, and such an image can be used as a training sample, so it is retained. When the third ratio is less than the second threshold, as shown in fig. 4, the areas occupied by the third and fourth human body boxes in the image are close, indicating that the image contains no main human body and that the interference between human bodies is relatively large. For example, in fig. 4 the two human bodies overlap substantially, so many features of one human body are occluded and the features of the two human bodies would interfere with each other during training with such an image. Such an image to be filtered is therefore not suitable as a training sample and is filtered out.
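A compact sketch of this main-body test; the second threshold's value and the data layout (one-hot numpy masks as in the earlier sketches) are assumptions:

def has_main_body(masks, second_thresh=2.0):
    """Retain the image only if the largest body mask dominates the runner-up."""
    areas = sorted((int(m.sum()) for m in masks), reverse=True)
    if len(areas) < 2:
        return True  # a single human body trivially dominates
    third_ratio = areas[0] / max(areas[1], 1)  # guard against an empty mask
    return third_ratio >= second_thresh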
In one embodiment of the present application, when the image to be filtered is retained, the retained image may be used to construct a training sample. The process of constructing a training sample from the image to be filtered is described below with reference to the accompanying drawings.
Illustratively, face detection is performed on the image to be filtered to obtain M face mask maps, where M is an integer greater than or equal to 0. Specifically, face detection predicts M face boxes (only the prediction is needed; the face boxes need not be output), and the region of each face box is then used as a face mask map, yielding M face mask maps. In face detection, the region selected by each face box is essentially the face, so the selected region can be directly encoded as 1 to serve as the face mask map corresponding to that face box. Face detection may be implemented with common face detection networks, such as Face R-CNN or MTCNN. Training samples are then constructed according to the third human body mask map of the image to be filtered and the M face mask maps.
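A sketch of turning predicted face boxes into one-hot face mask maps; the pixel-coordinate box format (x1, y1, x2, y2) is an assumption:

import numpy as np

def boxes_to_masks(face_boxes, height, width):
    """Encode the region inside each face box as 1 and the rest as 0."""
    masks = []
    for x1, y1, x2, y2 in face_boxes:
        m = np.zeros((height, width), dtype=np.uint8)
        m[int(y1):int(y2), int(x1):int(x2)] = 1
        masks.append(m)
    return masks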
For example, a second IoU between each of the M face mask maps and the third human body mask map is determined: the intersection region between each face mask map and the third human body mask map is determined, and the pixel values of the pixels in the intersection region are summed to obtain a fourth mask area; the pixel values of the pixels in each face mask map are summed to obtain a fifth mask area corresponding to that face mask map; and the ratio of the fourth mask area to the fifth mask area is taken as the second IoU between that face mask map and the third human body mask map. Training samples are then constructed according to the second IoU between each face mask map and the third human body mask map.
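A sketch of the second IoU, i.e. the fraction of a face mask that falls inside the main body mask (illustrative only):

import numpy as np

def face_in_body_ratio(face_mask: np.ndarray, body_mask: np.ndarray) -> float:
    """Second IoU: intersection area over the face mask area."""
    fourth_area = int(np.logical_and(face_mask, body_mask).sum())
    fifth_area = int(face_mask.sum())
    return fourth_area / fifth_area if fifth_area else 0.0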
Illustratively, when the number of face mask maps among the M face mask maps whose second IoU is greater than a third threshold is greater than or equal to two, the image to be filtered is discarded, i.e., it is not suitable for constructing training samples. Specifically, a second IoU greater than the third threshold between a face mask map and the third human body mask map indicates that the corresponding face lies within the human body box of the third human body mask map; if one human body box contains two or more faces, the segmentation of the main human body in the image is not fine enough, so the image is not suitable as a training sample. When exactly one of the M face mask maps has a second IoU greater than the third threshold and that face mask map lies within a preset region of the third human body mask map, the image to be filtered is taken as a first training sample. Specifically, when the human body box of the third human body mask map contains exactly one face and that face lies within the preset region of the box, it is determined that the human body selected by that box has an associated face, that is, the human body and the face in the box are associated, and the image to be filtered is taken as a first training sample.
For example, as shown in fig. 5, although two human bodies exist in the image to be filtered, the mask area of one is far larger than that of the other, so the image is determined to contain a main human body, which occupies the upper part of the image. In addition, face detection on the image yields two faces, of which only one lies within the human body mask map of the main human body, namely the upper of the two faces, and that face lies within the upper 1/2 of the height of the main human body's mask. Hence the image contains a main human body with an associated face, and the image to be filtered shown in fig. 5 can be used as a first training sample.
The preset region may be the upper 1/2 of the main human body's mask map in the vertical direction; of course, it may also be another region, such as the upper 5/8 in the vertical direction. In essence, the preset region is the region where a person's face is located, and it is not limited in the present application.
Further, when none of the M face mask maps has a second IoU greater than the third threshold, the image to be filtered is taken as a second training sample. That is, no face exists in the human body box corresponding to the third human body mask map, so there is no face associated with the human body in that box. Since a main human body is present in the image to be filtered, such an image can be used as a second training sample.
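Putting these rules together, a sketch of the sample-classification step; it reuses face_in_body_ratio from the sketch above, takes the preset region to be the upper half of the body mask's vertical extent, and the handling of a single face outside that region is an assumption the text does not specify:

import numpy as np

def classify_sample(face_masks, body_mask, third_thresh=0.5):
    """Return 'discard', 'first' (face-associated), or 'second' (body only)."""
    inside = [f for f in face_masks
              if face_in_body_ratio(f, body_mask) > third_thresh]
    if len(inside) >= 2:
        return "discard"   # main body segmentation is not fine enough
    if len(inside) == 0:
        return "second"    # main human body without an associated face
    # Exactly one face: check that it lies in the preset (upper-half) region.
    body_rows = np.where(body_mask.any(axis=1))[0]
    mid = (body_rows.min() + body_rows.max()) / 2
    face_rows = np.where(inside[0].any(axis=1))[0]
    # A single face outside the preset region is unspecified; discard (assumption).
    return "first" if face_rows.mean() < mid else "discard"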
It can be seen that, in the embodiments of the present application, when a main human body exists in the image to be filtered, the main human body is associated with a face, and according to the association result the image is finely divided into training samples of different types while images with insufficient segmentation quality are filtered out, so that the constructed training samples are of higher quality. In addition, dividing the images to be filtered into different types of training samples in advance allows neural networks to be trained in a targeted manner with the corresponding sample types, so the trained networks can meet the expected requirements and the training success rate is improved.
In one embodiment of the present application, the image to be filtered may be classified as a first or second training sample according to its quality, and different neural networks may be trained with the different sample types. For example, the first training sample may be used to train a first neural network and/or the second training sample may be used to train a second neural network. Because the first training sample associates a human body with a face, the first neural network may be a network that performs basic pedestrian identification or that uses facial information to assist pedestrian identification; for example, it may perform joint face-body clustering and other tasks that require face-body association. The second neural network performs plain pedestrian identification that depends only on human body recognition, such as human body archiving, human body clustering, and body-based identification.
Referring to fig. 6, fig. 6 is a schematic flowchart of neural network training according to an embodiment of the present application. The subject that trains the neural network may be the image filtering apparatus described above or another device; the present application is not limited in this respect. If the training is performed by another device, the trained neural network is migrated to the image filtering apparatus to perform human body detection on the image to be filtered. The training method includes the following steps:
601: and acquiring a third training sample and a fourth training sample, wherein the labeling precision of the human body frame in the third training sample is lower than that of the human body frame in the second training sample.
The third training sample is a relatively coarse sample, i.e., its annotation precision for the human body is relatively low. Generally, constructing such training samples places little demand on the quality of the original images (the images used to construct the training samples); they are easy to obtain and enormous in number.
For example, as shown in fig. 7, because the human bodies in the third training sample are crowded together, annotating the leftmost human body inevitably includes the adjacent human bodies, so the annotation precision for the leftmost human body is relatively low.
The fourth training sample is a relatively fine sample, i.e., its annotation precision for the human body is high; such samples generally place high demands on the quality of the original images, are not easy to obtain, and are few in number. Typically, the fourth training sample contains only one human body, and even if it contains several, they are relatively far apart, so the box annotated for each human body contains no other human body.
As shown in fig. 8, the fourth training sample contains only one human body, and that human body's box contains no other human body, so the annotation precision for the human body is relatively high.
602: and adjusting network parameters of the third neural network by using the third training sample to obtain an adjusted third neural network.
The third neural network is an initially constructed neural network for human body detection and segmentation. Thus, the third neural network may be trained using a third training sample to adjust network parameters of the third neural network.
It should be appreciated that during the actual training process, a plurality of third training samples may be used to train the third neural network until the third neural network converges to obtain an adjusted third neural network.
603: and adjusting the network parameters of the adjusted third neural network by using a fourth training sample to obtain the neural network with the training completed.
Further, the network parameters of the adjusted third neural network are adjusted using the fourth training sample. Because the human body labeling precision in the fourth training sample is higher, the fourth training sample pair is used for training, namely the fourth training sample pair is used for performing fine tuning (tuning-tuning) on the adjusted third neural network until the adjusted third neural network is converged, and the neural network for completing training is obtained.
It can be seen that, in this embodiment, coarse training samples (third training samples) are used first; because such samples are relatively easy to obtain, the efficiency of network training is improved. Fine training samples (fourth training samples) are then used for fine-tuning, so the trained neural network has higher segmentation precision, improving the accuracy of human body detection and segmentation.
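A high-level sketch of this coarse-then-fine schedule, assuming a PyTorch-style model, data loaders, and loss function; the epoch counts and the smaller fine-tuning learning rate are assumptions reflecting common practice:

import torch

def train_two_stage(model, coarse_loader, fine_loader, loss_fn,
                    coarse_epochs=10, fine_epochs=3):
    """Pretrain on coarse (third) samples, then fine-tune on fine (fourth) ones."""
    for lr, loader, epochs in [(1e-3, coarse_loader, coarse_epochs),
                               (1e-4, fine_loader, fine_epochs)]:
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        for _ in range(epochs):
            for images, targets in loader:
                opt.zero_grad()
                loss = loss_fn(model(images), targets)
                loss.backward()
                opt.step()
    return model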
Referring to fig. 9, fig. 9 is a functional block diagram of an image filtering apparatus according to an embodiment of the present application. The image filtering apparatus 900 includes a detection unit 901 and a filtering unit 902, where:
the detection unit 901 is configured to perform human body detection on an image to be filtered to obtain N human body boxes and N human body mask maps, where the N human body boxes correspond one-to-one to the N human body mask maps and N is an integer greater than or equal to 1;
and the filtering unit 902 is configured to filter the image to be filtered according to the N human body boxes and the N human body mask maps.
In some possible embodiments, in filtering the image to be filtered according to the N human body boxes and the N human body mask maps, the filtering unit 902 is specifically configured to:
acquire the confidence of each of the N human body boxes;
perform non-maximum suppression on the N human body boxes according to the confidence of each human body box and the human body mask map corresponding to each human body box, to obtain K human body boxes, where K is an integer less than or equal to N;
and filter the image to be filtered according to the K human body boxes and the K human body mask maps corresponding to the K human body boxes.
In some possible embodiments, in performing non-maximum suppression on the N human frames according to the confidence level of each human frame and the human mask map corresponding to each human frame, the filtering unit 902 is specifically configured to:
sorting the N human body frames in descending order according to the confidence degree of each human body frame to obtain a first set;
removing a first human body frame from the first set to obtain a second set, wherein the first human body frame is the human body frame with the highest confidence among the N human body frames;
determining a first intersection ratio between a first human body mask map and a second human body mask map, wherein the first human body mask map is a human body mask map corresponding to the first human body frame, the second human body mask map is a human body mask map corresponding to a second human body frame, and the second human body frame is any human body frame in the second set;
deleting the second human body frame from the second set to obtain a third set under the condition that the first intersection ratio is larger than a first threshold value; under the condition that the first intersection ratio is smaller than or equal to the first threshold value, keeping the second human body frame in the second set and taking the second set as the third set;
and taking the third set as a new first set and repeating the non-maximum suppression on the human body frames of the new first set; under the condition that the new first set contains no elements, taking the human body frames removed as highest-confidence frames during the non-maximum suppression of the N human body frames as the K human body frames (a sketch of this procedure is given below).
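For concreteness, the following is a minimal NumPy sketch of this mask-based non-maximum suppression; the threshold value of 0.5, the binary mask format, and all helper names are illustrative assumptions rather than details fixed by this application. The helper computes the first intersection ratio in the manner defined in the next embodiment.

    import numpy as np

    def mask_overlap_ratio(mask_a, mask_b):
        # First intersection ratio: intersection area divided by each mask's
        # own area, taking the larger of the two ratios.
        inter = np.minimum(mask_a, mask_b).sum()
        return max(inter / mask_a.sum(), inter / mask_b.sum())

    def mask_nms(boxes, masks, scores, first_threshold=0.5):
        # Mask-based non-maximum suppression over the N human body frames.
        first_set = list(np.argsort(scores)[::-1])   # descending confidence
        kept = []                                    # frames removed each round
        while first_set:
            best = first_set.pop(0)                  # highest-confidence frame
            kept.append(best)
            # Drop frames whose mask overlaps the best mask too strongly.
            first_set = [i for i in first_set
                         if mask_overlap_ratio(masks[best], masks[i]) <= first_threshold]
        return [boxes[i] for i in kept], [masks[i] for i in kept]

    # Tiny demo: frames 0 and 1 overlap heavily, frame 2 stands apart.
    masks = [np.zeros((50, 50)) for _ in range(3)]
    masks[0][0:20, 0:10] = 1
    masks[1][0:20, 1:11] = 1
    masks[2][0:20, 30:40] = 1
    boxes = [(0, 0, 10, 20), (1, 0, 11, 20), (30, 0, 40, 20)]
    kept_boxes, kept_masks = mask_nms(boxes, masks, np.array([0.9, 0.8, 0.7]))
    print(len(kept_boxes))   # 2: frames 0 and 2 become the K human body frames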
In some possible embodiments, the filtering unit 902 is specifically configured to, in determining a first intersection ratio between the first human mask map and the second human mask map:
summing pixel values of all pixel points in an intersection area between the first human mask map and the second human mask map to obtain a first mask area;
summing pixel values of all pixel points in the first human mask map to obtain a second mask area;
summing pixel values of all pixel points in the second human mask map to obtain a third mask area;
and determining a first intersection ratio between the first human mask map and the second human mask map according to the first mask area, the second mask area and the third mask area.
In some possible implementations, the filtering unit 902 is specifically configured to, in determining a first intersection ratio between the first human mask map and the second human mask map according to the first mask area, the second mask area, and the third mask area:
Determining a first ratio of the first mask area to the second mask area;
determining a second ratio of the first mask area to the third mask area;
and taking the larger of the first ratio and the second ratio as the first intersection ratio (a sketch follows below).
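Written out with the three mask areas named as in this embodiment, and assuming binary masks with pixel values 0 or 1 (a format this application does not mandate), the computation can be sketched as:

    import numpy as np

    def first_intersection_ratio(first_mask, second_mask):
        # First mask area: pixel-value sum over the intersection region.
        first_area = np.minimum(first_mask, second_mask).sum()
        # Second and third mask areas: pixel-value sums over each mask.
        second_area = first_mask.sum()
        third_area = second_mask.sum()
        # The larger of the two ratios is the first intersection ratio.
        return max(first_area / second_area, first_area / third_area)

    a = np.zeros((4, 4)); a[:3, :3] = 1   # mask area 9
    b = np.zeros((4, 4)); b[1:, 1:] = 1   # mask area 9, overlapping a in 4 pixels
    print(first_intersection_ratio(a, b))  # max(4/9, 4/9) = 0.444...

Taking the larger ratio (i.e. normalizing by the smaller mask area) makes the suppression sensitive to the case where one body mask is almost entirely contained in another.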
In some possible embodiments, the filtering unit 902 is specifically configured to, in filtering the image to be filtered according to the N human frames and the N human mask graphs:
summing pixel values of all pixel points in each human body mask diagram in the K human body mask diagrams to obtain mask areas corresponding to each human body mask diagram in the K human body mask diagrams;
determining a third ratio of the mask area of a third human body mask map to the mask area of a fourth human body mask map, wherein the third human body mask map is the human body mask map with the largest mask area among the K human body mask maps, and the fourth human body mask map is the human body mask map with the second largest mask area among the K human body mask maps;
and retaining the image to be filtered when the third ratio is larger than or equal to a second threshold value, and filtering out the image to be filtered when the third ratio is smaller than the second threshold value (a sketch of this decision rule follows).
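A compact sketch of this retention rule follows; the second threshold value of 1.5 and the treatment of the single-mask case are assumptions made here for illustration, not values fixed by this application.

    import numpy as np

    def contains_main_body(kept_masks, second_threshold=1.5):
        # Decide whether the largest body mask dominates the second largest.
        areas = sorted((float(m.sum()) for m in kept_masks), reverse=True)
        if len(areas) == 1:
            return True   # a single detected body is assumed to be a main body
        third_ratio = areas[0] / areas[1]   # largest over second-largest area
        return third_ratio >= second_threshold

An image for which contains_main_body(...) returns False would be filtered out; one for which it returns True is retained as containing a main human body.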
In some possible embodiments, in the case of retaining the image to be filtered, the apparatus further comprises: a sample construction unit 903; wherein, sample construction unit 903 is used for:
performing face detection on the image to be filtered to obtain M face mask images, wherein M is an integer greater than or equal to 0;
and constructing training samples according to the third human body mask map and the M face mask maps.
In some possible embodiments, in constructing training samples according to the third human body mask map and the M face mask maps, the sample construction unit 903 is specifically configured to:
determining a second intersection ratio between each of the M face mask maps and the third human body mask map;
and constructing training samples according to the second intersection ratio between each face mask map and the third human body mask map.
In some possible embodiments, in constructing the training samples according to the second intersection ratio between each face mask map and the third human body mask map, the sample construction unit 903 is specifically configured to:
discarding the image to be filtered under the condition that the number of face mask maps whose second intersection ratio is larger than a third threshold value among the M face mask maps is larger than or equal to two;
taking the image to be filtered as a first training sample under the condition that the number of face mask maps whose second intersection ratio is larger than the third threshold value among the M face mask maps is one and that face mask map is located in a preset area of the third human body mask map;
taking the image to be filtered as a second training sample under the condition that the number of face mask maps whose second intersection ratio is larger than the third threshold value among the M face mask maps is zero;
wherein the first training sample contains a human face associated with the human body corresponding to the third human body mask map, and the second training sample contains no human face associated with that human body. A sketch of this routing logic follows.
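The routing above can be summarized in a short sketch; the third threshold value, the preset-region test, and the handling of a single overlapping face that lies outside the preset area are illustrative assumptions, since this application leaves them open.

    def route_training_sample(second_ratios, in_preset_region, third_threshold=0.5):
        # second_ratios: second intersection ratio of each of the M face masks
        # against the third (largest-area) human body mask map.
        # in_preset_region: parallel booleans, True if that face mask lies in
        # the preset area of the third human body mask map.
        hits = [i for i, r in enumerate(second_ratios) if r > third_threshold]
        if len(hits) >= 2:
            return "discard"                 # ambiguous: two or more candidate faces
        if len(hits) == 1 and in_preset_region[hits[0]]:
            return "first training sample"   # main body with an associated face
        if not hits:
            return "second training sample"  # main body with no associated face
        return "discard"                     # one hit outside the preset area (assumed)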
In some possible embodiments, in determining the second intersection ratio between each of the M face mask maps and the third human body mask map, the sample construction unit 903 is specifically configured to:
determining an intersection area of each of the M face mask maps and the third human body mask map, and summing pixel values of the pixel points in the intersection area to obtain a fourth mask area corresponding to each face mask map;
summing the pixel values of the pixel points in each face mask map to obtain a fifth mask area corresponding to each face mask map;
and taking the ratio of the fourth mask area corresponding to each face mask map to the fifth mask area corresponding to that face mask map as the second intersection ratio between that face mask map and the third human body mask map (a sketch follows).
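As a sketch, again assuming binary masks: unlike the first intersection ratio, only the face mask's own area appears in the denominator, so the value measures how much of the face lies inside the third human body mask map.

    import numpy as np

    def second_intersection_ratio(face_mask, body_mask):
        # Fourth mask area: pixel-value sum over the intersection of the face
        # mask and the third human body mask map.
        fourth_area = np.minimum(face_mask, body_mask).sum()
        # Fifth mask area: pixel-value sum over the face mask itself.
        fifth_area = face_mask.sum()
        return fourth_area / fifth_area

    face = np.zeros((4, 4)); face[0, 0:2] = 1   # face area 2
    body = np.zeros((4, 4)); body[:3, :3] = 1   # body area 9
    print(second_intersection_ratio(face, body))  # 2/2 = 1.0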
In some possible embodiments, the apparatus further comprises: a first training unit 904; wherein, the first training unit 904 is used for:
training a first neural network using the first training sample; and/or training a second neural network using the second training samples.
In some possible embodiments, in performing human body detection on the image to be filtered to obtain N human body frames and N human body mask graphs, the filtering unit 902 is specifically configured to:
and inputting the image to be filtered into a neural network which completes training to perform human body detection, so as to obtain N human body frames and N human body mask patterns.
In some possible embodiments, before inputting the image to be filtered into the trained neural network to obtain the N human body frames and the N human body mask maps, the image filtering apparatus further includes a second training unit 905, where the second training unit 905 is configured to:
Obtaining a third training sample and a fourth training sample, wherein the labeling precision of the third training sample on the human body is lower than that of the fourth training sample;
adjusting network parameters of a third neural network by using the third training sample to obtain an adjusted third neural network;
and adjusting the network parameters of the adjusted third neural network by using the fourth training sample to obtain the trained neural network.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device 1000 includes a transceiver 1001, a processor 1002, and a memory 1003, which are connected by a bus 1004. The memory 1003 is used for storing computer programs and data, and the data stored in the memory 1003 can be transferred to the processor 1002.
The processor 1002 is configured to read a computer program in the memory 1003 to perform the steps of any of the method examples described above.
The present application also provides a computer storage medium storing a computer program that is executed by a processor to implement some or all of the steps of any one of the image filtering methods described in the method embodiments above.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform some or all of the steps of any one of the image filtering methods described in the method embodiments above.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should understand that the present application is not limited by the order of actions described, as some steps may be performed in another order or simultaneously according to the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all alternative embodiments, and that the actions and modules involved are not necessarily required by the present application.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections via some interfaces, devices or units, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units described above may be implemented either in hardware or in software program modules.
The integrated units, if implemented in the form of software program modules and sold or used as a stand-alone product, may be stored in a computer-readable memory. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a memory, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned memory includes various media capable of storing program code, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer-readable memory, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principles and implementations of the present application, and the above description of the embodiments is intended only to help understand the method of the present application and its core idea. Meanwhile, those skilled in the art may make modifications to the specific implementations and the application scope according to the idea of the present application. In view of the above, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. An image filtering method, comprising:
human body detection is carried out on the image to be filtered, and N human body frames and N human body mask patterns are obtained, wherein the N human body frames and the N human body mask patterns are in one-to-one correspondence, and N is an integer greater than or equal to 1;
filtering the image to be filtered according to the N human body frames and the N human body mask maps, so as to filter out images to be filtered in which the human bodies interfere with one another and retain images to be filtered containing a main human body;
the filtering the image to be filtered according to the N human frames and the N human mask patterns includes:
summing pixel values of all pixel points in each of the K human body mask maps to obtain a mask area corresponding to each of the K human body mask maps, the K human body mask maps being the human body mask maps corresponding to K human body frames obtained by performing non-maximum suppression on the N human body frames;
determining a third ratio of the mask area of a third human body mask map to the mask area of a fourth human body mask map, wherein the third human body mask map is the human body mask map with the largest mask area among the K human body mask maps, and the fourth human body mask map is the human body mask map with the second largest mask area among the K human body mask maps;
determining that the image to be filtered contains a main human body when the third ratio is greater than or equal to a second threshold value, and retaining the image to be filtered when the main human body in the image to be filtered has an associated human face; and filtering the image to be filtered under the condition that the third ratio is smaller than the second threshold value.
2. The method of claim 1, wherein filtering the image to be filtered according to the N human frames and the N human mask maps comprises:
acquiring the confidence coefficient of each human body frame in the N human body frames;
according to the confidence degree of each human body frame and the human body mask diagram corresponding to each human body frame, performing non-maximum suppression on the N human body frames to obtain K human body frames, wherein K is an integer smaller than or equal to N;
and filtering the image to be filtered according to the K human body frames and the K human body mask images corresponding to the K human body frames.
3. The method of claim 2, wherein performing non-maximum suppression on the N human frames according to the confidence level of each human frame and the human mask map corresponding to each human frame to obtain K human frames comprises:
sorting the N human body frames in descending order according to the confidence degree of each human body frame to obtain a first set;
removing a first human body frame from the first set to obtain a second set, wherein the first human body frame is the human body frame with the highest confidence among the N human body frames;
Determining a first intersection ratio between a first human body mask map and a second human body mask map, wherein the first human body mask map is a human body mask map corresponding to the first human body frame, the second human body mask map is a human body mask map corresponding to a second human body frame, and the second human body frame is any human body frame in the second set;
deleting the second human body frame from the second set to obtain a third set under the condition that the first intersection ratio is larger than a first threshold value; under the condition that the first intersection ratio is smaller than or equal to the first threshold value, keeping the second human body frame in the second set and taking the second set as the third set;
and taking the third set as a new first set and repeating the non-maximum suppression on the human body frames of the new first set; under the condition that the new first set contains no elements, taking the human body frames removed as highest-confidence frames during the non-maximum suppression of the N human body frames as the K human body frames.
4. The method of claim 3, wherein the determining a first intersection ratio between the first human body mask map and the second human body mask map comprises:
summing pixel values of all pixel points in an intersection area between the first human body mask map and the second human body mask map to obtain a first mask area;
summing pixel values of all pixel points in the first human body mask map to obtain a second mask area;
summing pixel values of all pixel points in the second human body mask map to obtain a third mask area;
and determining the first intersection ratio between the first human body mask map and the second human body mask map according to the first mask area, the second mask area and the third mask area.
5. The method of claim 4, wherein the determining the first intersection ratio between the first human body mask map and the second human body mask map according to the first mask area, the second mask area, and the third mask area comprises:
determining a first ratio of the first mask area to the second mask area;
determining a second ratio of the first mask area to the third mask area;
and taking the larger of the first ratio and the second ratio as the first intersection ratio.
6. The method according to claim 1, wherein in case the image to be filtered is retained, the method further comprises:
performing face detection on the image to be filtered to obtain M face mask images, wherein M is an integer greater than or equal to 0;
and constructing training samples according to the third human body mask map and the M face mask maps.
7. The method of claim 6, wherein constructing training samples according to the third human body mask map and the M face mask maps comprises:
determining a second intersection ratio between each of the M face mask maps and the third human body mask map;
and constructing training samples according to the second intersection ratio between each face mask map and the third human body mask map.
8. The method of claim 7, wherein constructing training samples according to the second intersection ratio between each face mask map and the third human body mask map comprises:
discarding the image to be filtered under the condition that the number of face mask maps whose second intersection ratio is larger than a third threshold value among the M face mask maps is larger than or equal to two;
taking the image to be filtered as a first training sample under the condition that the number of face mask maps whose second intersection ratio is larger than the third threshold value among the M face mask maps is one and that face mask map is located in a preset area of the third human body mask map;
taking the image to be filtered as a second training sample under the condition that the number of face mask maps whose second intersection ratio is larger than the third threshold value among the M face mask maps is zero;
wherein the first training sample contains a human face associated with the human body corresponding to the third human body mask map, and the second training sample contains no human face associated with that human body.
9. The method of claim 8, wherein the determining a second intersection ratio between each of the M face mask maps and the third human body mask map comprises:
determining an intersection area of each of the M face mask maps and the third human body mask map, and summing pixel values of the pixel points in the intersection area to obtain a fourth mask area corresponding to each face mask map;
summing the pixel values of the pixel points in each face mask map to obtain a fifth mask area corresponding to each face mask map;
and taking the ratio of the fourth mask area corresponding to each face mask map to the fifth mask area corresponding to that face mask map as the second intersection ratio between that face mask map and the third human body mask map.
10. The method according to claim 9, wherein the method further comprises:
training a first neural network using the first training sample; and/or training a second neural network using the second training samples.
11. The method according to any one of claims 1-10, wherein the performing human detection on the image to be filtered to obtain N human frames and N human mask images includes:
and inputting the image to be filtered into a neural network which completes training to perform human body detection, so as to obtain N human body frames and N human body mask patterns.
12. The method of claim 11, wherein before inputting the image to be filtered into the trained neural network to obtain the N human body frames and the N human body mask maps, the method further comprises:
Obtaining a third training sample and a fourth training sample, wherein the labeling precision of the third training sample on the human body is lower than that of the fourth training sample;
adjusting network parameters of a third neural network by using the third training sample to obtain an adjusted third neural network;
and adjusting the network parameters of the adjusted third neural network by using the fourth training sample to obtain the trained neural network.
13. An image filtering apparatus, comprising:
the detection unit is used for detecting the human body of the image to be filtered to obtain N human body frames and N human body mask patterns, wherein the N human body frames and the N human body mask patterns are in one-to-one correspondence, and N is an integer greater than or equal to 1;
the filtering unit is configured to filter the image to be filtered according to the N human body frames and the N human body mask maps, so as to filter out images to be filtered in which the human bodies interfere with one another and retain images to be filtered containing a main human body;
in terms of filtering the image to be filtered according to the N human frames and the N human mask graphs, the filtering unit is specifically configured to:
summing pixel values of all pixel points in each of the K human body mask maps to obtain a mask area corresponding to each of the K human body mask maps, the K human body mask maps being the human body mask maps corresponding to K human body frames obtained by performing non-maximum suppression on the N human body frames;
determining a third ratio of the mask area of a third human body mask map to the mask area of a fourth human body mask map, wherein the third human body mask map is the human body mask map with the largest mask area among the K human body mask maps, and the fourth human body mask map is the human body mask map with the second largest mask area among the K human body mask maps;
determining that the image to be filtered contains a main human body when the third ratio is greater than or equal to a second threshold value, and retaining the image to be filtered when the main human body in the image to be filtered has an associated human face; and filtering the image to be filtered under the condition that the third ratio is smaller than the second threshold value.
14. An electronic device, comprising: a processor and a memory, the processor being connected to the memory, the memory being for storing a computer program, the processor being for executing the computer program stored in the memory to cause the electronic device to perform the method of any one of claims 1-12.
15. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which is executed by a processor to implement the method of any one of claims 1-12.