CN114399801A - Target detection method and device - Google Patents

Target detection method and device

Info

Publication number
CN114399801A
CN114399801A (application CN202111447515.2A)
Authority
CN
China
Prior art keywords
target
target detection
face
detected
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111447515.2A
Other languages
Chinese (zh)
Inventor
贺克赛
程新景
杨睿刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Network Technology Shanghai Co Ltd
Original Assignee
International Network Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Network Technology Shanghai Co Ltd filed Critical International Network Technology Shanghai Co Ltd
Priority to CN202111447515.2A priority Critical patent/CN114399801A/en
Publication of CN114399801A publication Critical patent/CN114399801A/en
Pending legal-status Critical Current

Abstract

The invention provides a target detection method and device. The method comprises: inputting an acquired picture to be detected into a target detection model to obtain a target detection result output by the target detection model; the target detection model is obtained by training based on training samples and corresponding target truth values; the target detection model is used for carrying out specific target detection on the picture to be detected based on the face features extracted from the picture to be detected, to obtain the target detection result. According to the method, different target features are extracted through different methods, and the specific target feature is detected based on the face feature extracted first, so that target detection is performed by utilizing the relevance between the face and the specific target, improving both the accuracy of the target detection result and the efficiency of target detection.

Description

Target detection method and device
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a target detection method and apparatus.
Background
Object detection, also called object extraction, is image segmentation based on the geometric and statistical features of a target. With the development of computer technology and the wide application of computer vision principles, research on real-time target tracking using computer image processing technology has become increasingly popular, and dynamic real-time tracking and positioning of targets has broad application value in intelligent traffic systems, intelligent monitoring systems, military target detection, surgical instrument positioning in medical navigation surgery, and the like. For example, in some monitoring scenes, actions such as answering a call or smoking are not allowed; smoking, calling, and similar actions in these scenes therefore need to be detected in real time so that an early warning can be issued promptly.
At present, most specific-target detection methods improve small-target detection precision by enlarging the input, fusing deep and shallow features, applying attention mechanisms, oversampling small-target data, and the like. However, specific targets account for a low proportion of samples in the detection task, occupy few pixels, and have inconspicuous features, so specific-target features are easily lost during down-sampling and detection accuracy is poor.
Disclosure of Invention
The invention provides a target detection method and a target detection device, which are used for solving the defect of missed detection caused by smaller targets in the prior art, realizing the positioning of specific targets and improving the detection precision of the specific targets.
The invention provides a target detection method, which comprises the following steps: inputting the acquired picture to be detected into a target detection model to obtain a target detection result output by the target detection model; the target detection model is obtained by training based on a training sample and a corresponding target truth value; the target detection model is used for carrying out specific target detection on the picture to be detected based on the face features extracted from the picture to be detected to obtain a target detection result.
According to a target detection method provided by the present invention, the target detection result includes a face detection result and a specific target detection result, and the target detection model includes: the human face feature extraction layer is used for extracting features based on the picture to be detected to obtain human face features; the specific target extraction layer is used for extracting the features of the picture to be detected based on the human face features to obtain specific target features; the face detection layer is used for detecting based on the face features to obtain a face detection result; and the specific target detection layer is used for detecting the specific target characteristics to obtain a specific target detection result.
According to the target detection method provided by the invention, the feature extraction of the picture to be detected based on the human face features comprises the following steps: determining a region to be detected of the picture to be detected based on the human face features; and extracting the characteristics of the area to be detected to obtain the specific target characteristics.
According to the target detection method provided by the invention, the training of the target detection model comprises the following steps: acquiring a training sample and a target truth value corresponding to the training sample; and training the model to be trained by taking the training sample as input data used for training and taking the target truth value corresponding to the training sample as a label to obtain the target detection model for generating the target detection result of the picture to be detected.
According to the target detection method provided by the invention, the training of the model to be trained comprises the following steps: inputting the training sample into the model to be trained to obtain a human face prediction result and a specific target prediction result output by the model to be trained; constructing a face loss function according to the face prediction result and a target truth value corresponding to the face prediction result; constructing a specific target loss function according to the specific target prediction result and a target truth value corresponding to the specific target prediction result; and obtaining a total loss function based on the face loss function and the specific target loss function, converging based on the total loss function, and finishing the training.
According to the target detection method provided by the invention, the total loss function is expressed as:
L = L_T + m · L_face
where L denotes the total loss function, L_T the specific-target loss function, L_face the face loss function, and m a learnable variable representing the correlation between the face and the specific target.
The present invention also provides a target detection apparatus, comprising: the target detection module is used for inputting the acquired picture to be detected into a target detection model to obtain a target detection result output by the target detection model; the target detection model is obtained by training based on a training sample and a corresponding target truth value; the target detection model is used for carrying out specific target detection on the picture to be detected based on the face features extracted from the picture to be detected to obtain a target detection result.
The present invention also provides an electronic device, comprising a memory, a processor and a computer program stored on the memory and operable on the processor, wherein the processor implements the steps of any of the above-mentioned object detection methods when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the object detection method as described in any one of the above.
The invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the object detection method as described in any one of the above.
According to the target detection method and device provided by the invention, different target features are extracted through different methods, and the specific target feature is detected based on the face feature extracted first, so that target detection is performed by utilizing the relevance between the face and the specific target, improving both the accuracy of the target detection result and the efficiency of target detection.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a target detection method provided by the present invention;
FIG. 2 is a schematic diagram of the architecture of a target detection model provided by the present invention;
FIG. 3 is a schematic flow chart of a training target detection model provided by the present invention;
FIG. 4 is a schematic structural diagram of an object detecting device provided in the present invention;
FIG. 5 is a schematic diagram of a training module according to the present invention;
fig. 6 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 shows a schematic flow chart of a target detection method of the present invention, which includes:
inputting the acquired picture to be detected into a target detection model to obtain a target detection result output by the target detection model;
the target detection model is obtained by training based on a training sample and a corresponding target truth value;
the target detection model is used for carrying out specific target detection on the picture to be detected based on the face features extracted from the picture to be detected to obtain a target detection result.
The object detection method of the present invention is described below with particular reference to fig. 2-3.
In an optional embodiment, before inputting the acquired picture to be detected into the target detection model, the method further includes: and acquiring the picture to be detected. It should be noted that the acquired picture to be detected may be a picture required for behavior recognition, scene recognition, identity recognition, or other target recognition. For example, when the automatic driving abnormal behavior detection is required, the acquired picture to be detected is derived from a video stream or at least one frame of picture sequence of the vehicle, which is shot at the driving position in real time; for another example, when the autonomous vehicle needs to perform scene recognition, the acquired picture to be detected is derived from a picture sequence acquired by the vehicle in real time for the environment around the vehicle body, at this time, the picture to be detected can be acquired by a radar, a sensor or a camera of the vehicle body, and the source of the picture to be detected is not further limited.
In this embodiment, the acquired picture to be detected is input into the target detection model to obtain the target detection result output by the target detection model. Taking the detection of specific targets such as cigarettes and mobile phones as an example: because such articles occupy very few pixels, a detection model aimed only at a single target of this kind (such as a cigarette or a mobile phone) is ill-suited to the task. Smoking and calling behaviors, however, are all related to the human face, that is, they are performed by a person; therefore, specific targets such as cigarettes and mobile phones can be detected by relying on the easily recognized human face, which improves model detection precision and avoids overfitting.
Specifically, the target detection result includes a face detection result and a specific target detection result, and the target detection model includes: the human face feature extraction layer is used for extracting features based on the picture to be detected to obtain human face features; the specific target extraction layer is used for extracting the features of the picture to be detected based on the human face features to obtain specific target features; the face detection layer is used for detecting based on the face features to obtain a face detection result; and the specific target detection layer is used for detecting the specific target characteristics to obtain a specific target detection result.
It should be noted that, after the acquired image to be detected is input into the target detection model, corresponding features are extracted by the respective feature extraction layers of the target detection model, and detection is then performed on the extracted features to obtain the corresponding targets, thereby realizing multi-target detection and greatly improving detection efficiency. Referring to fig. 2, the target detection model includes a plurality of convolutional layers; a specific convolutional layer is selected to extract each target feature, and the detection layer corresponding to that convolutional layer detects the extracted target feature to obtain the target detection result. Note that different targets correspond to different convolutional layers for feature extraction and target detection.
Firstly, the face feature extraction layer performs feature extraction on the picture to be detected to obtain the face features, and the specific target extraction layer then performs feature extraction on the picture to be detected based on the face features to obtain the specific target features. It should be noted that, because the human face is easier to identify than other specific targets such as cigarettes and mobile phones, the face feature extraction layer extracts the face features first; the association between the face and the specific targets is then used to quickly locate those targets and extract the specific target features, which helps improve detection efficiency.
Furthermore, when feature extraction is performed on the picture to be detected based on the face features, the method includes: determining a region to be detected of the picture to be detected based on the human face characteristics; and extracting the characteristics of the area to be detected to obtain the specific target characteristics.
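The patent does not specify how the region to be detected is derived from the face features; a minimal sketch, assuming the region is the detected face box expanded about its center by an assumed factor (the factor 1.5 and the clipping scheme are illustrative, not from the patent), could look like this:

```python
def roi_from_face(face_box, img_w, img_h, expand=1.5):
    """Expand a detected face box (x_min, y_min, x_max, y_max) about its
    center into a larger region to be searched for small hand-held targets
    such as cigarettes or phones, clipped to the image bounds."""
    x_min, y_min, x_max, y_max = face_box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    half_w = (x_max - x_min) * expand / 2.0
    half_h = (y_max - y_min) * expand / 2.0
    # Clip the expanded window so it stays inside the image.
    return (max(0.0, cx - half_w), max(0.0, cy - half_h),
            min(float(img_w), cx + half_w), min(float(img_h), cy + half_h))
```

Feature extraction would then be limited to this window instead of the whole picture, which is how the face/target relevance reduces the search space.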
Secondly, the face detection layer performs detection based on the face features to obtain a face detection result, and the specific target detection layer detects the specific target features to obtain a specific target detection result. It should be noted that the face features extracted by the face feature extraction layer are detected by the face detection layer, and the specific target features extracted by the specific target extraction layer are detected by the specific target detection layer, so that different targets are detected by different feature layers, improving the efficiency and accuracy of target detection.
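The two-branch flow described above (face branch first, specific-target branch anchored on it) can be sketched with stand-in layers; every function body below is illustrative (the patent's layers are convolutional, not channel means), and only the data flow matches the description:

```python
import numpy as np

def extract_face_features(image):
    """Stand-in for the face feature extraction layer (a conv backbone in
    the patent; the channel mean here is purely illustrative)."""
    return image.mean(axis=2)  # (H, W) response map

def extract_target_features(image, face_features, radius=8):
    """Stand-in for the specific-target extraction layer: restrict feature
    extraction to a window around the strongest face response, exploiting
    the face/specific-target association."""
    y, x = np.unravel_index(np.argmax(face_features), face_features.shape)
    h, w = face_features.shape
    y0, y1 = max(0, y - radius), min(h, y + radius)
    x0, x1 = max(0, x - radius), min(w, x + radius)
    return image[y0:y1, x0:x1].mean(axis=2), (x0, y0)

def detect(image):
    """Two-branch forward pass: face features first, then specific-target
    features extracted relative to them; each branch has its own head."""
    face_feat = extract_face_features(image)
    fy, fx = np.unravel_index(np.argmax(face_feat), face_feat.shape)
    target_feat, (ox, oy) = extract_target_features(image, face_feat)
    ty, tx = np.unravel_index(np.argmax(target_feat), target_feat.shape)
    return {"face": (int(fx), int(fy)), "target": (int(tx + ox), int(ty + oy))}
```

The point of the sketch is structural: the specific-target branch never sees the whole picture, only a region determined by the face branch's output.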
In an alternative embodiment, referring to FIG. 3, training the target detection model includes:
s31, acquiring a training sample and a corresponding target truth value;
and S32, training the model to be trained by taking the training sample as input data for training and the target true value corresponding to the training sample as a label to obtain the target detection model for generating the target detection result of the picture to be detected.
It should be noted that S3N in this specification does not represent the order of training the target detection model.
Step S31, a training sample and a corresponding target true value are obtained.
In this embodiment, obtaining the training samples and the corresponding target truth values includes: collecting training videos or images, and using a face detection method to select those containing face information as valid training samples; and labeling the valid training samples to obtain labels of the human faces with their corresponding bounding-box information, and labels of specific targets such as cigarettes and phones with their corresponding bounding-box information.
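The patent only specifies that each valid sample carries labels plus bounding-box information; one plausible record layout (field and file names are illustrative assumptions) is:

```python
# Hypothetical annotation record for one valid training sample.
# Boxes use (x_min, y_min, x_max, y_max) pixel coordinates.
sample_annotation = {
    "image": "frame_000123.jpg",  # illustrative file name
    "objects": [
        {"label": "face",      "bbox": [182, 64, 318, 221]},
        {"label": "cigarette", "bbox": [301, 198, 342, 215]},
        {"label": "phone",     "bbox": [120, 240, 175, 330]},
    ],
}
```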
It should be added that, when training videos or images are collected, the driving behaviors of different drivers can be recorded under different vehicle driving environments: videos of drivers smoking or using mobile phones, and videos of drivers neither smoking nor using mobile phones recorded as normal samples. In addition, images downloaded from the Internet or photographs of different specific targets can also be used as training samples.
In order to establish the relevance between a specific target and a person, and considering the redundancy between consecutive video frames, when screening the collected training videos, one frame is sampled every few frames and processed with a face detection algorithm so as to remove videos that do not contain face information; when screening collected images, the face detection algorithm is applied directly to remove images that do not contain face information.
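The screening step above can be sketched as a sample-then-filter pass; the step size and the `has_face` predicate are placeholders for the unspecified sampling interval and face detection algorithm:

```python
def screen_video_frames(frames, has_face, step=5):
    """Sample one frame every `step` frames to reduce redundancy between
    consecutive video frames, then keep only frames in which the face
    detector fires. `has_face` stands in for a real face detection
    algorithm (e.g. a cascade or CNN detector)."""
    sampled = frames[::step]
    return [frame for frame in sampled if has_face(frame)]
```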
When valid training samples are labeled, specific targets such as the human face, cigarettes and mobile phones are marked; that is, when a human face and a specific target appear in an image, that image frame is annotated, and its label is correspondingly set to face, cigarette and/or mobile phone.
In an optional embodiment, after obtaining the training samples and their corresponding target truth values, the method further includes: performing data enhancement on the training samples using a data enhancement strategy. The data enhancement strategy includes image scaling, horizontal mirror flipping, random brightness and hue adjustment, and the like; the label information of each target is kept unchanged, while the coordinate information of each bounding box is updated according to the corresponding geometric transformation.
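The bounding-box update for the horizontal mirror flip mentioned above, for instance, only remaps x coordinates; a minimal sketch (the tuple convention is assumed, not stated in the patent):

```python
def hflip_boxes(image_width, boxes):
    """Update bounding boxes (x_min, y_min, x_max, y_max) for a horizontal
    mirror flip of an image of width `image_width`. Class labels are
    unaffected by the geometric transform; only box coordinates change."""
    return [(image_width - x_max, y_min, image_width - x_min, y_max)
            for (x_min, y_min, x_max, y_max) in boxes]
```

Scaling and other geometric transforms would update the coordinates analogously, each with its own mapping.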
And step S32, training the model to be trained by taking the training sample as input data for training and the target truth value corresponding to the training sample as a label to obtain the target detection model for generating the target detection result of the picture to be detected.
In this embodiment, the network to be trained may be an existing network built into the training apparatus, or another network specified by the user, such as the FPN target detection network. The network to be trained generally comprises a feature extraction layer for extracting the corresponding target features, a target detection layer for detecting each extracted target feature, and a loss function. The training samples, or the training samples after data enhancement, are input into the model to be trained according to a preset iteration rule to obtain the trained target detection model.
Specifically, training a model to be trained includes: inputting the training sample into a model to be trained to obtain a face prediction result and a specific target prediction result output by the model to be trained; constructing a face loss function according to the face prediction result and a target truth value corresponding to the face prediction result; constructing a specific target loss function according to the specific target prediction result and a target true value corresponding to the specific target prediction result; and obtaining a total loss function based on the face loss function and the specific target loss function, converging based on the total loss function, and finishing the training.
The total loss function, expressed as:
L = L_T + m · L_face
where L denotes the total loss function, L_T the specific-target loss function, L_face the face loss function, and m a learnable variable representing the correlation between the face and the specific target.
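The formula above combines the two branch losses linearly; a minimal sketch of the computation (in training, m would be a learnable scalar optimized alongside the network weights rather than a fixed argument as here):

```python
def total_loss(loss_target, loss_face, m):
    """Total training loss L = L_T + m * L_face, where m models the
    correlation between the face and the specific target. Here the branch
    losses are plain floats; in a real training loop they would be tensors
    and m a trainable parameter."""
    return loss_target + m * loss_face
```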
In summary, in the embodiments of the present invention, different target features are extracted respectively through different methods, and the specific target feature is detected based on the face feature extracted first, so as to perform target detection by using the correlation between the face and the specific target, thereby improving the accuracy of the target detection result and improving the efficiency of target detection.
The object detection device provided by the present invention is described below, and the object detection device described below and the object detection method described above may be referred to in correspondence with each other.
Fig. 4 shows a schematic structural diagram of an object detection device, which comprises:
the target detection module 41 is configured to input the acquired to-be-detected picture into a target detection model to obtain a target detection result output by the target detection model;
the target detection model is obtained by training based on a training sample and a corresponding target truth value;
the target detection model is used for carrying out specific target detection on the picture to be detected based on the face features extracted from the picture to be detected to obtain a target detection result.
In an optional embodiment, the apparatus further includes a data obtaining module, configured to obtain the picture to be detected. It should be noted that the acquired picture to be detected may be a picture required for behavior recognition, scene recognition, identity recognition, or other target recognition. For example, when the automatic driving abnormal behavior detection is required, the acquired picture to be detected is derived from a video stream or at least one frame of picture sequence of the vehicle, which is shot at the driving position in real time; for another example, when the autonomous vehicle needs to perform scene recognition, the acquired picture to be detected is derived from a picture sequence acquired by the vehicle in real time for the environment around the vehicle body, at this time, the picture to be detected can be acquired by a radar, a sensor or a camera of the vehicle body, and the source of the picture to be detected is not further limited.
In this embodiment, the target detection module 41 inputs the acquired picture to be detected into the target detection model to obtain the target detection result output by the target detection model. Taking the detection of specific targets such as cigarettes and mobile phones as an example: because such articles occupy very few pixels, a detection model aimed only at a single target of this kind (such as a cigarette or a mobile phone) is ill-suited to the task. Smoking and calling behaviors, however, are all related to the human face, that is, they are performed by a person; therefore, specific targets such as cigarettes and mobile phones can be detected by relying on the easily recognized human face, which improves model detection precision and avoids overfitting.
Specifically, the target detection module 41 includes: the face feature extraction unit is used for extracting features based on the picture to be detected to obtain face features; the specific target extraction unit is used for extracting the features of the picture to be detected based on the human face features to obtain specific target features; the face detection unit is used for detecting based on the face features to obtain a face detection result; and the specific target detection unit is used for detecting the specific target characteristics to obtain a specific target detection result.
It should be noted that, after the acquired image to be detected is input into the target detection model, corresponding features are extracted based on the specific feature extraction layer of the target detection model, so that detection is performed according to the extracted features to obtain corresponding targets, thereby realizing multi-target detection and greatly improving detection efficiency.
Still further, the specific object extracting unit includes: the region selection subunit determines a region to be detected of the picture to be detected based on the human face characteristics; and the characteristic extraction subunit is used for extracting the characteristics of the area to be detected to obtain the specific target characteristics.
In an alternative embodiment, referring to fig. 5, the apparatus further comprises a training module for training the object detection model, the training module comprising:
a sample obtaining unit 51, configured to obtain a training sample and a corresponding target truth value;
the training unit 52 trains the model to be trained by using the training sample as input data for training and using the target truth value corresponding to the training sample as a label, so as to obtain a target detection model for generating a target detection result of the picture to be detected.
In the present embodiment, the sample acquiring unit 51 includes: a data acquisition subunit, which collects training videos or images and uses a face detection method to select those containing face information as valid training samples; and a labeling subunit, which labels the valid training samples to obtain labels of the human faces with their corresponding bounding-box information, and labels of specific targets such as cigarettes and phones with their corresponding bounding-box information.
It should be added that, when training videos or images are collected, the driving behaviors of different drivers can be recorded under different vehicle driving environments: videos of drivers smoking or using mobile phones, and videos of drivers neither smoking nor using mobile phones recorded as normal samples. In addition, images downloaded from the Internet or photographs of different specific targets can also be used as training samples.
In order to establish the relevance between a specific target and a person, and considering the redundancy between consecutive video frames, when screening the collected training videos, one frame is sampled every few frames and processed with a face detection algorithm so as to remove videos that do not contain face information; when screening collected images, the face detection algorithm is applied directly to remove images that do not contain face information.
When valid training samples are labeled, specific targets such as the human face, cigarettes and mobile phones are marked; that is, when a human face and a specific target appear in an image, that image frame is annotated, and its label is correspondingly set to face, cigarette and/or mobile phone.
In an optional embodiment, the training module further comprises a data enhancement unit, configured to perform data enhancement on the training samples using a data enhancement strategy. The data enhancement strategy includes image scaling, horizontal mirror flipping, random brightness and hue adjustment, and the like; the label information of each target is kept unchanged, while the coordinate information of each bounding box is updated according to the corresponding geometric transformation.
The training unit 52 trains the model to be trained by using the training samples as input data and the target truth values corresponding to the training samples as labels, so as to obtain a target detection model for generating the target detection result of the picture to be detected. It should be noted that the network to be trained may be an existing network built into the training apparatus, or another network specified by the user, such as the FPN target detection network. The network to be trained generally comprises a feature extraction layer for extracting the corresponding target features, a target detection layer for detecting each extracted target feature, and a loss function. The training samples, or the training samples after data enhancement, are input into the model to be trained according to a preset iteration rule to obtain the trained target detection model.
A training unit 52 comprising: the training subunit inputs the training sample into the model to be trained to obtain a human face prediction result and a specific target prediction result output by the model to be trained; the first loss function acquisition subunit constructs a face loss function according to the face prediction result and a target truth value corresponding to the face prediction result; the second loss function acquisition subunit constructs a specific target loss function according to the specific target prediction result and a target true value corresponding to the specific target prediction result; and the total loss function obtaining subunit obtains a total loss function based on the face loss function and the specific target loss function, converges based on the total loss function, and ends the training.
It should be noted that the total loss function is obtained as the sum of the specific target loss function and the product of the face loss function and a learnable variable. This allows the model to be trained to exploit the face features extracted first when learning specific targets such as cigarettes and mobile phones, improving the detection accuracy of the model.
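A minimal sketch of the total loss composition described above, assuming scalar loss values and a scalar learnable variable m; in a deep-learning framework, m would be registered as a trainable parameter and updated by the optimizer alongside the network weights:

```python
def total_loss(specific_target_loss, face_loss, m):
    """Total loss = specific-target loss + m * face loss.

    m is the learnable variable characterizing the correlation
    between the face and the specific target; here it is passed in
    as a plain float purely for illustration.
    """
    return specific_target_loss + m * face_loss
```

With, say, a specific-target loss of 2.0, a face loss of 1.5, and m = 0.5, the total loss is 2.75; as m is learned, the face branch's influence on the shared features adapts to how strongly the two tasks are correlated.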
Fig. 6 illustrates a physical structure diagram of an electronic device. As shown in Fig. 6, the electronic device may include: a processor 61, a communications interface 62, a memory 63, and a communication bus 64, wherein the processor 61, the communications interface 62, and the memory 63 communicate with one another via the communication bus 64. The processor 61 may invoke logic instructions in the memory 63 to perform a target detection method comprising: inputting the acquired picture to be detected into a target detection model to obtain a target detection result output by the target detection model; the target detection model is obtained by training based on training samples and corresponding target truth values; the target detection model performs specific target detection on the picture to be detected based on the face features extracted from the picture to be detected to obtain the target detection result.
Furthermore, the logic instructions in the memory 63 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or some of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
In another aspect, the present invention also provides a computer program product comprising a computer program. The computer program may be stored on a non-transitory computer-readable storage medium; when executed by a processor, it performs the target detection method provided above, the method comprising: inputting the acquired picture to be detected into a target detection model to obtain a target detection result output by the target detection model; the target detection model is obtained by training based on training samples and corresponding target truth values; the target detection model performs specific target detection on the picture to be detected based on the face features extracted from the picture to be detected to obtain the target detection result.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements the target detection method provided above, the method comprising: inputting the acquired picture to be detected into a target detection model to obtain a target detection result output by the target detection model; the target detection model is obtained by training based on training samples and corresponding target truth values; the target detection model performs specific target detection on the picture to be detected based on the face features extracted from the picture to be detected to obtain the target detection result.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of object detection, comprising:
inputting the acquired picture to be detected into a target detection model to obtain a target detection result output by the target detection model;
the target detection model is obtained by training based on a training sample and a corresponding target truth value;
the target detection model is used for carrying out specific target detection on the picture to be detected based on the face features extracted from the picture to be detected to obtain a target detection result.
2. The object detection method of claim 1, wherein the object detection result comprises a face detection result and a specific object detection result, and the object detection model comprises:
the human face feature extraction layer is used for extracting features based on the picture to be detected to obtain human face features;
the specific target extraction layer is used for extracting the features of the picture to be detected based on the human face features to obtain specific target features;
the face detection layer is used for detecting based on the face features to obtain a face detection result;
and the specific target detection layer is used for detecting the specific target characteristics to obtain a specific target detection result.
3. The target detection method according to claim 2, wherein the extracting the features of the picture to be detected based on the facial features comprises:
determining a region to be detected of the picture to be detected based on the human face features;
and extracting the characteristics of the area to be detected to obtain the specific target characteristics.
4. The method of claim 1, wherein training the object detection model comprises:
acquiring a training sample and a target truth value corresponding to the training sample;
and training the model to be trained by taking the training sample as input data used for training and taking the target truth value corresponding to the training sample as a label to obtain the target detection model for generating the target detection result of the picture to be detected.
5. The method of claim 4, wherein the training the model to be trained comprises:
inputting the training sample into the model to be trained to obtain a human face prediction result and a specific target prediction result output by the model to be trained;
constructing a face loss function according to the face prediction result and a target truth value corresponding to the face prediction result;
constructing a specific target loss function according to the specific target prediction result and a target truth value corresponding to the specific target prediction result;
and obtaining a total loss function based on the face loss function and the specific target loss function, converging based on the total loss function, and finishing the training.
6. The object detection method of claim 5, wherein the total loss function is expressed as:
L = L_T + m·L_face
wherein L represents the total loss function, L_T represents the specific target loss function, L_face represents the face loss function, and m represents a learnable variable characterizing the correlation between the face and the specific target.
7. An object detection device, comprising:
the target detection module is used for inputting the acquired picture to be detected into a target detection model to obtain a target detection result output by the target detection model;
the target detection model is obtained by training based on a training sample and a corresponding target truth value;
the target detection model is used for carrying out specific target detection on the picture to be detected based on the face features extracted from the picture to be detected to obtain a target detection result.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the object detection method according to any of claims 1 to 6 are implemented when the processor executes the program.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the object detection method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the object detection method according to any one of claims 1 to 6 when executed by a processor.
CN202111447515.2A 2021-11-30 2021-11-30 Target detection method and device Pending CN114399801A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111447515.2A CN114399801A (en) 2021-11-30 2021-11-30 Target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111447515.2A CN114399801A (en) 2021-11-30 2021-11-30 Target detection method and device

Publications (1)

Publication Number Publication Date
CN114399801A true CN114399801A (en) 2022-04-26

Family

ID=81225787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111447515.2A Pending CN114399801A (en) 2021-11-30 2021-11-30 Target detection method and device

Country Status (1)

Country Link
CN (1) CN114399801A (en)

Similar Documents

Publication Publication Date Title
Jin et al. Learning to extract a video sequence from a single motion-blurred image
CN109272509B (en) Target detection method, device and equipment for continuous images and storage medium
CN112132156A (en) Multi-depth feature fusion image saliency target detection method and system
CN112329702B (en) Method and device for rapid face density prediction and face detection, electronic equipment and storage medium
CN111667001B (en) Target re-identification method, device, computer equipment and storage medium
CN112418195B (en) Face key point detection method and device, electronic equipment and storage medium
CN111988638A (en) Method and device for acquiring spliced video, electronic equipment and storage medium
CN110660102B (en) Speaker recognition method, device and system based on artificial intelligence
CN112381104A (en) Image identification method and device, computer equipment and storage medium
CN112818955B (en) Image segmentation method, device, computer equipment and storage medium
CN110570375B (en) Image processing method, device, electronic device and storage medium
CN110084306B (en) Method and apparatus for generating dynamic image
CN113255549B (en) Intelligent recognition method and system for behavior state of wolf-swarm hunting
CN114399801A (en) Target detection method and device
CN114359669A (en) Picture analysis model adjusting method and device and computer readable storage medium
CN112818743B (en) Image recognition method and device, electronic equipment and computer storage medium
CN114511702A (en) Remote sensing image segmentation method and system based on multi-scale weighted attention
CN115375960A (en) Target detection method and device
CN113920023A (en) Image processing method and device, computer readable medium and electronic device
CN111753793B (en) Model training method and device, face screening method and electronic equipment
CN112329606B (en) Living body detection method, living body detection device, electronic equipment and readable storage medium
Kim Lifelong Learning Architecture of Video Surveillance System
CN115222774A (en) Target tracking method, device, equipment and storage medium
Cederin et al. Automatic object detection and tracking for eye-tracking analysis
CN115115911A (en) Image model training method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination