CN112528977B - Target detection method, target detection device, electronic equipment and storage medium - Google Patents

Target detection method, target detection device, electronic equipment and storage medium

Info

Publication number
CN112528977B
CN112528977B (granted publication of application CN202110180875.4A)
Authority
CN
China
Prior art keywords
feature
fusion
picture
detection
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110180875.4A
Other languages
Chinese (zh)
Other versions
CN112528977A (en)
Inventor
李超超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youmu Technology Co ltd
Original Assignee
Beijing Youmu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youmu Technology Co ltd filed Critical Beijing Youmu Technology Co ltd
Priority to CN202110180875.4A
Publication of CN112528977A
Application granted
Publication of CN112528977B
Priority to PCT/CN2021/111385 (published as WO2022170742A1)
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a target detection method, a target detection device, electronic equipment and a storage medium, wherein the target detection method comprises the following steps: acquiring a feature extraction network, a feature fusion network and a multi-task detector which are obtained by utilizing a fusion picture set to perform model training, wherein the fusion picture set is obtained by utilizing each single-task detection model to detect and mark the position, the type, the attribute and the key point of a preset target in each picture in a preset picture set; extracting the features of the picture to be detected through a feature extraction network to obtain a plurality of original feature maps with different sizes; performing feature fusion on the original feature maps through a feature fusion network to obtain a first preset number of fusion feature maps with different sizes; and performing feature detection on the fusion feature map through a multitask detector to obtain the position, type, attribute and key point of the target to be detected. The embodiment of the invention can reduce the workload of model training, realize multi-task one-stop detection and improve the detection efficiency.

Description

Target detection method, target detection device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a target detection method, a target detection device, electronic equipment and a storage medium.
Background
In traditional target detection, a detector can only perform a single task, for example detecting the position of a target. To detect a plurality of tasks at the same time, a plurality of detectors need to be trained, and the plurality of tasks are then carried out sequentially and serially by the plurality of detectors, so that the model training workload is large and the overall detection efficiency is low.
Disclosure of Invention
The embodiment of the invention provides a target detection method, a target detection device, electronic equipment and a storage medium, which can reduce the workload of model training, realize multi-task one-stop detection and improve the detection efficiency.
In a first aspect, an embodiment of the present invention provides a target detection method, including:
acquiring a feature extraction network, a feature fusion network and a multi-task detector which are obtained by utilizing a fusion picture set to perform model training, wherein the fusion picture set is obtained by utilizing each single-task detection model to detect and mark the position, the type, the attribute and the key point of a preset target in each picture in a preset picture set;
extracting the features of the picture to be detected through the feature extraction network to obtain a plurality of original feature maps with different sizes;
performing feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
and performing feature detection on the fusion feature map through the multitask detector to obtain the position, type, attribute and key point of the target to be detected.
In a second aspect, an embodiment of the present invention provides an object detection apparatus, including:
an acquisition module, used for acquiring a feature extraction network, a feature fusion network and a multi-task detector which are obtained by performing model training with a fused picture set, wherein the fused picture set is obtained by using each single-task detection model to detect and mark the position, the type, the attribute and the key points of a preset target in each picture in the preset picture set;
the extraction module is used for extracting the features of the picture to be detected through the feature extraction network to obtain a plurality of original feature maps with different sizes;
the fusion module is used for carrying out feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
and the detection module is used for carrying out feature detection on the fusion feature map through the multitask detector to obtain the position, the type, the attribute and the key point of the target to be detected.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the object detection method according to the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements an object detection method according to an embodiment of the present invention.
In the embodiment of the invention, a model trained based on a fused picture set for multi-task detection can be obtained, the fused picture set is obtained by detecting and marking the position, the type, the attribute and the key point of a preset target in each picture in the preset picture set by utilizing each single-task detection model, multi-task one-stop detection is realized based on the model obtained by training, and the detection efficiency is improved; in addition, the model training of the embodiment of the invention is based on the fusion picture set, the multi-task detector is trained in one step, and a plurality of detectors for detecting a plurality of tasks do not need to be trained respectively, so that the workload of the model training is reduced; furthermore, in the detection process, the multi-task shares the feature extraction network and the feature fusion network, so that the network utilization rate is improved, the calculated amount is reduced, and the overall detection efficiency is improved; in addition, target detection is carried out based on the fusion characteristic graphs of different sizes, targets of different sizes can be detected, and detection accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flow chart of a target detection method according to an embodiment of the present invention.
Fig. 2 is a schematic sub-flow chart of a target detection method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating an effect of the target detection method according to the embodiment of the present invention.
Fig. 4 is a flowchart illustrating an obtaining method of a fused picture set according to an embodiment of the present invention.
Fig. 5 is a schematic diagram illustrating an effect of a training picture according to an embodiment of the present invention.
Fig. 6 is a schematic flowchart of a model training method according to an embodiment of the present invention.
Fig. 7 is a network diagram of a model training process according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1 is a schematic flowchart of an object detection method according to an embodiment of the present invention, which may be implemented by an object detection apparatus according to an embodiment of the present invention, where the apparatus may be implemented in software and/or hardware. In a specific embodiment, the apparatus may be integrated into an electronic device, and the electronic device may be a mobile phone, a Personal Computer (PC), a tablet Computer, a notebook Computer, a desktop Computer, or the like. The following embodiments will be described taking as an example the integration of the device in an electronic apparatus. Referring to fig. 1, the method may specifically include the following steps:
step 101, obtaining a feature extraction network, a feature fusion network and a multi-task detector which are obtained by performing model training by using a fusion picture set, wherein the fusion picture set is obtained by detecting and marking the position, the type, the attribute and the key point of a preset target in each picture in the preset picture set by using each single-task detection model.
For example, a single-task detection model is a model for detecting a single task; it may be an existing model selected according to the actual detection requirements, or a model trained according to the actual detection requirements. The preset target may be set according to the actual detection requirements; for example, if a model for performing target detection on a speech scene is to be trained, the preset target may include human faces and hands. Illustratively, the single-task detection models may include: a model for detecting and positioning the human face, a model for detecting and positioning gestures, a model for detecting the attributes of the human face, and a model for detecting key points of the human face. The attributes of the face include, for example, expression, gaze, gender and age, and the key points of the face include, for example, the eyes, mouth and nose.
Specifically, the preset picture set may be a set formed by a large number of pictures collected from an open platform and related to a scene to be detected, and after each picture in the preset picture set is detected and labeled by each single task detection model, each picture in the fused picture set is a picture labeled with a position, a type, an attribute and a key point of a target.
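For illustration only, a single entry in such a fused picture set might be organized as a record like the one below. This is a hypothetical sketch: the field names and values are illustrative, and the embodiment does not prescribe any particular storage format.

```python
# Hypothetical annotation record for one picture in the fused picture set.
# All field names and values are illustrative only.
fused_annotation = {
    "image": "speech_frame_0001.jpg",
    "targets": [
        {
            "type": "face",                        # from the face detection and positioning model
            "box": [120, 64, 260, 210],            # x1, y1, x2, y2
            "attributes": {"expression": "smile",  # from the face attribute detection model
                           "gender": "female"},
            "keypoints": {"left_eye": [160, 110],  # from the face key point detection model
                          "right_eye": [215, 112],
                          "nose": [188, 150],
                          "mouth": [190, 185]},
        },
        {
            "type": "hand",                        # from the gesture detection and positioning model
            "box": [300, 220, 380, 310],
        },
    ],
}
```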
Step 102, performing feature extraction on the picture to be detected through a feature extraction network to obtain a plurality of original feature maps with different sizes.
For example, the picture to be detected may be an independent picture, or may be a series of pictures taken from a video, for example, the picture to be detected may be from a video generated by real-time recording, or may be from a pre-generated video.
Specifically, the pictures to be detected can be input into the feature extraction network, so that pooling and sampling operations are performed on each picture by using the feature extraction network, and a plurality of original feature maps with different sizes corresponding to the pictures to be detected are obtained. In a specific embodiment, the feature graphs extracted by the feature extraction network are sequentially reduced in size from top to bottom, and in order to improve the processing efficiency, a first preset number of feature graphs can be selected from the middle-lower layer feature graphs extracted by the feature extraction network to serve as original feature graphs. The first preset number can be a user-defined value according to an actual situation, and can be, for example, 2 or 3.
And 103, performing feature fusion on the original feature maps through a feature fusion network to obtain a first preset number of fusion feature maps with different sizes.
The original feature maps with different sizes are subjected to feature fusion, so that the semantics of the feature maps can be enhanced, and the prediction accuracy is improved. The specific feature fusion method may, for example, unify the number of channels of each original feature map, start from the original feature map at the lowest layer, perform an upsampling operation on the original feature map at the lower layer, and then perform feature addition, convolution fusion and other operations on the original feature map at the adjacent upper layer, so as to obtain a fused feature map.
Step 104, performing feature detection on the fusion feature maps through a multi-task detector to obtain the position, type, attribute and key points of the target to be detected.
Specifically, the method for detecting features by using a multi-tasking detector can be shown in fig. 2, and includes the following steps:
step 1041, inputting the fused feature map into a multitask detector to obtain predicted output data.
And 1042, decoding the prediction output data to obtain prediction frame data.
The prediction frame data may include prediction frame position data, type data, key point data and attribute data, where the position data may include the center point coordinates and the width and height of the prediction frame, the type data may include a type name (such as face or hand) and a type confidence, the key point data may include the coordinates of the respective key points, and the attribute data may include an attribute name (such as smile, anger or embarrassment) and an attribute confidence.
Step 1043, filtering the prediction frame data to obtain target frame data;
for example, the prediction box data with the confidence lower than the preset confidence threshold may be filtered, and the NMS algorithm may be applied to the remaining prediction box data to filter the prediction box data with a large overlap degree, so as to finally obtain the target box data.
Step 1044, marking the original feature map according to the target frame data to obtain the position, type, attribute and key points of the target to be detected.
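As a rough illustration of the filtering in step 1043, the following Python sketch keeps only the prediction frames whose confidence reaches a threshold and then applies greedy non-maximum suppression to drop highly overlapping frames. The threshold values are illustrative assumptions, and the offset decoding of step 1042 (which depends on the prior-box parameterization) is omitted.

```python
import numpy as np

def filter_prediction_boxes(pred_boxes, pred_scores, conf_thresh=0.5, iou_thresh=0.45):
    """Sketch of step 1043: confidence filtering followed by greedy NMS.

    pred_boxes : (N, 4) array of decoded boxes as (x1, y1, x2, y2)
    pred_scores: (N,)   array of type confidences
    """
    keep = pred_scores >= conf_thresh                      # drop low-confidence predictions
    boxes, scores = pred_boxes[keep], pred_scores[keep]

    order = scores.argsort()[::-1]                         # process highest score first
    selected = []
    while order.size > 0:
        i = order[0]
        selected.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        overlap = inter / (area_i + areas - inter)         # IoU with the remaining boxes
        order = order[1:][overlap <= iou_thresh]           # discard boxes that overlap too much
    return boxes[selected], scores[selected]
```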
When the picture to be detected comes from the video, the target detection method is sequentially executed for each image in the video, and dynamic and continuous detection results can be seen in the video.
For example, for a speech video, the detection results of various speech indexes can be dynamically displayed in the video through the detection method provided by the embodiment of the invention; for a real-time speech scene, the speaker can adjust his or her delivery according to the detection results presented in the video, so as to achieve a good speech effect. Furthermore, the speaker can be scored according to the detection results.
In a specific embodiment, as shown in fig. 3, the detection result presented by using the target detection method provided by the embodiment of the present invention may be represented by a picture in which a detection frame of a Face and a type thereof (Face), a detection frame of a gesture and a type thereof (Hand) are presented, and attributes (e.g., expression) and key points (e.g., eyes, nose, mouth) of the Face are labeled in the detection frame of the Face, and numbers in the figure represent confidence.
In the embodiment of the invention, a method for detecting and labeling the preset picture set by using a plurality of single-task detection models can be used for obtaining the fusion picture set, a model for multi-task detection is trained based on the fusion picture set, multi-task one-stop detection is realized based on the trained model, and the detection efficiency is improved; in addition, the model training of the embodiment of the invention is based on the fusion picture set, the multi-task detector is trained in one step, and a plurality of detectors for detecting a plurality of tasks do not need to be trained respectively, so that the workload of the model training is reduced, and convenience is provided for the model transformation and deployment; furthermore, in the detection process, the multi-task shares the feature extraction network and the feature fusion network, so that the network utilization rate is improved, the calculated amount is reduced, and the overall detection efficiency is improved; in addition, target detection is carried out based on the fusion characteristic graphs of different sizes, targets of different sizes can be detected, and detection accuracy is improved.
In a specific embodiment, as shown in fig. 4, the fused picture set can be obtained as follows:
step 201, detecting and marking the face and the position in each picture in the preset picture set by using a face detection and positioning model.
For example, the face detection and positioning model may be used to detect the face in each picture in the preset picture set, and after the face is detected, the position of the face may be framed and the type may be labeled.
Step 202, detecting and marking the gesture and the position in each picture in the preset picture set by using a gesture detection and positioning model.
For example, a hand in each picture in the preset picture set may be detected by using the gesture detection and positioning model, and after the hand is detected, a frame may be added to the position of the hand and the type may be labeled.
Step 203, detecting and marking the face attributes in each picture in the preset picture set by using the face attribute detection model.
For example, the face attribute detection model may be used to detect the face attributes in each picture in the preset picture set, such as expression (e.g., smile, laugh, embarrassment), gaze (e.g., point view, virtual view, and circular view), gender, age, and the like, and mark the attributes at the corresponding positions of the pictures according to the detection results.
Step 204, detecting and marking the face key points in each picture in the preset picture set by using a face key point detection model.
For example, a face key point detection model may be used to detect face key points, such as eyes, mouth, nose, etc., in each picture in a preset picture set, and mark key points at corresponding positions of the pictures according to the detection result.
After the position, the type, the attribute and the key points of the preset target in each picture in the preset picture set are detected and marked, the fused picture set is obtained. For example, fig. 5 shows one picture in the fused picture set; in fig. 5, the labeled types include Face and Hand, the labeled attribute is the expression No-smile, and the labeled key points are the eyes.
In a specific embodiment, as shown in fig. 6, the method for obtaining the feature extraction network, the feature fusion network, and the multi-task detector by using the fusion image set to perform model training may be as follows:
step 301, determining a real frame of a preset target according to the label in the fusion picture set.
In a specific embodiment, each preset target corresponds to one real frame (ground truth), and the determined real frame may include a position tag (which may include the center point coordinates and the width and height of the real frame), a type tag, an attribute tag and a key point tag.
Step 302, inputting each picture in the fused picture set into an initial feature extraction network to obtain a plurality of original training feature maps with different sizes corresponding to each picture.
In a specific implementation, the initial feature extraction network may be a lightweight feature extraction network, including but not limited to networks such as MobileNet, ShuffleNet, SqueezeNet, and the like, and various iterative versions thereof. In order to meet the requirements of multiple aspects such as detection speed, model size, detection precision and the like, in the embodiment of the invention, the MobileNet-V2 can be selected as an initial feature extraction network, and the width scaling factor of the MobileNet-V2 is set to be 0.35, so that the parameter quantity of the initial feature extraction network is reduced to about 30M, and a foundation is laid for realizing the lightweight and low-delay performance of a mobile terminal model.
Exemplarily, each picture in the fused picture set is input into an initial feature extraction network, so that pooling and sampling operations are performed on each picture by using the initial feature extraction network, and a plurality of original training feature maps with different sizes corresponding to each picture are obtained.
For example, the size of the picture input into the initial feature extraction network is (224, 224, 3), and after feature extraction by the initial feature extraction network, feature maps of 1/2, 1/4, 1/8, 1/16, and 1/32, i.e., feature maps of (112, 112, 16), (56, 56, 24), (28, 28, 32), (14, 14, 96), (7, 7, 160) of the original image, can be obtained, where a large-size feature map can be used to detect a small-size object, and a small-size feature map can be used to detect a large-size object. In order to improve the processing efficiency, the feature map of the middle and lower layers extracted by the initial feature extraction network may be selected as the original training feature map, and for example, the feature maps with the sizes of (28, 28, 32), (14, 14, 96), (7, 7, 160) may be used as the original training feature map.
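As a hedged illustration of the multi-scale behaviour described above, the following PyTorch sketch uses plain strided convolutions as a stand-in for the MobileNet-V2 backbone; it only reproduces the three output sizes quoted in the text and is not the actual network of the embodiment.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in for the initial feature extraction network.

    From a 224x224 input it returns feature maps at 1/8, 1/16 and 1/32
    resolution with the channel counts quoted in the text (32, 96, 160).
    The real network would use MobileNet-V2 inverted-residual blocks with
    a 0.35 width scaling factor.
    """
    def __init__(self):
        super().__init__()
        self.stage8 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=8, padding=1), nn.ReLU())
        self.stage16 = nn.Sequential(nn.Conv2d(32, 96, 3, stride=2, padding=1), nn.ReLU())
        self.stage32 = nn.Sequential(nn.Conv2d(96, 160, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        c3 = self.stage8(x)    # (B, 32, 28, 28) for a 224x224 input
        c4 = self.stage16(c3)  # (B, 96, 14, 14)
        c5 = self.stage32(c4)  # (B, 160, 7, 7)
        return c3, c4, c5

feats = ToyBackbone()(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in feats])  # the three original training feature maps
```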
Step 303, inputting the original training feature maps into the initial feature fusion network to obtain a first preset number of fusion training feature maps with different sizes corresponding to each picture.
For example, the initial feature fusion network may be a feature pyramid network, that is, the feature pyramid network may be used to perform feature fusion on original training feature maps of different sizes, so as to enhance the semantics of the feature maps, and obtain a first preset number of fused training feature maps of different sizes.
For example, as shown in fig. 7, the original training feature maps are (28, 28, 32), (14, 14, 96) and (7, 7, 160), the original training feature maps with the scales of (28, 28, 32), (14, 14, 96) and (7, 7, 160) may be first convolved by 1 × 1, the number of channels may be uniformly adjusted to 64, feature maps with the scales of (28, 28, 64), (14, 14, 64) and (7, 7, 64) may be obtained, and the feature map with the scale of (7, 7, 64) may be directly used as one fused training feature map.
An up-sampling operation is then performed on the (7, 7, 64) feature map to obtain a (14, 14, 64) feature map, and this map is subjected to feature addition and convolution fusion with the (14, 14, 64) feature map obtained after channel adjustment, so as to obtain a fused training feature map with the size of (14, 14, 64).
Similarly, the fused (14, 14, 64) feature map is up-sampled to (28, 28, 64) and then subjected to feature addition and convolution fusion with the (28, 28, 64) feature map obtained after channel adjustment, so as to obtain a fused training feature map with the size of (28, 28, 64). That is, in the example of fig. 7, there are 3 output fused training feature maps.
It should be noted that the above-described feature fusion method is only an example, and in practical applications, other feature fusion methods may also be adopted, and are not specifically limited herein.
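For concreteness, the pyramid-style fusion described above might be sketched as follows. The channel numbers and map sizes follow the example in the text; the layer names are hypothetical, and normalisation, activations and other details of a real implementation are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionSketch(nn.Module):
    """Minimal sketch of the feature fusion: 1x1 convolutions unify the
    channel number to 64, the smaller map is upsampled, added to the next
    larger map and smoothed with a 3x3 convolution."""
    def __init__(self, in_channels=(32, 96, 160), mid=64):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(mid, mid, 3, padding=1) for _ in in_channels[:-1]])

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)                                    # (B, 64, 7, 7)
        p4 = self.smooth[1](self.lateral[1](c4) +
                            F.interpolate(p5, scale_factor=2))      # (B, 64, 14, 14)
        p3 = self.smooth[0](self.lateral[0](c3) +
                            F.interpolate(p4, scale_factor=2))      # (B, 64, 28, 28)
        return p3, p4, p5

c3 = torch.randn(1, 32, 28, 28)
c4 = torch.randn(1, 96, 14, 14)
c5 = torch.randn(1, 160, 7, 7)
print([tuple(p.shape) for p in FusionSketch()(c3, c4, c5)])  # three fused training feature maps
```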
And step 304, setting a second preset number of different-size prior frames at each pixel point of the fusion training feature map.
In one possible implementation, the prior box may be a rectangular box, and the prior box may be set as follows: determining the size of the fusion training feature maps, determining the prior frame size for each fusion training feature map according to the preset relation between the prior frame size and the feature map size, and then setting the prior frames with different sizes in a second preset number on each pixel point of the corresponding fusion training feature maps according to the prior frame size determined for each fusion training feature map and the preset prior frame number (namely, the second preset number). The second preset number can be a user-defined value according to an actual situation, and can be, for example, 2 or 3. In the example shown in fig. 7, the number of the prior frames set on each pixel point of the fused training feature map is 2.
In some embodiments, the prior frame may also be set in other manners, such as according to the size of a preset target, which is not specifically limited herein.
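A minimal sketch of the prior-box placement in step 304 is given below. The boxes are drawn square for simplicity and the per-map sizes are purely illustrative, since the embodiment only states that the sizes follow a preset relation to the feature-map size.

```python
import numpy as np

def make_prior_boxes(feature_size, image_size, box_sizes):
    """Place len(box_sizes) square prior boxes at every pixel of one fused
    training feature map; box_sizes are given in image pixels."""
    stride = image_size / feature_size
    priors = []
    for y in range(feature_size):
        for x in range(feature_size):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # box centre in image coordinates
            for s in box_sizes:
                priors.append([cx, cy, s, s])                # (cx, cy, w, h)
    return np.array(priors)

# two prior boxes per pixel on the 28x28, 14x14 and 7x7 fused feature maps
for fs, sizes in [(28, (16, 32)), (14, (64, 96)), (7, (128, 192))]:
    print(fs, make_prior_boxes(fs, 224, sizes).shape)
```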
Step 305, calculating the intersection ratio of each prior frame and the real frame.
For example, the intersection ratio may be calculated according to the following formula:
IoU(A, B) = area(A ∩ B) / area(A ∪ B)

where IoU denotes the intersection ratio, A denotes the prior box, and B denotes the real box.
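For example, with boxes given by their corner coordinates, the intersection ratio can be computed as in the following plain-Python sketch.

```python
def iou(box_a, box_b):
    """Intersection ratio of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```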
Step 306, selecting suggestion boxes from the prior boxes according to the intersection ratio.
Because each pixel point of the fused training feature map is provided with a second preset number of prior frames of different sizes, the total number of prior frames is large. In order to improve the processing efficiency, suggestion frames can be selected from the prior frames according to the intersection ratio. The specific selection method can be as follows:
(1) selecting the prior frames whose intersection ratio is greater than a preset threshold as suggestion frames, where the preset threshold can be set in advance according to the actual situation, for example 0.6 or 0.7;
(2) selecting the prior frame with the largest intersection ratio as a suggestion frame.
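Assuming the intersection ratios between all prior boxes and all real boxes have already been computed, the two selection rules might be applied as in the following sketch; the threshold of 0.6 is one of the example values given above.

```python
import numpy as np

def select_suggestion_boxes(iou_matrix, threshold=0.6):
    """Pick suggestion boxes from the prior boxes for each real box.

    iou_matrix: (num_priors, num_real_boxes) array of intersection ratios.
    Rule 1 keeps every prior box whose IoU exceeds the threshold; rule 2
    always keeps the prior box with the largest IoU, so each real box has
    at least one suggestion box.
    """
    suggestions = {}
    for j in range(iou_matrix.shape[1]):
        above = {int(i) for i in np.flatnonzero(iou_matrix[:, j] > threshold)}
        above.add(int(iou_matrix[:, j].argmax()))   # best-matching prior box
        suggestions[j] = sorted(above)
    return suggestions

ious = np.array([[0.72, 0.10],
                 [0.40, 0.05],
                 [0.65, 0.55]])
print(select_suggestion_boxes(ious))  # {0: [0, 2], 1: [2]}
```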
Specifically, each preset target may correspond to one real box and at least one suggestion box, and in the training stage, target detection may be performed based on the suggestion box corresponding to each preset target.
Step 307, inputting the fused training feature map into the initial multitask detector, so that the initial multitask detector performs feature detection based on the suggestion boxes to obtain training output data.
Step 308, calculating a loss function based on the training output data and the real box.
Specifically, the real frame may be understood as a frame on the original picture, and the real frame may be encoded onto the fused training feature map. A type confidence loss function L_conf, a target position offset loss function L_loc, a key point position offset loss function L_landmark and an attribute confidence loss function L_attr may then be calculated according to the training output data and the encoded real frame, and the loss function L is calculated according to L_conf, L_loc, L_landmark and L_attr.

The type confidence loss function is a cross-entropy loss over the types:

L_conf = -Σ_{i=1}^{c} y_i · log(p_i)

where c represents the number of types (particularly, in the embodiment of the invention c takes the value 3, namely the three types of human face, hand and background), y_i indicates whether the sample belongs to the i-th type (0 or 1), and p_i represents the predicted probability of belonging to the i-th type.

The target position offset loss function is a smooth L1 loss:

L_loc = Σ_{j=1}^{m} smooth_L1(l_j - g_j),  with  smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise

where m represents the number of coordinate values corresponding to the upper-left and lower-right corners of the frame and usually takes the value 4, smooth_L1 represents the smooth loss function, l_j represents the prediction offset corresponding to the suggestion frame, g_j represents the true offset corresponding to the suggestion frame, and the argument x = l_j - g_j is the difference between the real coordinates and the predicted coordinates of the preset target. The prediction offset l_j can be obtained from the position data of the output frame corresponding to the suggestion frame and the position data corresponding to the real frame, and the true offset g_j can be obtained from the position data of the suggestion frame and the position data corresponding to the real frame.

The key point position offset loss function takes the same smooth L1 form:

L_landmark = Σ_{j=1}^{k} smooth_L1(l'_j - g'_j)

where k represents the number of coordinate values of the key points, l'_j represents the key point prediction offset corresponding to the suggestion frame, g'_j represents the key point true offset corresponding to the suggestion frame, and the argument of smooth_L1 is the difference between the real coordinates and the predicted coordinates of the key points.

The attribute confidence loss function is a cross-entropy loss over the attributes:

L_attr = -Σ_{i} a_i · log(q_i)

where a_i indicates whether the sample belongs to the i-th attribute (0 or 1), and q_i represents the predicted probability of belonging to the i-th attribute.

The loss function L is then:

L = (α·L_conf + β·L_loc + γ·L_landmark + δ·L_attr) / N

where N represents the number of suggestion frames, and α, β, γ and δ represent the weights; each weight can be set to a custom value according to the actual situation.
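A hedged PyTorch sketch of this combined loss is given below. The cross-entropy and smooth-L1 forms follow the formulas above; treating the attribute loss as a multi-label binary cross-entropy is an assumption, and the weights and reductions are placeholders.

```python
import torch.nn.functional as F

def total_loss(cls_logits, cls_target, loc_pred, loc_target,
               lm_pred, lm_target, attr_logits, attr_target,
               n_boxes, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Sketch of L = (alpha*L_conf + beta*L_loc + gamma*L_landmark + delta*L_attr) / N."""
    l_conf = F.cross_entropy(cls_logits, cls_target, reduction="sum")      # type confidence loss
    l_loc = F.smooth_l1_loss(loc_pred, loc_target, reduction="sum")        # box offset loss
    l_landmark = F.smooth_l1_loss(lm_pred, lm_target, reduction="sum")     # key point offset loss
    l_attr = F.binary_cross_entropy_with_logits(attr_logits, attr_target,
                                                reduction="sum")           # attribute loss (assumed multi-label)
    return (alpha * l_conf + beta * l_loc + gamma * l_landmark + delta * l_attr) / n_boxes
```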
Step 309, performing back propagation on the loss function to optimize the model parameters, and obtaining a feature extraction network, a feature fusion network and a multi-task detector.
For example, a preset optimization algorithm, such as a random gradient descent algorithm, may be used to perform iterative optimization on the loss function, thereby optimizing the model parameters to obtain a feature extraction network, a feature fusion network, and a multi-task detector.
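For illustration, one optimisation step with stochastic gradient descent might look as follows; the model and loss here are trivial stand-ins for the real networks and the multi-task loss, and the hyper-parameters are placeholders.

```python
import torch

model = torch.nn.Linear(4, 4)                  # stand-in for backbone + fusion network + detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

loss = model(torch.randn(2, 4)).pow(2).mean()  # stand-in for the multi-task loss L
optimizer.zero_grad()
loss.backward()                                # back-propagate the loss function
optimizer.step()                               # optimise the model parameters
```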
Through the design of the multitask detector and the corresponding loss function, the multitask detection process of multiple targets can be fused, and the recall rate and the positioning accuracy of target detection are improved.
In a specific embodiment, the model training process shown in fig. 6 may be executed on a server or on an electronic device. When it is executed on a server, distributed training may be performed with multiple servers; the trained model is then converted into a format supported by the electronic device using a conversion tool, a performance test is performed on the trained model locally on the server using a corresponding interpreter, the model that passes the test and performs well is deployed in a preset application program, and the preset application program is installed on the electronic device, so that target detection is achieved with the deployed model. The whole model training process can be finished end to end at one time to obtain a multi-task detection model, multi-task detection is realized at one time, the workload of model training is reduced, and the difficulty and workload of converting and deploying the model on a mobile terminal are reduced.
Experiments prove that, when the model trained by the embodiment of the invention is used for target detection on electronic equipment, multi-task real-time detection and evaluation of indexes such as human faces, gestures, expressions and gaze in videos can be realized, and the requirements of a lightweight model, low delay and high precision are met. Specifically, on the premise that the detection accuracy is above 90%, the size of the application-side model can be compressed to within 1MB, and the detection speed can reach 20-25 FPS.
Fig. 8 is a schematic structural diagram of an object detection apparatus provided in an embodiment of the present disclosure, and as shown in fig. 8, the apparatus includes:
an obtaining module 401, configured to obtain a feature extraction network, a feature fusion network, and a multi-task detector, where the feature extraction network, the feature fusion network, and the multi-task detector are obtained by performing model training on a fusion image set, and the fusion image set is obtained by performing position, type, attribute, and key point detection and labeling on a preset target in each image in a preset image set by using each single-task detection model;
an extraction module 402, configured to perform feature extraction on a picture to be detected through the feature extraction network to obtain a plurality of original feature maps of different sizes;
a fusion module 403, configured to perform feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
and the detection module 404 is configured to perform feature detection on the fused feature map through the multitask detector to obtain a position, a type, an attribute, and a key point of the target to be detected.
In an embodiment, the fused picture set is obtained as follows:
detecting and marking the face and the position in each picture in the preset picture set by using a face detection and positioning model;
detecting and marking the gesture and the position in each picture in the preset picture set by using a gesture detection and positioning model;
detecting and marking the face attribute in each picture in the preset picture set by using a face attribute detection model; and
and detecting and marking the face key points in each picture in the preset picture set by using a face key point detection model.
In an embodiment, the method for obtaining the feature extraction network, the feature fusion network, and the multi-task detector by performing model training using the fusion image set includes:
determining a real frame of the preset target according to the label in the fusion picture set;
inputting each picture in the fused picture set into an initial feature extraction network to obtain a plurality of original training feature graphs with different sizes corresponding to each picture;
inputting the original training feature maps into an initial feature fusion network to obtain a first preset number of fusion training feature maps with different sizes corresponding to each picture;
setting a second preset number of prior frames with different sizes at each pixel point of the fusion training feature map;
inputting the fusion training feature map into an initial multi-task detector so that the initial multi-task detector performs feature detection based on the prior frame to obtain training output data;
calculating a loss function according to the training output data and the real frame;
and performing back propagation on the loss function to optimize model parameters to obtain the feature extraction network, the feature fusion network and the multitask detector.
In one embodiment, before inputting the fused training feature map into the initial multitask detector, the method further comprises:
calculating the intersection ratio of each prior frame and the real frame;
selecting a suggestion box from the prior boxes according to the intersection ratio;
inputting the fused training feature map into an initial multi-task detector to enable the initial multi-task detector to perform feature detection based on the prior frame to obtain training output data, wherein the method comprises the following steps:
and inputting the fused training feature map into the initial multitask detector, so that the initial multitask detector performs feature detection based on the suggestion box to obtain the training output data.
In one embodiment, the calculating a loss function from the training output data and the real box includes:
calculating a type confidence loss function L_conf, a target position offset loss function L_loc, a key point position offset loss function L_landmark and an attribute confidence loss function L_attr according to the training output data and the real frame; and calculating the loss function L according to L_conf, L_loc, L_landmark and L_attr.

In one embodiment,

L_conf = -Σ_{i=1}^{c} y_i · log(p_i)

where c represents the number of types, y_i indicates whether the sample belongs to the i-th type, and p_i represents the predicted probability of belonging to the i-th type;

L_loc = Σ_{j=1}^{m} smooth_L1(l_j - g_j),  with  smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise

where m represents the number of coordinate values of the frame, smooth_L1 represents the smooth loss function, l_j represents the prediction offset corresponding to the suggestion frame, g_j represents the true offset corresponding to the suggestion frame, and the argument x is the difference between the real coordinates and the predicted coordinates of the preset target;

L_landmark = Σ_{j=1}^{k} smooth_L1(l'_j - g'_j)

where k represents the number of coordinate values of the key points, l'_j represents the key point prediction offset corresponding to the suggestion frame, g'_j represents the key point true offset corresponding to the suggestion frame, and the argument of smooth_L1 is the difference between the real coordinates and the predicted coordinates of the key points;

L_attr = -Σ_{i} a_i · log(q_i)

where a_i indicates whether the sample belongs to the i-th attribute, and q_i represents the predicted probability of belonging to the i-th attribute.

In one embodiment,

L = (α·L_conf + β·L_loc + γ·L_landmark + δ·L_attr) / N

where N represents the number of suggestion frames, and α, β, γ and δ represent the weights.
In one embodiment, the feature extraction network comprises MobileNet-V2 and the feature fusion network comprises a feature pyramid network.
In an embodiment, the detecting module 404 performs feature detection on the fused feature map through the multi-task detector to obtain the position, the type, the attribute, and the key point of the target to be detected, including:
inputting the fusion characteristic diagram into the multitask detector to obtain prediction output data;
decoding the prediction output data to obtain prediction frame data;
filtering the prediction frame data to obtain target frame data;
and marking the original feature map according to the target frame data to obtain the position, the type, the attribute and the key points of the target to be detected.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the functional module, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
The device of the embodiment of the disclosure can acquire the model trained based on the fused picture set and used for multi-task detection, the fused picture set is obtained by detecting and marking the position, the type, the attribute and the key point of the preset target in each picture in the preset picture set by utilizing each single-task detection model, multi-task one-stop detection is realized based on the model obtained by training, and the detection efficiency is improved; in addition, the model training of the embodiment of the invention is based on the fusion picture set, the multi-task detector is trained in one step, and a plurality of detectors for detecting a plurality of tasks do not need to be trained respectively, so that the workload of the model training is reduced; furthermore, in the detection process, the multi-task shares the feature extraction network and the feature fusion network, so that the network utilization rate is improved, the calculated amount is reduced, and the overall detection efficiency is improved; in addition, target detection is carried out based on the fusion characteristic graphs of different sizes, targets of different sizes can be detected, and detection accuracy is improved.
The embodiment of the present invention further provides a target detection system, which includes an electronic device and a server, where the electronic device may obtain a trained feature extraction network, a feature fusion network, and a multitask detector from the server, and detect a to-be-detected picture based on the feature extraction network, the feature fusion network, and the multitask detector, and a specific detection process may refer to the foregoing embodiments, and details are not described here.
Referring now to FIG. 9, shown is a block diagram of a computer system 500 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 9, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units described in the embodiments of the present invention may be implemented by software, and may also be implemented by hardware. The described modules and/or units may also be provided in a processor, and may be described as: a processor includes an acquisition module, an extraction module, a fusion module, and a detection module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: acquiring a feature extraction network, a feature fusion network and a multi-task detector which are obtained by utilizing a fusion picture set to perform model training, wherein the fusion picture set is obtained by utilizing each single-task detection model to detect and mark the position, the type, the attribute and the key point of a preset target in each picture in a preset picture set; extracting the features of the picture to be detected through the feature extraction network to obtain a plurality of original feature maps with different sizes; performing feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes; and performing feature detection on the fusion feature map through the multitask detector to obtain the position, type, attribute and key point of the target to be detected.
According to the technical scheme of the embodiment of the invention, the model trained based on the fusion picture set for multi-task detection can be obtained, the fusion picture set is obtained by detecting and marking the position, the type, the attribute and the key point of the preset target in each picture in the preset picture set by utilizing each single-task detection model, the multi-task one-stop detection is realized based on the model obtained by training, and the detection efficiency is improved; in addition, the model training of the embodiment of the invention is based on the fusion picture set, the multi-task detector is trained in one step, and a plurality of detectors for detecting a plurality of tasks do not need to be trained respectively, so that the workload of the model training is reduced; furthermore, in the detection process, the multi-task shares the feature extraction network and the feature fusion network, so that the network utilization rate is improved, the calculated amount is reduced, and the overall detection efficiency is improved; in addition, target detection is carried out based on the fusion characteristic graphs of different sizes, targets of different sizes can be detected, and detection accuracy is improved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. An object detection method for a mobile terminal, comprising:
acquiring a lightweight feature extraction network, a feature fusion network and a multi-task detector obtained by model training with a fusion picture set, wherein the fusion picture set is obtained by using single-task detection models to detect and mark the position, type, attribute and key point of a preset target in each picture of a preset picture set, and is obtained specifically by: detecting and marking the face and its position in each picture of the preset picture set by using a face detection and positioning model; detecting and marking the gesture and its position in each picture of the preset picture set by using a gesture detection and positioning model; detecting and marking the face attribute in each picture of the preset picture set by using a face attribute detection model; and detecting and marking the face key points in each picture of the preset picture set by using a face key point detection model;
extracting the features of the picture to be detected through the lightweight feature extraction network to obtain a plurality of original feature maps with different sizes;
performing feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
performing one-stop feature detection on the fusion feature maps through the multi-task detector, so that multi-task detection is completed in a single pass to obtain the position, type, attribute and key point of a target to be detected;
the method for obtaining the lightweight feature extraction network, the feature fusion network and the multitask detector by utilizing the fusion picture set to carry out model training comprises the following steps:
determining a real frame of the preset target according to the label in the fusion picture set;
inputting each picture in the fused picture set into an initial lightweight feature extraction network to obtain a plurality of original training feature maps with different sizes corresponding to each picture;
inputting the original training feature maps into an initial feature fusion network to obtain a first preset number of fusion training feature maps with different sizes corresponding to each picture;
setting a second preset number of prior frames with different sizes at each pixel point of the fusion training feature map;
inputting the fusion training feature map into an initial multi-task detector so that the initial multi-task detector performs feature detection based on the prior frame to obtain training output data;
calculating a loss function according to the training output data and the real frame;
performing back propagation on the loss function to optimize model parameters to obtain the lightweight feature extraction network, the feature fusion network and the multitask detector;
wherein said calculating a loss function according to the training output data and the real frame comprises:
calculating, according to the training output data and the real frame, a type confidence loss function $L_{conf}$, a target position offset loss function $L_{loc}$, a key point position offset loss function $L_{landmark}$ and an attribute confidence loss function $L_{attr}$, and calculating the loss function $L$ according to $L_{conf}$, $L_{loc}$, $L_{landmark}$ and $L_{attr}$;
wherein $L_{conf} = -\sum_{i=1}^{c} x_i \log(p_i)$, where $c$ represents the number of types, $x_i$ indicates whether or not it belongs to the $i$-th type, and $p_i$ represents the predicted probability value of belonging to the $i$-th type;
$L_{loc} = \sum_{j=1}^{m} \operatorname{smooth}_{L1}(l_j - g_j)$, where $m$ represents the number of coordinate values of the frame, $\operatorname{smooth}_{L1}(\cdot)$ represents the smooth L1 loss function, $l_j$ represents the prediction offset corresponding to the suggestion box, $g_j$ represents the true offset corresponding to the suggestion box, and $l_j - g_j$ is the difference between the real coordinate and the predicted coordinate of the preset target;
$L_{landmark} = \sum_{j=1}^{k} \operatorname{smooth}_{L1}(l'_j - g'_j)$, where $k$ represents the number of coordinate values of the key points, $l'_j$ represents the key point prediction offset corresponding to the suggestion box, $g'_j$ represents the true key point offset corresponding to the suggestion box, and $l'_j - g'_j$ is the difference between the real coordinate and the predicted coordinate of the key point;
$L_{attr} = -\sum_{i} x'_i \log(p'_i)$, where $x'_i$ indicates whether or not it belongs to the $i$-th attribute, and $p'_i$ represents the predicted probability value of belonging to the $i$-th attribute.
2. The object detection method of claim 1, further comprising, before inputting the fusion training feature map into the initial multi-task detector:
calculating the intersection ratio of each prior frame and the real frame;
selecting suggestion boxes from the prior frames according to the intersection ratio;
wherein inputting the fusion training feature map into the initial multi-task detector so that the initial multi-task detector performs feature detection based on the prior frames to obtain training output data comprises:
inputting the fusion training feature map into the initial multi-task detector, so that the initial multi-task detector performs feature detection based on the suggestion boxes to obtain the training output data.
3. The object detection method according to claim 2, wherein the loss function is
$L = \frac{1}{N}\left(\alpha L_{conf} + \beta L_{loc} + \gamma L_{landmark} + \delta L_{attr}\right)$,
wherein $N$ represents the number of the suggestion boxes, and $\alpha$, $\beta$, $\gamma$ and $\delta$ represent the weights.
4. The object detection method of claim 1, wherein the lightweight feature extraction network comprises MobileNet-V2 and the feature fusion network comprises a feature pyramid network.
5. The object detection method according to claim 1, wherein performing feature detection on the fusion feature map through the multi-task detector to obtain the position, type, attribute and key point of the target to be detected comprises:
inputting the fusion feature map into the multi-task detector to obtain prediction output data;
decoding the prediction output data to obtain prediction frame data;
filtering the prediction frame data to obtain target frame data;
and marking the original feature map according to the target frame data to obtain the position, type, attribute and key point of the target to be detected.
6. An object detection apparatus for a mobile terminal, comprising:
an acquisition module, an extraction module, a fusion module and a detection module, wherein the acquisition module is used for acquiring a lightweight feature extraction network, a feature fusion network and a multi-task detector obtained by model training with a fusion picture set, the fusion picture set is obtained by using single-task detection models to detect and mark the position, type, attribute and key point of a preset target in each picture of a preset picture set, and the fusion picture set is obtained specifically by: detecting and marking the face and its position in each picture of the preset picture set by using a face detection and positioning model; detecting and marking the gesture and its position in each picture of the preset picture set by using a gesture detection and positioning model; detecting and marking the face attribute in each picture of the preset picture set by using a face attribute detection model; and detecting and marking the face key points in each picture of the preset picture set by using a face key point detection model;
the extraction module is used for extracting the features of the picture to be detected through the lightweight feature extraction network to obtain a plurality of original feature maps with different sizes;
the fusion module is used for carrying out feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
the detection module is used for performing one-stop feature detection on the fusion feature maps through the multi-task detector, so that multi-task detection is completed in a single pass to obtain the position, type, attribute and key point of a target to be detected;
wherein obtaining the lightweight feature extraction network, the feature fusion network and the multi-task detector by model training with the fusion picture set comprises:
determining a real frame of the preset target according to the label in the fusion picture set;
inputting each picture in the fused picture set into an initial lightweight feature extraction network to obtain a plurality of original training feature maps with different sizes corresponding to each picture;
inputting the original training feature maps into an initial feature fusion network to obtain a first preset number of fusion training feature maps with different sizes corresponding to each picture;
setting a second preset number of prior frames with different sizes at each pixel point of the fusion training feature map;
inputting the fusion training feature map into an initial multi-task detector so that the initial multi-task detector performs feature detection based on the prior frame to obtain training output data;
calculating a loss function according to the training output data and the real frame;
performing back propagation on the loss function to optimize model parameters to obtain the lightweight feature extraction network, the feature fusion network and the multitask detector;
wherein said calculating a loss function according to the training output data and the real frame comprises:
calculating, according to the training output data and the real frame, a type confidence loss function $L_{conf}$, a target position offset loss function $L_{loc}$, a key point position offset loss function $L_{landmark}$ and an attribute confidence loss function $L_{attr}$, and calculating the loss function $L$ according to $L_{conf}$, $L_{loc}$, $L_{landmark}$ and $L_{attr}$;
wherein $L_{conf} = -\sum_{i=1}^{c} x_i \log(p_i)$, where $c$ represents the number of types, $x_i$ indicates whether or not it belongs to the $i$-th type, and $p_i$ represents the predicted probability value of belonging to the $i$-th type;
$L_{loc} = \sum_{j=1}^{m} \operatorname{smooth}_{L1}(l_j - g_j)$, where $m$ represents the number of coordinate values of the frame, $\operatorname{smooth}_{L1}(\cdot)$ represents the smooth L1 loss function, $l_j$ represents the prediction offset corresponding to the suggestion box, $g_j$ represents the true offset corresponding to the suggestion box, and $l_j - g_j$ is the difference between the real coordinate and the predicted coordinate of the preset target;
$L_{landmark} = \sum_{j=1}^{k} \operatorname{smooth}_{L1}(l'_j - g'_j)$, where $k$ represents the number of coordinate values of the key points, $l'_j$ represents the key point prediction offset corresponding to the suggestion box, $g'_j$ represents the true key point offset corresponding to the suggestion box, and $l'_j - g'_j$ is the difference between the real coordinate and the predicted coordinate of the key point;
$L_{attr} = -\sum_{i} x'_i \log(p'_i)$, where $x'_i$ indicates whether or not it belongs to the $i$-th attribute, and $p'_i$ represents the predicted probability value of belonging to the $i$-th attribute.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the object detection method as claimed in any one of claims 1 to 5 when executing the program.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the object detection method according to any one of claims 1 to 5.
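For illustration only, the sketch below shows, under stated assumptions, how the loss function recited in claims 1 and 3 (as reconstructed above), the intersection-ratio matching of claim 2 and the decoding and filtering of claim 5 could be expressed; the PyTorch/torchvision environment, the weight and threshold defaults, the use of binary cross-entropy for the attribute term and the center-size box encoding are assumptions of the sketch, not details disclosed in the patent.

```python
# Hedged sketch of three steps recited in the claims: the multi-task loss of
# claims 1 and 3, the prior-frame matching of claim 2, and the decode-and-filter
# post-processing of claim 5. Weights, thresholds and the box encoding are
# illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou, nms

def multitask_loss(type_logits, type_target,     # [N, c] logits, [N] class indices
                   box_pred, box_true,           # [N, m] predicted / true frame offsets
                   kpt_pred, kpt_true,           # [N, k] predicted / true key point offsets
                   attr_logits, attr_target,     # [N, a] logits, [N, a] 0/1 attribute labels
                   alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    n = type_logits.shape[0]                     # N: number of suggestion boxes
    # L_conf = -sum_i x_i log(p_i): type confidence loss (cross-entropy).
    l_conf = F.cross_entropy(type_logits, type_target, reduction="sum")
    # L_loc = sum_j smooth_L1(l_j - g_j): target position offset loss.
    l_loc = F.smooth_l1_loss(box_pred, box_true, reduction="sum")
    # L_landmark = sum_j smooth_L1(l'_j - g'_j): key point position offset loss.
    l_landmark = F.smooth_l1_loss(kpt_pred, kpt_true, reduction="sum")
    # L_attr = -sum_i x'_i log(p'_i); binary cross-entropy is used as a stand-in here.
    l_attr = F.binary_cross_entropy_with_logits(attr_logits, attr_target, reduction="sum")
    # Claim 3 as reconstructed: weighted sum normalized by the number of suggestion boxes.
    return (alpha * l_conf + beta * l_loc + gamma * l_landmark + delta * l_attr) / n

def select_suggestion_boxes(priors_xyxy, gt_xyxy, iou_threshold=0.5):
    """Claim 2: keep prior frames whose intersection ratio with a real frame is high enough."""
    iou = box_iou(priors_xyxy, gt_xyxy)          # [num_priors, num_real_frames]
    best_iou, best_gt = iou.max(dim=1)           # best-matching real frame for each prior
    keep = best_iou >= iou_threshold             # these prior frames become suggestion boxes
    return keep, best_gt

def decode_and_filter(box_offsets, scores, priors_cxcywh, score_thr=0.4, iou_thr=0.5):
    """Claim 5: decode prediction output into frame data, then filter to target frame data."""
    # Decode offsets relative to the prior frames (center-size encoding assumed).
    cxcy = priors_cxcywh[:, :2] + box_offsets[:, :2] * priors_cxcywh[:, 2:]
    wh = priors_cxcywh[:, 2:] * torch.exp(box_offsets[:, 2:])
    boxes = torch.cat([cxcy - wh / 2, cxcy + wh / 2], dim=1)  # to (x1, y1, x2, y2)
    keep = scores > score_thr                    # confidence filtering
    boxes, scores = boxes[keep], scores[keep]
    return boxes[nms(boxes, scores, iou_thr)]    # non-maximum suppression
```

In this sketch the summed per-coordinate smooth L1 terms and the normalization by $N$ follow the wording of the claims; other reductions, such as averaging the type confidence loss only over matched positives, would be equally consistent with that wording.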
CN202110180875.4A 2021-02-10 2021-02-10 Target detection method, target detection device, electronic equipment and storage medium Active CN112528977B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110180875.4A CN112528977B (en) 2021-02-10 2021-02-10 Target detection method, target detection device, electronic equipment and storage medium
PCT/CN2021/111385 WO2022170742A1 (en) 2021-02-10 2021-08-09 Target detection method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110180875.4A CN112528977B (en) 2021-02-10 2021-02-10 Target detection method, target detection device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112528977A CN112528977A (en) 2021-03-19
CN112528977B (en) 2021-07-02

Family

ID=74975739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110180875.4A Active CN112528977B (en) 2021-02-10 2021-02-10 Target detection method, target detection device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112528977B (en)
WO (1) WO2022170742A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528977B (en) * 2021-02-10 2021-07-02 北京优幕科技有限责任公司 Target detection method, target detection device, electronic equipment and storage medium
CN112766244B (en) * 2021-04-07 2021-06-08 腾讯科技(深圳)有限公司 Target object detection method and device, computer equipment and storage medium
CN113065568A (en) * 2021-04-09 2021-07-02 神思电子技术股份有限公司 Target detection, attribute identification and tracking method and system
CN113591567A (en) * 2021-06-28 2021-11-02 北京百度网讯科技有限公司 Target detection method, training method of target detection model and device thereof
CN113408502B (en) * 2021-08-19 2021-12-21 深圳市信润富联数字科技有限公司 Gesture recognition method and device, storage medium and electronic equipment
CN113963167B (en) * 2021-10-29 2022-05-27 北京百度网讯科技有限公司 Method, device and computer program product applied to target detection
CN114418901B (en) * 2022-03-30 2022-08-09 江西中业智能科技有限公司 Image beautifying processing method, system, storage medium and equipment based on Retinaface algorithm
CN115376093A (en) * 2022-10-25 2022-11-22 苏州挚途科技有限公司 Object prediction method and device in intelligent driving and electronic equipment
CN115880717B (en) * 2022-10-28 2023-11-17 北京此刻启动科技有限公司 Heat map key point prediction method and device, electronic equipment and storage medium
CN115661577B (en) * 2022-11-01 2024-04-16 吉咖智能机器人有限公司 Method, apparatus and computer readable storage medium for object detection
CN118053136A (en) * 2022-11-16 2024-05-17 华为技术有限公司 Target detection method, device and storage medium
CN115512188A (en) * 2022-11-24 2022-12-23 苏州挚途科技有限公司 Multi-target detection method, device, equipment and medium
CN115861839B (en) * 2022-12-06 2023-08-29 平湖空间感知实验室科技有限公司 Weak and small target detection method and system for geostationary orbit and electronic equipment
CN116246128B (en) * 2023-02-28 2023-10-27 深圳市锐明像素科技有限公司 Training method and device of detection model crossing data sets and electronic equipment
CN117029673B (en) * 2023-07-12 2024-05-10 中国科学院水生生物研究所 Fish body surface multi-size measurement method based on artificial intelligence

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5849558B2 (en) * 2011-09-15 2016-01-27 オムロン株式会社 Image processing apparatus, image processing method, control program, and recording medium
CN111914861A (en) * 2019-05-08 2020-11-10 北京字节跳动网络技术有限公司 Target detection method and device
CN110674748B (en) * 2019-09-24 2024-02-13 腾讯科技(深圳)有限公司 Image data processing method, apparatus, computer device, and readable storage medium
CN112084860A (en) * 2020-08-06 2020-12-15 中国科学院空天信息创新研究院 Target object detection method and device and thermal power plant detection method and device
CN112528977B (en) * 2021-02-10 2021-07-02 北京优幕科技有限责任公司 Target detection method, target detection device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066932A (en) * 2017-01-16 2017-08-18 北京龙杯信息技术有限公司 The detection of key feature points and localization method in recognition of face
CN110163187A (en) * 2019-06-02 2019-08-23 东北石油大学 Remote road traffic sign detection recognition methods based on F-RCNN
CN110363124A (en) * 2019-07-03 2019-10-22 广州多益网络股份有限公司 Rapid expression recognition and application method based on face key points and geometric deformation
CN110647834A (en) * 2019-09-18 2020-01-03 北京市商汤科技开发有限公司 Human face and human hand correlation detection method and device, electronic equipment and storage medium
CN111666839A (en) * 2020-05-25 2020-09-15 东华大学 Road pedestrian detection system based on improved Faster RCNN
CN111626200A (en) * 2020-05-26 2020-09-04 北京联合大学 Multi-scale target detection network and traffic identification detection method based on Libra R-CNN

Also Published As

Publication number Publication date
WO2022170742A1 (en) 2022-08-18
CN112528977A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN112528977B (en) Target detection method, target detection device, electronic equipment and storage medium
CN111368685B (en) Method and device for identifying key points, readable medium and electronic equipment
CN109508681A (en) The method and apparatus for generating human body critical point detection model
CN109858333B (en) Image processing method, image processing device, electronic equipment and computer readable medium
US11704357B2 (en) Shape-based graphics search
CN111369427A (en) Image processing method, image processing device, readable medium and electronic equipment
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN109325996B (en) Method and device for generating information
CN113177472A (en) Dynamic gesture recognition method, device, equipment and storage medium
CN112232311B (en) Face tracking method and device and electronic equipment
CN110349161A (en) Image partition method, device, electronic equipment and storage medium
CN114511661A (en) Image rendering method and device, electronic equipment and storage medium
CN110110666A (en) Object detection method and device
CN111209856B (en) Invoice information identification method and device, electronic equipment and storage medium
CN113762109B (en) Training method of character positioning model and character positioning method
CN114332590A (en) Joint perception model training method, joint perception device, joint perception equipment and medium
CN110110696A (en) Method and apparatus for handling information
CN110288691B (en) Method, apparatus, electronic device and computer-readable storage medium for rendering image
CN111741329A (en) Video processing method, device, equipment and storage medium
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN111353470B (en) Image processing method and device, readable medium and electronic equipment
CN115424060A (en) Model training method, image classification method and device
CN111968030B (en) Information generation method, apparatus, electronic device and computer readable medium
CN114022658A (en) Target detection method, device, storage medium and terminal
CN113762260A (en) Method, device and equipment for processing layout picture and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant