CN112528977B - Target detection method, target detection device, electronic equipment and storage medium - Google Patents

Target detection method, target detection device, electronic equipment and storage medium

Info

Publication number
CN112528977B
CN112528977B (granted publication of application CN202110180875.4A)
Authority
CN
China
Prior art keywords
feature
fusion
picture
detection
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110180875.4A
Other languages
Chinese (zh)
Other versions
CN112528977A (en)
Inventor
李超超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youmu Technology Co ltd
Original Assignee
Beijing Youmu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youmu Technology Co ltd filed Critical Beijing Youmu Technology Co ltd
Priority to CN202110180875.4A
Publication of CN112528977A
Application granted
Publication of CN112528977B
Priority to PCT/CN2021/111385 (published as WO2022170742A1)
Legal status: Active (current)
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/25 - Fusion techniques
    • G06F18/253 - Fusion techniques of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/46 - Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 - Salient features, e.g. scale invariant feature transforms [SIFT]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a target detection method, a target detection device, electronic equipment and a storage medium, wherein the target detection method comprises the following steps: acquiring a feature extraction network, a feature fusion network and a multi-task detector which are obtained by utilizing a fusion picture set to perform model training, wherein the fusion picture set is obtained by utilizing each single-task detection model to detect and mark the position, the type, the attribute and the key point of a preset target in each picture in a preset picture set; extracting the features of the picture to be detected through a feature extraction network to obtain a plurality of original feature maps with different sizes; performing feature fusion on the original feature maps through a feature fusion network to obtain a first preset number of fusion feature maps with different sizes; and performing feature detection on the fusion feature map through a multitask detector to obtain the position, type, attribute and key point of the target to be detected. The embodiment of the invention can reduce the workload of model training, realize multi-task one-stop detection and improve the detection efficiency.

Description

Target detection method, target detection device, electronic equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to a target detection method, a target detection device, electronic equipment and a storage medium.
Background
In traditional target detection, a detector can only perform a single task, for example detecting the position of a target. To detect a plurality of tasks at the same time, a plurality of detectors need to be trained, and the plurality of tasks are then carried out sequentially and serially by the plurality of detectors, so that the model training workload is large and the overall detection efficiency is low.
Disclosure of Invention
The embodiment of the invention provides a target detection method, a target detection device, electronic equipment and a storage medium, which can reduce the workload of model training, realize multi-task one-stop detection and improve the detection efficiency.
In a first aspect, an embodiment of the present invention provides a target detection method, including:
acquiring a feature extraction network, a feature fusion network and a multi-task detector which are obtained by utilizing a fusion picture set to perform model training, wherein the fusion picture set is obtained by utilizing each single-task detection model to detect and mark the position, the type, the attribute and the key point of a preset target in each picture in a preset picture set;
extracting the features of the picture to be detected through the feature extraction network to obtain a plurality of original feature maps with different sizes;
performing feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
and performing feature detection on the fusion feature map through the multitask detector to obtain the position, type, attribute and key point of the target to be detected.
In a second aspect, an embodiment of the present invention provides an object detection apparatus, including:
an acquisition module, used for acquiring a feature extraction network, a feature fusion network and a multi-task detector which are obtained by performing model training with a fused picture set, wherein the fused picture set is obtained by using each single-task detection model to detect and mark the position, the type, the attribute and the key points of a preset target in each picture in the preset picture set;
the extraction module is used for extracting the features of the picture to be detected through the feature extraction network to obtain a plurality of original feature maps with different sizes;
the fusion module is used for carrying out feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
and the detection module is used for carrying out feature detection on the fusion feature map through the multitask detector to obtain the position, the type, the attribute and the key point of the target to be detected.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the object detection method according to the embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements an object detection method according to an embodiment of the present invention.
In the embodiment of the invention, a model trained based on a fused picture set for multi-task detection can be obtained, the fused picture set is obtained by detecting and marking the position, the type, the attribute and the key point of a preset target in each picture in the preset picture set by utilizing each single-task detection model, multi-task one-stop detection is realized based on the model obtained by training, and the detection efficiency is improved; in addition, the model training of the embodiment of the invention is based on the fusion picture set, the multi-task detector is trained in one step, and a plurality of detectors for detecting a plurality of tasks do not need to be trained respectively, so that the workload of the model training is reduced; furthermore, in the detection process, the multi-task shares the feature extraction network and the feature fusion network, so that the network utilization rate is improved, the calculated amount is reduced, and the overall detection efficiency is improved; in addition, target detection is carried out based on the fusion characteristic graphs of different sizes, targets of different sizes can be detected, and detection accuracy is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained according to the drawings without inventive efforts.
Fig. 1 is a schematic flow chart of a target detection method according to an embodiment of the present invention.
Fig. 2 is a schematic sub-flow chart of a target detection method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram illustrating an effect of the target detection method according to the embodiment of the present invention.
Fig. 4 is a flowchart illustrating an obtaining method of a fused picture set according to an embodiment of the present invention.
Fig. 5 is a schematic diagram illustrating an effect of a training picture according to an embodiment of the present invention.
Fig. 6 is a schematic flowchart of a model training method according to an embodiment of the present invention.
Fig. 7 is a network diagram of a model training process according to an embodiment of the present invention.
Fig. 8 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
Fig. 1 is a schematic flowchart of an object detection method according to an embodiment of the present invention, which may be implemented by an object detection apparatus according to an embodiment of the present invention, where the apparatus may be implemented in software and/or hardware. In a specific embodiment, the apparatus may be integrated into an electronic device, and the electronic device may be a mobile phone, a Personal Computer (PC), a tablet Computer, a notebook Computer, a desktop Computer, or the like. The following embodiments will be described taking as an example the integration of the device in an electronic apparatus. Referring to fig. 1, the method may specifically include the following steps:
step 101, obtaining a feature extraction network, a feature fusion network and a multi-task detector which are obtained by performing model training by using a fusion picture set, wherein the fusion picture set is obtained by detecting and marking the position, the type, the attribute and the key point of a preset target in each picture in the preset picture set by using each single-task detection model.
For example, a single-task detection model is a model for detecting a single task; it may be an existing model selected according to the actual detection requirements, or a model trained according to the actual detection requirements. The preset target may be set according to the actual detection requirements; for example, if a model for performing target detection on a speech scene is to be trained, the preset target may include human faces and hands. Illustratively, the single-task detection models may include: a model for detecting and positioning the human face, a model for detecting and positioning gestures, a model for detecting the attributes of the human face, and a model for detecting key points of the human face. The attributes of the face include, for example, expression, gaze, gender and age, and the key points of the face include, for example, the eyes, mouth and nose.
Specifically, the preset picture set may be a set formed by a large number of pictures collected from an open platform and related to a scene to be detected, and after each picture in the preset picture set is detected and labeled by each single task detection model, each picture in the fused picture set is a picture labeled with a position, a type, an attribute and a key point of a target.
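For illustration only, a single entry in such a fused picture set might be organized as a record like the one below. This is a hypothetical sketch: the field names and values are illustrative, and the embodiment does not prescribe any particular storage format.

```python
# Hypothetical annotation record for one picture in the fused picture set.
# All field names and values are illustrative only.
fused_annotation = {
    "image": "speech_frame_0001.jpg",
    "targets": [
        {
            "type": "face",                        # from the face detection and positioning model
            "box": [120, 64, 260, 210],            # x1, y1, x2, y2
            "attributes": {"expression": "smile",  # from the face attribute detection model
                           "gender": "female"},
            "keypoints": {"left_eye": [160, 110],  # from the face key point detection model
                          "right_eye": [215, 112],
                          "nose": [188, 150],
                          "mouth": [190, 185]},
        },
        {
            "type": "hand",                        # from the gesture detection and positioning model
            "box": [300, 220, 380, 310],
        },
    ],
}
```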
Step 102, performing feature extraction on the picture to be detected through a feature extraction network to obtain a plurality of original feature maps with different sizes.
For example, the picture to be detected may be an independent picture, or may be a series of pictures taken from a video, for example, the picture to be detected may be from a video generated by real-time recording, or may be from a pre-generated video.
Specifically, the pictures to be detected can be input into the feature extraction network, so that pooling and sampling operations are performed on each picture by using the feature extraction network, and a plurality of original feature maps with different sizes corresponding to the pictures to be detected are obtained. In a specific embodiment, the feature graphs extracted by the feature extraction network are sequentially reduced in size from top to bottom, and in order to improve the processing efficiency, a first preset number of feature graphs can be selected from the middle-lower layer feature graphs extracted by the feature extraction network to serve as original feature graphs. The first preset number can be a user-defined value according to an actual situation, and can be, for example, 2 or 3.
And 103, performing feature fusion on the original feature maps through a feature fusion network to obtain a first preset number of fusion feature maps with different sizes.
The original feature maps with different sizes are subjected to feature fusion, so that the semantics of the feature maps can be enhanced, and the prediction accuracy is improved. The specific feature fusion method may, for example, unify the number of channels of each original feature map, start from the original feature map at the lowest layer, perform an upsampling operation on the original feature map at the lower layer, and then perform feature addition, convolution fusion and other operations on the original feature map at the adjacent upper layer, so as to obtain a fused feature map.
Step 104, performing feature detection on the fusion feature maps through a multi-task detector to obtain the position, type, attribute and key points of the target to be detected.
Specifically, the method for detecting features by using a multi-tasking detector can be shown in fig. 2, and includes the following steps:
step 1041, inputting the fused feature map into a multitask detector to obtain predicted output data.
And 1042, decoding the prediction output data to obtain prediction frame data.
The prediction frame data may include prediction frame position data, type data, key point data and attribute data, where the position data may include the center point coordinates and the width and height of the prediction frame, the type data may include a type name (such as face or hand) and a type confidence, the key point data may include the coordinates of the respective key points, and the attribute data may include an attribute name (such as smile, anger or embarrassment) and an attribute confidence.
Step 1043, filtering the prediction frame data to obtain target frame data;
for example, the prediction box data with the confidence lower than the preset confidence threshold may be filtered, and the NMS algorithm may be applied to the remaining prediction box data to filter the prediction box data with a large overlap degree, so as to finally obtain the target box data.
Step 1044, marking the original feature map according to the target frame data to obtain the position, type, attribute and key points of the target to be detected.
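As a rough illustration of the filtering in step 1043, the following Python sketch keeps only the prediction frames whose confidence reaches a threshold and then applies greedy non-maximum suppression to drop highly overlapping frames. The threshold values are illustrative assumptions, and the offset decoding of step 1042 (which depends on the prior-box parameterization) is omitted.

```python
import numpy as np

def filter_prediction_boxes(pred_boxes, pred_scores, conf_thresh=0.5, iou_thresh=0.45):
    """Sketch of step 1043: confidence filtering followed by greedy NMS.

    pred_boxes : (N, 4) array of decoded boxes as (x1, y1, x2, y2)
    pred_scores: (N,)   array of type confidences
    """
    keep = pred_scores >= conf_thresh                      # drop low-confidence predictions
    boxes, scores = pred_boxes[keep], pred_scores[keep]

    order = scores.argsort()[::-1]                         # process highest score first
    selected = []
    while order.size > 0:
        i = order[0]
        selected.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        overlap = inter / (area_i + areas - inter)         # IoU with the remaining boxes
        order = order[1:][overlap <= iou_thresh]           # discard boxes that overlap too much
    return boxes[selected], scores[selected]
```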
When the picture to be detected comes from the video, the target detection method is sequentially executed for each image in the video, and dynamic and continuous detection results can be seen in the video.
For example, for a speech video, the detection results of various speech indexes can be dynamically displayed in the video through the detection method provided by the embodiment of the invention; for a real-time speech scene, the speaker can adjust his or her delivery according to the detection results presented in the video, so as to achieve a good speech effect. Furthermore, the speaker can be scored according to the detection results.
In a specific embodiment, as shown in fig. 3, the detection result presented by using the target detection method provided by the embodiment of the present invention may be represented by a picture in which a detection frame of a Face and a type thereof (Face), a detection frame of a gesture and a type thereof (Hand) are presented, and attributes (e.g., expression) and key points (e.g., eyes, nose, mouth) of the Face are labeled in the detection frame of the Face, and numbers in the figure represent confidence.
In the embodiment of the invention, a method for detecting and labeling the preset picture set by using a plurality of single-task detection models can be used for obtaining the fusion picture set, a model for multi-task detection is trained based on the fusion picture set, multi-task one-stop detection is realized based on the trained model, and the detection efficiency is improved; in addition, the model training of the embodiment of the invention is based on the fusion picture set, the multi-task detector is trained in one step, and a plurality of detectors for detecting a plurality of tasks do not need to be trained respectively, so that the workload of the model training is reduced, and convenience is provided for the model transformation and deployment; furthermore, in the detection process, the multi-task shares the feature extraction network and the feature fusion network, so that the network utilization rate is improved, the calculated amount is reduced, and the overall detection efficiency is improved; in addition, target detection is carried out based on the fusion characteristic graphs of different sizes, targets of different sizes can be detected, and detection accuracy is improved.
In a specific embodiment, as shown in fig. 4, the fused picture set can be obtained as follows:
step 201, detecting and marking the face and the position in each picture in the preset picture set by using a face detection and positioning model.
For example, the face detection and positioning model may be used to detect the face in each picture in the preset picture set, and after the face is detected, the position of the face may be framed and the type may be labeled.
Step 202, detecting and marking the gesture and the position in each picture in the preset picture set by using a gesture detection and positioning model.
For example, a hand in each picture in the preset picture set may be detected by using the gesture detection and positioning model, and after the hand is detected, a frame may be added to the position of the hand and the type may be labeled.
Step 203, detecting and marking the face attributes in each picture in the preset picture set by using the face attribute detection model.
For example, the face attribute detection model may be used to detect the face attributes in each picture in the preset picture set, such as expression (e.g., smile, laugh, embarrassment), gaze (e.g., point view, virtual view, and circular view), gender, age, and the like, and mark the attributes at the corresponding positions of the pictures according to the detection results.
Step 204, detecting and marking the face key points in each picture in the preset picture set by using a face key point detection model.
For example, a face key point detection model may be used to detect face key points, such as eyes, mouth, nose, etc., in each picture in a preset picture set, and mark key points at corresponding positions of the pictures according to the detection result.
After the position, the type, the attribute and the key points of the preset target in each picture in the preset picture set are detected and marked, the fused picture set is obtained. For example, fig. 5 shows one picture in the fused picture set; in fig. 5, the labeled types include Face and Hand, the labeled attribute is the expression No-smile, and the labeled key points are the eyes.
In a specific embodiment, as shown in fig. 6, the method for obtaining the feature extraction network, the feature fusion network, and the multi-task detector by using the fusion image set to perform model training may be as follows:
step 301, determining a real frame of a preset target according to the label in the fusion picture set.
In a specific embodiment, each preset target corresponds to one real frame (ground truth), and the determined real frame may include a position tag (which may include the center point coordinates and the width and height of the real frame), a type tag, an attribute tag and a key point tag.
Step 302, inputting each picture in the fused picture set into an initial feature extraction network to obtain a plurality of original training feature maps with different sizes corresponding to each picture.
In a specific implementation, the initial feature extraction network may be a lightweight feature extraction network, including but not limited to networks such as MobileNet, ShuffleNet, SqueezeNet, and the like, and various iterative versions thereof. In order to meet the requirements of multiple aspects such as detection speed, model size, detection precision and the like, in the embodiment of the invention, the MobileNet-V2 can be selected as an initial feature extraction network, and the width scaling factor of the MobileNet-V2 is set to be 0.35, so that the parameter quantity of the initial feature extraction network is reduced to about 30M, and a foundation is laid for realizing the lightweight and low-delay performance of a mobile terminal model.
Exemplarily, each picture in the fused picture set is input into an initial feature extraction network, so that pooling and sampling operations are performed on each picture by using the initial feature extraction network, and a plurality of original training feature maps with different sizes corresponding to each picture are obtained.
For example, the size of the picture input into the initial feature extraction network is (224, 224, 3), and after feature extraction by the initial feature extraction network, feature maps of 1/2, 1/4, 1/8, 1/16, and 1/32, i.e., feature maps of (112, 112, 16), (56, 56, 24), (28, 28, 32), (14, 14, 96), (7, 7, 160) of the original image, can be obtained, where a large-size feature map can be used to detect a small-size object, and a small-size feature map can be used to detect a large-size object. In order to improve the processing efficiency, the feature map of the middle and lower layers extracted by the initial feature extraction network may be selected as the original training feature map, and for example, the feature maps with the sizes of (28, 28, 32), (14, 14, 96), (7, 7, 160) may be used as the original training feature map.
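As a hedged illustration of the multi-scale behaviour described above, the following PyTorch sketch uses plain strided convolutions as a stand-in for the MobileNet-V2 backbone; it only reproduces the three output sizes quoted in the text and is not the actual network of the embodiment.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in for the initial feature extraction network.

    From a 224x224 input it returns feature maps at 1/8, 1/16 and 1/32
    resolution with the channel counts quoted in the text (32, 96, 160).
    The real network would use MobileNet-V2 inverted-residual blocks with
    a 0.35 width scaling factor.
    """
    def __init__(self):
        super().__init__()
        self.stage8 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=8, padding=1), nn.ReLU())
        self.stage16 = nn.Sequential(nn.Conv2d(32, 96, 3, stride=2, padding=1), nn.ReLU())
        self.stage32 = nn.Sequential(nn.Conv2d(96, 160, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        c3 = self.stage8(x)    # (B, 32, 28, 28) for a 224x224 input
        c4 = self.stage16(c3)  # (B, 96, 14, 14)
        c5 = self.stage32(c4)  # (B, 160, 7, 7)
        return c3, c4, c5

feats = ToyBackbone()(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in feats])  # the three original training feature maps
```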
Step 303, inputting the original training feature maps into the initial feature fusion network to obtain a first preset number of fusion training feature maps with different sizes corresponding to each picture.
For example, the initial feature fusion network may be a feature pyramid network, that is, the feature pyramid network may be used to perform feature fusion on original training feature maps of different sizes, so as to enhance the semantics of the feature maps, and obtain a first preset number of fused training feature maps of different sizes.
For example, as shown in fig. 7, the original training feature maps are (28, 28, 32), (14, 14, 96) and (7, 7, 160), the original training feature maps with the scales of (28, 28, 32), (14, 14, 96) and (7, 7, 160) may be first convolved by 1 × 1, the number of channels may be uniformly adjusted to 64, feature maps with the scales of (28, 28, 64), (14, 14, 64) and (7, 7, 64) may be obtained, and the feature map with the scale of (7, 7, 64) may be directly used as one fused training feature map.
An up-sampling operation is then performed on the (7, 7, 64) feature map to obtain a (14, 14, 64) feature map, and this map is subjected to feature addition and convolution fusion with the (14, 14, 64) feature map obtained after channel adjustment, so as to obtain a fused training feature map with the size of (14, 14, 64).
Similarly, the fused (14, 14, 64) feature map is up-sampled to (28, 28, 64) and then subjected to feature addition and convolution fusion with the (28, 28, 64) feature map obtained after channel adjustment, so as to obtain a fused training feature map with the size of (28, 28, 64). That is, in the example of fig. 7, there are 3 output fused training feature maps.
It should be noted that the above-described feature fusion method is only an example, and in practical applications, other feature fusion methods may also be adopted, and are not specifically limited herein.
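For concreteness, the pyramid-style fusion described above might be sketched as follows. The channel numbers and map sizes follow the example in the text; the layer names are hypothetical, and normalisation, activations and other details of a real implementation are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionSketch(nn.Module):
    """Minimal sketch of the feature fusion: 1x1 convolutions unify the
    channel number to 64, the smaller map is upsampled, added to the next
    larger map and smoothed with a 3x3 convolution."""
    def __init__(self, in_channels=(32, 96, 160), mid=64):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, mid, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(mid, mid, 3, padding=1) for _ in in_channels[:-1]])

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)                                    # (B, 64, 7, 7)
        p4 = self.smooth[1](self.lateral[1](c4) +
                            F.interpolate(p5, scale_factor=2))      # (B, 64, 14, 14)
        p3 = self.smooth[0](self.lateral[0](c3) +
                            F.interpolate(p4, scale_factor=2))      # (B, 64, 28, 28)
        return p3, p4, p5

c3 = torch.randn(1, 32, 28, 28)
c4 = torch.randn(1, 96, 14, 14)
c5 = torch.randn(1, 160, 7, 7)
print([tuple(p.shape) for p in FusionSketch()(c3, c4, c5)])  # three fused training feature maps
```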
And step 304, setting a second preset number of different-size prior frames at each pixel point of the fusion training feature map.
In one possible implementation, the prior box may be a rectangular box, and the prior box may be set as follows: determining the size of the fusion training feature maps, determining the prior frame size for each fusion training feature map according to the preset relation between the prior frame size and the feature map size, and then setting the prior frames with different sizes in a second preset number on each pixel point of the corresponding fusion training feature maps according to the prior frame size determined for each fusion training feature map and the preset prior frame number (namely, the second preset number). The second preset number can be a user-defined value according to an actual situation, and can be, for example, 2 or 3. In the example shown in fig. 7, the number of the prior frames set on each pixel point of the fused training feature map is 2.
In some embodiments, the prior frame may also be set in other manners, such as according to the size of a preset target, which is not specifically limited herein.
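A minimal sketch of the prior-box placement in step 304 is given below. The boxes are drawn square for simplicity and the per-map sizes are purely illustrative, since the embodiment only states that the sizes follow a preset relation to the feature-map size.

```python
import numpy as np

def make_prior_boxes(feature_size, image_size, box_sizes):
    """Place len(box_sizes) square prior boxes at every pixel of one fused
    training feature map; box_sizes are given in image pixels."""
    stride = image_size / feature_size
    priors = []
    for y in range(feature_size):
        for x in range(feature_size):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # box centre in image coordinates
            for s in box_sizes:
                priors.append([cx, cy, s, s])                # (cx, cy, w, h)
    return np.array(priors)

# two prior boxes per pixel on the 28x28, 14x14 and 7x7 fused feature maps
for fs, sizes in [(28, (16, 32)), (14, (64, 96)), (7, (128, 192))]:
    print(fs, make_prior_boxes(fs, 224, sizes).shape)
```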
Step 305, calculating the intersection ratio of each prior frame and the real frame.
For example, the intersection ratio may be calculated according to the following formula:
IoU(A, B) = area(A ∩ B) / area(A ∪ B)

where IoU denotes the intersection ratio, A denotes the prior box, and B denotes the real box.
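For example, with boxes given by their corner coordinates, the intersection ratio can be computed as in the following plain-Python sketch.

```python
def iou(box_a, box_b):
    """Intersection ratio of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 ≈ 0.143
```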
Step 306, selecting suggestion boxes from the prior boxes according to the intersection ratio.
Because each pixel point of the fused training feature map is provided with a second preset number of prior frames of different sizes, the total number of prior frames is large. In order to improve the processing efficiency, suggestion frames can be selected from the prior frames according to the intersection ratio. The specific selection method can be as follows:
(1) selecting the prior frames whose intersection ratio is greater than a preset threshold as suggestion frames, where the preset threshold can be set in advance according to the actual situation, for example 0.6 or 0.7;
(2) selecting the prior frame with the largest intersection ratio as a suggestion frame.
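Assuming the intersection ratios between all prior boxes and all real boxes have already been computed, the two selection rules might be applied as in the following sketch; the threshold of 0.6 is one of the example values given above.

```python
import numpy as np

def select_suggestion_boxes(iou_matrix, threshold=0.6):
    """Pick suggestion boxes from the prior boxes for each real box.

    iou_matrix: (num_priors, num_real_boxes) array of intersection ratios.
    Rule 1 keeps every prior box whose IoU exceeds the threshold; rule 2
    always keeps the prior box with the largest IoU, so each real box has
    at least one suggestion box.
    """
    suggestions = {}
    for j in range(iou_matrix.shape[1]):
        above = {int(i) for i in np.flatnonzero(iou_matrix[:, j] > threshold)}
        above.add(int(iou_matrix[:, j].argmax()))   # best-matching prior box
        suggestions[j] = sorted(above)
    return suggestions

ious = np.array([[0.72, 0.10],
                 [0.40, 0.05],
                 [0.65, 0.55]])
print(select_suggestion_boxes(ious))  # {0: [0, 2], 1: [2]}
```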
Specifically, each preset target may correspond to one real box and at least one suggestion box, and in the training stage, target detection may be performed based on the suggestion box corresponding to each preset target.
Step 307, inputting the fused training feature map into the initial multitask detector, so that the initial multitask detector performs feature detection based on the suggestion boxes to obtain training output data.
Step 308, calculating a loss function based on the training output data and the real box.
Specifically, the real frame may be understood as a frame on the original picture, and the real frame may be encoded onto the fused training feature map. A type confidence loss function L_conf, a target position offset loss function L_loc, a key point position offset loss function L_landmark and an attribute confidence loss function L_attr may then be calculated according to the training output data and the encoded real frame, and the loss function L is calculated according to L_conf, L_loc, L_landmark and L_attr.

The type confidence loss function is a cross-entropy loss over the types:

L_conf = -Σ_{i=1}^{c} y_i · log(p_i)

where c represents the number of types (particularly, in the embodiment of the invention c takes the value 3, namely the three types of human face, hand and background), y_i indicates whether the sample belongs to the i-th type (0 or 1), and p_i represents the predicted probability of belonging to the i-th type.

The target position offset loss function is a smooth L1 loss:

L_loc = Σ_{j=1}^{m} smooth_L1(l_j - g_j),  with  smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise

where m represents the number of coordinate values corresponding to the upper-left and lower-right corners of the frame and usually takes the value 4, smooth_L1 represents the smooth loss function, l_j represents the prediction offset corresponding to the suggestion frame, g_j represents the true offset corresponding to the suggestion frame, and the argument x = l_j - g_j is the difference between the real coordinates and the predicted coordinates of the preset target. The prediction offset l_j can be obtained from the position data of the output frame corresponding to the suggestion frame and the position data corresponding to the real frame, and the true offset g_j can be obtained from the position data of the suggestion frame and the position data corresponding to the real frame.

The key point position offset loss function takes the same smooth L1 form:

L_landmark = Σ_{j=1}^{k} smooth_L1(l'_j - g'_j)

where k represents the number of coordinate values of the key points, l'_j represents the key point prediction offset corresponding to the suggestion frame, g'_j represents the key point true offset corresponding to the suggestion frame, and the argument of smooth_L1 is the difference between the real coordinates and the predicted coordinates of the key points.

The attribute confidence loss function is a cross-entropy loss over the attributes:

L_attr = -Σ_{i} a_i · log(q_i)

where a_i indicates whether the sample belongs to the i-th attribute (0 or 1), and q_i represents the predicted probability of belonging to the i-th attribute.

The loss function L is then:

L = (α·L_conf + β·L_loc + γ·L_landmark + δ·L_attr) / N

where N represents the number of suggestion frames, and α, β, γ and δ represent the weights; each weight can be set to a custom value according to the actual situation.
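A hedged PyTorch sketch of this combined loss is given below. The cross-entropy and smooth-L1 forms follow the formulas above; treating the attribute loss as a multi-label binary cross-entropy is an assumption, and the weights and reductions are placeholders.

```python
import torch.nn.functional as F

def total_loss(cls_logits, cls_target, loc_pred, loc_target,
               lm_pred, lm_target, attr_logits, attr_target,
               n_boxes, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Sketch of L = (alpha*L_conf + beta*L_loc + gamma*L_landmark + delta*L_attr) / N."""
    l_conf = F.cross_entropy(cls_logits, cls_target, reduction="sum")      # type confidence loss
    l_loc = F.smooth_l1_loss(loc_pred, loc_target, reduction="sum")        # box offset loss
    l_landmark = F.smooth_l1_loss(lm_pred, lm_target, reduction="sum")     # key point offset loss
    l_attr = F.binary_cross_entropy_with_logits(attr_logits, attr_target,
                                                reduction="sum")           # attribute loss (assumed multi-label)
    return (alpha * l_conf + beta * l_loc + gamma * l_landmark + delta * l_attr) / n_boxes
```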
Step 309, performing back propagation on the loss function to optimize the model parameters, and obtaining a feature extraction network, a feature fusion network and a multi-task detector.
For example, a preset optimization algorithm, such as a random gradient descent algorithm, may be used to perform iterative optimization on the loss function, thereby optimizing the model parameters to obtain a feature extraction network, a feature fusion network, and a multi-task detector.
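For illustration, one optimisation step with stochastic gradient descent might look as follows; the model and loss here are trivial stand-ins for the real networks and the multi-task loss, and the hyper-parameters are placeholders.

```python
import torch

model = torch.nn.Linear(4, 4)                  # stand-in for backbone + fusion network + detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

loss = model(torch.randn(2, 4)).pow(2).mean()  # stand-in for the multi-task loss L
optimizer.zero_grad()
loss.backward()                                # back-propagate the loss function
optimizer.step()                               # optimise the model parameters
```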
Through the design of the multitask detector and the corresponding loss function, the multitask detection process of multiple targets can be fused, and the recall rate and the positioning accuracy of target detection are improved.
In a specific embodiment, the model training process shown in fig. 6 may be executed on a server or on an electronic device. When it is executed on a server, distributed training may be performed with multiple servers; the trained model is then converted into a format supported by the electronic device using a conversion tool, a performance test is performed on the trained model locally on the server using a corresponding interpreter, the model that passes the test and performs well is deployed in a preset application program, and the preset application program is installed on the electronic device, so that target detection is achieved with the deployed model. The whole model training process can be finished end to end at one time to obtain a multi-task detection model, multi-task detection is realized at one time, the workload of model training is reduced, and the difficulty and workload of converting and deploying the model on a mobile terminal are reduced.
Experiments prove that, when the model trained by the embodiment of the invention is used for target detection on electronic equipment, multi-task real-time detection and evaluation of indexes such as human faces, gestures, expressions and gaze in videos can be realized, and the requirements of a lightweight model, low delay and high precision are met. Specifically, on the premise that the detection accuracy is above 90%, the size of the application-side model can be compressed to within 1MB, and the detection speed can reach 20-25 FPS.
Fig. 8 is a schematic structural diagram of an object detection apparatus provided in an embodiment of the present disclosure, and as shown in fig. 8, the apparatus includes:
an obtaining module 401, configured to obtain a feature extraction network, a feature fusion network, and a multi-task detector, where the feature extraction network, the feature fusion network, and the multi-task detector are obtained by performing model training on a fusion image set, and the fusion image set is obtained by performing position, type, attribute, and key point detection and labeling on a preset target in each image in a preset image set by using each single-task detection model;
an extraction module 402, configured to perform feature extraction on a picture to be detected through the feature extraction network to obtain a plurality of original feature maps of different sizes;
a fusion module 403, configured to perform feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
and the detection module 404 is configured to perform feature detection on the fused feature map through the multitask detector to obtain a position, a type, an attribute, and a key point of the target to be detected.
In an embodiment, the fused picture set is obtained as follows:
detecting and marking the face and the position in each picture in the preset picture set by using a face detection and positioning model;
detecting and marking the gesture and the position in each picture in the preset picture set by using a gesture detection and positioning model;
detecting and marking the face attribute in each picture in the preset picture set by using a face attribute detection model; and
and detecting and marking the face key points in each picture in the preset picture set by using a face key point detection model.
In an embodiment, the method for obtaining the feature extraction network, the feature fusion network, and the multi-task detector by performing model training using the fusion image set includes:
determining a real frame of the preset target according to the label in the fusion picture set;
inputting each picture in the fused picture set into an initial feature extraction network to obtain a plurality of original training feature graphs with different sizes corresponding to each picture;
inputting the original training feature maps into an initial feature fusion network to obtain a first preset number of fusion training feature maps with different sizes corresponding to each picture;
setting a second preset number of prior frames with different sizes at each pixel point of the fusion training feature map;
inputting the fusion training feature map into an initial multi-task detector so that the initial multi-task detector performs feature detection based on the prior frame to obtain training output data;
calculating a loss function according to the training output data and the real frame;
and performing back propagation on the loss function to optimize model parameters to obtain the feature extraction network, the feature fusion network and the multitask detector.
In one embodiment, before inputting the fused training feature map into the initial multitask detector, the method further comprises:
calculating the intersection ratio of each prior frame and the real frame;
selecting a suggestion box from the prior boxes according to the intersection ratio;
inputting the fused training feature map into an initial multi-task detector to enable the initial multi-task detector to perform feature detection based on the prior frame to obtain training output data, wherein the method comprises the following steps:
and inputting the fused training feature map into the initial multitask detector, so that the initial multitask detector performs feature detection based on the suggestion box to obtain the training output data.
In one embodiment, the calculating a loss function from the training output data and the real box includes:
calculating a type confidence loss function L_conf, a target position offset loss function L_loc, a key point position offset loss function L_landmark and an attribute confidence loss function L_attr according to the training output data and the real frame; and calculating the loss function L according to L_conf, L_loc, L_landmark and L_attr.

In one embodiment,

L_conf = -Σ_{i=1}^{c} y_i · log(p_i)

where c represents the number of types, y_i indicates whether the sample belongs to the i-th type, and p_i represents the predicted probability of belonging to the i-th type;

L_loc = Σ_{j=1}^{m} smooth_L1(l_j - g_j),  with  smooth_L1(x) = 0.5·x² if |x| < 1, and |x| - 0.5 otherwise

where m represents the number of coordinate values of the frame, smooth_L1 represents the smooth loss function, l_j represents the prediction offset corresponding to the suggestion frame, g_j represents the true offset corresponding to the suggestion frame, and the argument x is the difference between the real coordinates and the predicted coordinates of the preset target;

L_landmark = Σ_{j=1}^{k} smooth_L1(l'_j - g'_j)

where k represents the number of coordinate values of the key points, l'_j represents the key point prediction offset corresponding to the suggestion frame, g'_j represents the key point true offset corresponding to the suggestion frame, and the argument of smooth_L1 is the difference between the real coordinates and the predicted coordinates of the key points;

L_attr = -Σ_{i} a_i · log(q_i)

where a_i indicates whether the sample belongs to the i-th attribute, and q_i represents the predicted probability of belonging to the i-th attribute.

In one embodiment,

L = (α·L_conf + β·L_loc + γ·L_landmark + δ·L_attr) / N

where N represents the number of suggestion frames, and α, β, γ and δ represent the weights.
In one embodiment, the feature extraction network comprises MobileNet-V2 and the feature fusion network comprises a feature pyramid network.
In an embodiment, the detecting module 404 performs feature detection on the fused feature map through the multi-task detector to obtain the position, the type, the attribute, and the key point of the target to be detected, including:
inputting the fusion characteristic diagram into the multitask detector to obtain prediction output data;
decoding the prediction output data to obtain prediction frame data;
filtering the prediction frame data to obtain target frame data;
and marking the original feature map according to the target frame data to obtain the position, the type, the attribute and the key points of the target to be detected.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the above function distribution may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the above described functions. For the specific working process of the functional module, reference may be made to the corresponding process in the foregoing method embodiment, which is not described herein again.
The device of the embodiment of the disclosure can acquire the model trained based on the fused picture set and used for multi-task detection, the fused picture set is obtained by detecting and marking the position, the type, the attribute and the key point of the preset target in each picture in the preset picture set by utilizing each single-task detection model, multi-task one-stop detection is realized based on the model obtained by training, and the detection efficiency is improved; in addition, the model training of the embodiment of the invention is based on the fusion picture set, the multi-task detector is trained in one step, and a plurality of detectors for detecting a plurality of tasks do not need to be trained respectively, so that the workload of the model training is reduced; furthermore, in the detection process, the multi-task shares the feature extraction network and the feature fusion network, so that the network utilization rate is improved, the calculated amount is reduced, and the overall detection efficiency is improved; in addition, target detection is carried out based on the fusion characteristic graphs of different sizes, targets of different sizes can be detected, and detection accuracy is improved.
The embodiment of the present invention further provides a target detection system, which includes an electronic device and a server, where the electronic device may obtain a trained feature extraction network, a feature fusion network, and a multitask detector from the server, and detect a to-be-detected picture based on the feature extraction network, the feature fusion network, and the multitask detector, and a specific detection process may refer to the foregoing embodiments, and details are not described here.
Referring now to FIG. 9, shown is a block diagram of a computer system 500 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 9, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules and/or units described in the embodiments of the present invention may be implemented by software, and may also be implemented by hardware. The described modules and/or units may also be provided in a processor, and may be described as: a processor includes an acquisition module, an extraction module, a fusion module, and a detection module. Wherein the names of the modules do not in some cases constitute a limitation of the module itself.
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: acquiring a feature extraction network, a feature fusion network and a multi-task detector which are obtained by utilizing a fusion picture set to perform model training, wherein the fusion picture set is obtained by utilizing each single-task detection model to detect and mark the position, the type, the attribute and the key point of a preset target in each picture in a preset picture set; extracting the features of the picture to be detected through the feature extraction network to obtain a plurality of original feature maps with different sizes; performing feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes; and performing feature detection on the fusion feature map through the multitask detector to obtain the position, type, attribute and key point of the target to be detected.
According to the technical scheme of the embodiment of the invention, the model trained based on the fusion picture set for multi-task detection can be obtained, the fusion picture set is obtained by detecting and marking the position, the type, the attribute and the key point of the preset target in each picture in the preset picture set by utilizing each single-task detection model, the multi-task one-stop detection is realized based on the model obtained by training, and the detection efficiency is improved; in addition, the model training of the embodiment of the invention is based on the fusion picture set, the multi-task detector is trained in one step, and a plurality of detectors for detecting a plurality of tasks do not need to be trained respectively, so that the workload of the model training is reduced; furthermore, in the detection process, the multi-task shares the feature extraction network and the feature fusion network, so that the network utilization rate is improved, the calculated amount is reduced, and the overall detection efficiency is improved; in addition, target detection is carried out based on the fusion characteristic graphs of different sizes, targets of different sizes can be detected, and detection accuracy is improved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (8)

1. An object detection method for a mobile terminal, comprising:
acquiring a lightweight feature extraction network, a feature fusion network and a multi-task detector obtained by model training with a fusion picture set, wherein the fusion picture set is obtained by using single-task detection models to detect and mark the position, type, attribute and key point of a preset target in each picture of a preset picture set, and is obtained specifically by: detecting and marking the face and its position in each picture of the preset picture set by using a face detection and positioning model; detecting and marking the gesture and its position in each picture of the preset picture set by using a gesture detection and positioning model; detecting and marking the face attribute in each picture of the preset picture set by using a face attribute detection model; and detecting and marking the face key points in each picture of the preset picture set by using a face key point detection model;
extracting the features of the picture to be detected through the lightweight feature extraction network to obtain a plurality of original feature maps with different sizes;
performing feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
performing one-stop feature detection on the fusion feature maps through the multi-task detector, so that multi-task detection is completed in a single pass to obtain the position, type, attribute and key point of a target to be detected;
the method for obtaining the lightweight feature extraction network, the feature fusion network and the multitask detector by utilizing the fusion picture set to carry out model training comprises the following steps:
determining a real frame of the preset target according to the label in the fusion picture set;
inputting each picture in the fused picture set into an initial lightweight feature extraction network to obtain a plurality of original training feature maps with different sizes corresponding to each picture;
inputting the original training feature maps into an initial feature fusion network to obtain a first preset number of fusion training feature maps with different sizes corresponding to each picture;
setting a second preset number of prior frames with different sizes at each pixel point of the fusion training feature map;
inputting the fusion training feature map into an initial multi-task detector so that the initial multi-task detector performs feature detection based on the prior frame to obtain training output data;
calculating a loss function according to the training output data and the real frame;
performing back propagation on the loss function to optimize model parameters to obtain the lightweight feature extraction network, the feature fusion network and the multitask detector;
wherein said calculating a loss function according to the training output data and the real frame comprises:
calculating, according to the training output data and the real frame, a type confidence loss function $L_{conf}$, a target position offset loss function $L_{loc}$, a key point position offset loss function $L_{landmark}$ and an attribute confidence loss function $L_{attr}$, and calculating the loss function $L$ according to $L_{conf}$, $L_{loc}$, $L_{landmark}$ and $L_{attr}$;
wherein $L_{conf} = -\sum_{i=1}^{c} x_i \log(p_i)$, where $c$ represents the number of types, $x_i$ indicates whether or not it belongs to the $i$-th type, and $p_i$ represents the predicted probability value of belonging to the $i$-th type;
$L_{loc} = \sum_{j=1}^{m} \operatorname{smooth}_{L1}(l_j - g_j)$, where $m$ represents the number of coordinate values of the frame, $\operatorname{smooth}_{L1}(\cdot)$ represents the smooth L1 loss function, $l_j$ represents the prediction offset corresponding to the suggestion box, $g_j$ represents the true offset corresponding to the suggestion box, and $l_j - g_j$ is the difference between the real coordinate and the predicted coordinate of the preset target;
$L_{landmark} = \sum_{j=1}^{k} \operatorname{smooth}_{L1}(l'_j - g'_j)$, where $k$ represents the number of coordinate values of the key points, $l'_j$ represents the key point prediction offset corresponding to the suggestion box, $g'_j$ represents the true key point offset corresponding to the suggestion box, and $l'_j - g'_j$ is the difference between the real coordinate and the predicted coordinate of the key point;
$L_{attr} = -\sum_{i} x'_i \log(p'_i)$, where $x'_i$ indicates whether or not it belongs to the $i$-th attribute, and $p'_i$ represents the predicted probability value of belonging to the $i$-th attribute.
2. The object detection method of claim 1, further comprising, before inputting the fusion training feature map into the initial multi-task detector:
calculating the intersection ratio of each prior frame and the real frame;
selecting suggestion boxes from the prior frames according to the intersection ratio;
wherein inputting the fusion training feature map into the initial multi-task detector so that the initial multi-task detector performs feature detection based on the prior frames to obtain training output data comprises:
inputting the fusion training feature map into the initial multi-task detector, so that the initial multi-task detector performs feature detection based on the suggestion boxes to obtain the training output data.
3. The object detection method according to claim 2, wherein the loss function is
$L = \frac{1}{N}\left(\alpha L_{conf} + \beta L_{loc} + \gamma L_{landmark} + \delta L_{attr}\right)$,
wherein $N$ represents the number of the suggestion boxes, and $\alpha$, $\beta$, $\gamma$ and $\delta$ represent the weights.
4. The object detection method of claim 1, wherein the lightweight feature extraction network comprises MobileNet-V2 and the feature fusion network comprises a feature pyramid network.
5. The object detection method according to claim 1, wherein performing feature detection on the fusion feature map through the multi-task detector to obtain the position, type, attribute and key point of the target to be detected comprises:
inputting the fusion feature map into the multi-task detector to obtain prediction output data;
decoding the prediction output data to obtain prediction frame data;
filtering the prediction frame data to obtain target frame data;
and marking the original feature map according to the target frame data to obtain the position, type, attribute and key point of the target to be detected.
6. An object detection apparatus for a mobile terminal, comprising:
an acquisition module, an extraction module, a fusion module and a detection module, wherein the acquisition module is used for acquiring a lightweight feature extraction network, a feature fusion network and a multi-task detector obtained by model training with a fusion picture set, the fusion picture set is obtained by using single-task detection models to detect and mark the position, type, attribute and key point of a preset target in each picture of a preset picture set, and the fusion picture set is obtained specifically by: detecting and marking the face and its position in each picture of the preset picture set by using a face detection and positioning model; detecting and marking the gesture and its position in each picture of the preset picture set by using a gesture detection and positioning model; detecting and marking the face attribute in each picture of the preset picture set by using a face attribute detection model; and detecting and marking the face key points in each picture of the preset picture set by using a face key point detection model;
the extraction module is used for extracting the features of the picture to be detected through the lightweight feature extraction network to obtain a plurality of original feature maps with different sizes;
the fusion module is used for carrying out feature fusion on the original feature maps through the feature fusion network to obtain a first preset number of fusion feature maps with different sizes;
the detection module is used for performing one-stop feature detection on the fusion feature maps through the multi-task detector, so that multi-task detection is completed in a single pass to obtain the position, type, attribute and key point of a target to be detected;
wherein obtaining the lightweight feature extraction network, the feature fusion network and the multi-task detector by model training with the fusion picture set comprises:
determining a real frame of the preset target according to the label in the fusion picture set;
inputting each picture in the fused picture set into an initial lightweight feature extraction network to obtain a plurality of original training feature maps with different sizes corresponding to each picture;
inputting the original training feature maps into an initial feature fusion network to obtain a first preset number of fusion training feature maps with different sizes corresponding to each picture;
setting a second preset number of prior frames with different sizes at each pixel point of the fusion training feature map;
inputting the fusion training feature map into an initial multi-task detector so that the initial multi-task detector performs feature detection based on the prior frame to obtain training output data;
calculating a loss function according to the training output data and the real frame;
performing back propagation on the loss function to optimize model parameters to obtain the lightweight feature extraction network, the feature fusion network and the multitask detector;
wherein said calculating a loss function according to the training output data and the real frame comprises:
calculating, according to the training output data and the real frame, a type confidence loss function $L_{conf}$, a target position offset loss function $L_{loc}$, a key point position offset loss function $L_{landmark}$ and an attribute confidence loss function $L_{attr}$, and calculating the loss function $L$ according to $L_{conf}$, $L_{loc}$, $L_{landmark}$ and $L_{attr}$;
wherein $L_{conf} = -\sum_{i=1}^{c} x_i \log(p_i)$, where $c$ represents the number of types, $x_i$ indicates whether or not it belongs to the $i$-th type, and $p_i$ represents the predicted probability value of belonging to the $i$-th type;
$L_{loc} = \sum_{j=1}^{m} \operatorname{smooth}_{L1}(l_j - g_j)$, where $m$ represents the number of coordinate values of the frame, $\operatorname{smooth}_{L1}(\cdot)$ represents the smooth L1 loss function, $l_j$ represents the prediction offset corresponding to the suggestion box, $g_j$ represents the true offset corresponding to the suggestion box, and $l_j - g_j$ is the difference between the real coordinate and the predicted coordinate of the preset target;
$L_{landmark} = \sum_{j=1}^{k} \operatorname{smooth}_{L1}(l'_j - g'_j)$, where $k$ represents the number of coordinate values of the key points, $l'_j$ represents the key point prediction offset corresponding to the suggestion box, $g'_j$ represents the true key point offset corresponding to the suggestion box, and $l'_j - g'_j$ is the difference between the real coordinate and the predicted coordinate of the key point;
$L_{attr} = -\sum_{i} x'_i \log(p'_i)$, where $x'_i$ indicates whether or not it belongs to the $i$-th attribute, and $p'_i$ represents the predicted probability value of belonging to the $i$-th attribute.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the object detection method as claimed in any one of claims 1 to 5 when executing the program.
8. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the object detection method according to any one of claims 1 to 5.
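For illustration only, the sketch below shows, under stated assumptions, how the loss function recited in claims 1 and 3 (as reconstructed above), the intersection-ratio matching of claim 2 and the decoding and filtering of claim 5 could be expressed; the PyTorch/torchvision environment, the weight and threshold defaults, the use of binary cross-entropy for the attribute term and the center-size box encoding are assumptions of the sketch, not details disclosed in the patent.

```python
# Hedged sketch of three steps recited in the claims: the multi-task loss of
# claims 1 and 3, the prior-frame matching of claim 2, and the decode-and-filter
# post-processing of claim 5. Weights, thresholds and the box encoding are
# illustrative assumptions.
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou, nms

def multitask_loss(type_logits, type_target,     # [N, c] logits, [N] class indices
                   box_pred, box_true,           # [N, m] predicted / true frame offsets
                   kpt_pred, kpt_true,           # [N, k] predicted / true key point offsets
                   attr_logits, attr_target,     # [N, a] logits, [N, a] 0/1 attribute labels
                   alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    n = type_logits.shape[0]                     # N: number of suggestion boxes
    # L_conf = -sum_i x_i log(p_i): type confidence loss (cross-entropy).
    l_conf = F.cross_entropy(type_logits, type_target, reduction="sum")
    # L_loc = sum_j smooth_L1(l_j - g_j): target position offset loss.
    l_loc = F.smooth_l1_loss(box_pred, box_true, reduction="sum")
    # L_landmark = sum_j smooth_L1(l'_j - g'_j): key point position offset loss.
    l_landmark = F.smooth_l1_loss(kpt_pred, kpt_true, reduction="sum")
    # L_attr = -sum_i x'_i log(p'_i); binary cross-entropy is used as a stand-in here.
    l_attr = F.binary_cross_entropy_with_logits(attr_logits, attr_target, reduction="sum")
    # Claim 3 as reconstructed: weighted sum normalized by the number of suggestion boxes.
    return (alpha * l_conf + beta * l_loc + gamma * l_landmark + delta * l_attr) / n

def select_suggestion_boxes(priors_xyxy, gt_xyxy, iou_threshold=0.5):
    """Claim 2: keep prior frames whose intersection ratio with a real frame is high enough."""
    iou = box_iou(priors_xyxy, gt_xyxy)          # [num_priors, num_real_frames]
    best_iou, best_gt = iou.max(dim=1)           # best-matching real frame for each prior
    keep = best_iou >= iou_threshold             # these prior frames become suggestion boxes
    return keep, best_gt

def decode_and_filter(box_offsets, scores, priors_cxcywh, score_thr=0.4, iou_thr=0.5):
    """Claim 5: decode prediction output into frame data, then filter to target frame data."""
    # Decode offsets relative to the prior frames (center-size encoding assumed).
    cxcy = priors_cxcywh[:, :2] + box_offsets[:, :2] * priors_cxcywh[:, 2:]
    wh = priors_cxcywh[:, 2:] * torch.exp(box_offsets[:, 2:])
    boxes = torch.cat([cxcy - wh / 2, cxcy + wh / 2], dim=1)  # to (x1, y1, x2, y2)
    keep = scores > score_thr                    # confidence filtering
    boxes, scores = boxes[keep], scores[keep]
    return boxes[nms(boxes, scores, iou_thr)]    # non-maximum suppression
```

In this sketch the summed per-coordinate smooth L1 terms and the normalization by $N$ follow the wording of the claims; other reductions, such as averaging the type confidence loss only over matched positives, would be equally consistent with that wording.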
CN202110180875.4A 2021-02-10 2021-02-10 Target detection method, target detection device, electronic equipment and storage medium Active CN112528977B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110180875.4A CN112528977B (en) 2021-02-10 2021-02-10 Target detection method, target detection device, electronic equipment and storage medium
PCT/CN2021/111385 WO2022170742A1 (en) 2021-02-10 2021-08-09 Target detection method and apparatus, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110180875.4A CN112528977B (en) 2021-02-10 2021-02-10 Target detection method, target detection device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112528977A CN112528977A (en) 2021-03-19
CN112528977B (en) 2021-07-02

Family

ID=74975739

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110180875.4A Active CN112528977B (en) 2021-02-10 2021-02-10 Target detection method, target detection device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112528977B (en)
WO (1) WO2022170742A1 (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528977B (en) * 2021-02-10 2021-07-02 北京优幕科技有限责任公司 Target detection method, target detection device, electronic equipment and storage medium
CN112766244B (en) * 2021-04-07 2021-06-08 腾讯科技(深圳)有限公司 Target object detection method and device, computer equipment and storage medium
CN113065568A (en) * 2021-04-09 2021-07-02 神思电子技术股份有限公司 Target detection, attribute identification and tracking method and system
CN113591567A (en) * 2021-06-28 2021-11-02 北京百度网讯科技有限公司 Target detection method, training method of target detection model and device thereof
CN113408502B (en) * 2021-08-19 2021-12-21 深圳市信润富联数字科技有限公司 Gesture recognition method and device, storage medium and electronic equipment
CN113963167B (en) * 2021-10-29 2022-05-27 北京百度网讯科技有限公司 Method, device and computer program product applied to target detection
CN114418901B (en) * 2022-03-30 2022-08-09 江西中业智能科技有限公司 Image beautifying processing method, system, storage medium and equipment based on Retinaface algorithm
CN115376093A (en) * 2022-10-25 2022-11-22 苏州挚途科技有限公司 Object prediction method and device in intelligent driving and electronic equipment
CN115880717B (en) * 2022-10-28 2023-11-17 北京此刻启动科技有限公司 Heat map key point prediction method and device, electronic equipment and storage medium
CN115661577B (en) * 2022-11-01 2024-04-16 吉咖智能机器人有限公司 Method, apparatus and computer readable storage medium for object detection
CN118053136A (en) * 2022-11-16 2024-05-17 华为技术有限公司 Target detection method, device and storage medium
CN115512188A (en) * 2022-11-24 2022-12-23 苏州挚途科技有限公司 Multi-target detection method, device, equipment and medium
CN115861839B (en) * 2022-12-06 2023-08-29 平湖空间感知实验室科技有限公司 Weak and small target detection method and system for geostationary orbit and electronic equipment
CN116246128B (en) * 2023-02-28 2023-10-27 深圳市锐明像素科技有限公司 Training method and device of detection model crossing data sets and electronic equipment
CN117029673B (en) * 2023-07-12 2024-05-10 中国科学院水生生物研究所 Fish body surface multi-size measurement method based on artificial intelligence

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5849558B2 (en) * 2011-09-15 2016-01-27 オムロン株式会社 Image processing apparatus, image processing method, control program, and recording medium
CN111914861A (en) * 2019-05-08 2020-11-10 北京字节跳动网络技术有限公司 Target detection method and device
CN110674748B (en) * 2019-09-24 2024-02-13 腾讯科技(深圳)有限公司 Image data processing method, apparatus, computer device, and readable storage medium
CN112084860A (en) * 2020-08-06 2020-12-15 中国科学院空天信息创新研究院 Target object detection method and device and thermal power plant detection method and device
CN112528977B (en) * 2021-02-10 2021-07-02 北京优幕科技有限责任公司 Target detection method, target detection device, electronic equipment and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107066932A (en) * 2017-01-16 2017-08-18 北京龙杯信息技术有限公司 The detection of key feature points and localization method in recognition of face
CN110163187A (en) * 2019-06-02 2019-08-23 东北石油大学 Remote road traffic sign detection recognition methods based on F-RCNN
CN110363124A (en) * 2019-07-03 2019-10-22 广州多益网络股份有限公司 Rapid expression recognition and application method based on face key points and geometric deformation
CN110647834A (en) * 2019-09-18 2020-01-03 北京市商汤科技开发有限公司 Human face and human hand correlation detection method and device, electronic equipment and storage medium
CN111666839A (en) * 2020-05-25 2020-09-15 东华大学 Road pedestrian detection system based on improved Faster RCNN
CN111626200A (en) * 2020-05-26 2020-09-04 北京联合大学 Multi-scale target detection network and traffic identification detection method based on Libra R-CNN

Also Published As

Publication number Publication date
WO2022170742A1 (en) 2022-08-18
CN112528977A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN112528977B (en) Target detection method, target detection device, electronic equipment and storage medium
CN111368685B (en) Method and device for identifying key points, readable medium and electronic equipment
CN109508681A (en) The method and apparatus for generating human body critical point detection model
CN109858333B (en) Image processing method, image processing device, electronic equipment and computer readable medium
US11704357B2 (en) Shape-based graphics search
CN111369427A (en) Image processing method, image processing device, readable medium and electronic equipment
CN113704531A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN109325996B (en) Method and device for generating information
CN113177472A (en) Dynamic gesture recognition method, device, equipment and storage medium
CN112232311B (en) Face tracking method and device and electronic equipment
CN110349161A (en) Image partition method, device, electronic equipment and storage medium
CN114511661A (en) Image rendering method and device, electronic equipment and storage medium
CN110110666A (en) Object detection method and device
CN111209856B (en) Invoice information identification method and device, electronic equipment and storage medium
CN113762109B (en) Training method of character positioning model and character positioning method
CN114332590A (en) Joint perception model training method, joint perception device, joint perception equipment and medium
CN110110696A (en) Method and apparatus for handling information
CN110288691B (en) Method, apparatus, electronic device and computer-readable storage medium for rendering image
CN111741329A (en) Video processing method, device, equipment and storage medium
CN113610856B (en) Method and device for training image segmentation model and image segmentation
CN111353470B (en) Image processing method and device, readable medium and electronic equipment
CN115424060A (en) Model training method, image classification method and device
CN111968030B (en) Information generation method, apparatus, electronic device and computer readable medium
CN114022658A (en) Target detection method, device, storage medium and terminal
CN113762260A (en) Method, device and equipment for processing layout picture and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant