CN117611795A - Target detection method and model training method based on a multi-task AI large model

Target detection method and model training method based on a multi-task AI large model

Info

Publication number
CN117611795A
Authority
CN
China
Prior art keywords
determining
target
detection
loss function
model
Legal status
Pending
Application number
CN202311402511.1A
Other languages
Chinese (zh)
Inventor
姬东飞
杜雨亭
陆勤
龚建
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311402511.1A
Publication of CN117611795A


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a target detection method and a model training method based on a multi-task AI large model, and relates to the field of computer technology, in particular to the technical fields of artificial intelligence, neural network models, and smart cities. The target detection method is implemented as follows: identifying a target object in an image to be detected to obtain a first recognition result; determining a first alarm object from the first recognition result according to the confidence of the first recognition result and a first threshold corresponding to a first accuracy rate, and taking the first alarm object as a detection result; in a case where a trigger condition is met, performing target detection on an image pending supplementary recall corresponding to the first recognition result to obtain a second recognition result; determining a second alarm object from the second recognition result according to the confidence of the second recognition result and a second threshold corresponding to a second accuracy rate; and updating the detection result according to the second alarm object. The method can ensure high target-detection accuracy while reducing the missed-detection rate.

Description

Target detection method and model training method based on multi-task AI large model
Technical Field
The present disclosure relates to the field of computer technology, and in particular to the fields of artificial intelligence, neural network models, target detection, and smart city technology.
Background
Applying advanced technologies such as artificial intelligence enables refined management, intelligent analysis, and scientific planning of cities. For example, in application scenarios such as intelligent traffic, smart construction sites, smart communities, and smart elevators, an AI (Artificial Intelligence) model is used to identify targets such as pedestrians, motor vehicles, and non-motor vehicles appearing in surveillance video, in order to detect violating targets among them.
Disclosure of Invention
The disclosure provides a target detection method, a model training method, an apparatus, a device, and a storage medium based on a multi-task AI large model.
According to an aspect of the present disclosure, there is provided a target detection method based on a multi-task AI large model, including:
identifying a target object in an image to be detected to obtain a first recognition result;
determining a first alarm object from the first recognition result according to the confidence of the first recognition result and a first threshold corresponding to a first accuracy rate, and taking the first alarm object as a detection result;
in a case where a trigger condition is met, performing target detection on an image pending supplementary recall corresponding to the first recognition result to obtain a second recognition result;
determining a second alarm object from the second recognition result according to the confidence of the second recognition result and a second threshold corresponding to a second accuracy rate, wherein the first accuracy rate is higher than the second accuracy rate; and
updating the detection result according to the second alarm object.
According to another aspect of the present disclosure, there is provided a model training method for a multi-task AI large model, including:
acquiring sample images of a plurality of detection scenes, scene identifiers of the sample images, and real labels of the sample images;
performing target detection on the sample images by using a plurality of output networks of a second model to obtain a plurality of prediction labels, wherein the plurality of output networks correspond to the plurality of detection scenes;
determining a first loss function according to the real labels, the scene identifiers, and the plurality of prediction labels of the sample images; and
updating parameters of the second model according to the first loss function to obtain a trained multi-task detection model.
According to another aspect of the present disclosure, there is provided a target detection apparatus based on a multi-task AI large model, including:
a first recognition module configured to identify a target object in an image to be detected to obtain a first recognition result;
a first screening module configured to determine, according to the confidence of the first recognition result and a first threshold corresponding to a first accuracy rate, a first alarm object from the first recognition result, and take the first alarm object as a detection result;
a second recognition module configured to perform, in a case where a trigger condition is met, target detection on an image pending supplementary recall corresponding to the first recognition result, to obtain a second recognition result;
a second screening module configured to determine a second alarm object from the second recognition result according to the confidence of the second recognition result and a second threshold corresponding to a second accuracy rate, wherein the first accuracy rate is higher than the second accuracy rate; and
an updating module configured to update the detection result according to the second alarm object.
According to another aspect of the present disclosure, there is provided a model training apparatus for a multi-task AI large model, including:
an acquisition module configured to acquire sample images of a plurality of detection scenes, scene identifiers of the sample images, and real labels of the sample images;
a prediction module configured to perform target detection on the sample images by using a plurality of output networks of a second model, to obtain a plurality of prediction labels, wherein the plurality of output networks correspond to the plurality of detection scenes;
a loss function determining module configured to determine a first loss function according to the real labels, the scene identifiers, and the plurality of prediction labels of the sample images; and
a training module configured to update parameters of the second model according to the first loss function, so as to obtain a trained multi-task detection model.
According to another aspect of the present disclosure, there is provided an electronic device including:
At least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
The method and the apparatus can supplementarily recall missed targets while ensuring high target-detection accuracy, thereby reducing the missed-detection rate.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow diagram of a target detection method based on a multi-task AI large model according to an embodiment of the disclosure;
FIG. 2 is a flow diagram of a multi-scenario parallel object detection method according to an embodiment of the present disclosure;
FIG. 3 is a flow diagram of a method of target detection for single-scenario multiple-model integration according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram of a detection system according to an embodiment of the present disclosure;
FIG. 5 is a flow diagram of a model training method for a multi-task AI large model according to an embodiment of the disclosure;
FIG. 6 is a schematic structural diagram of a multi-task AI large model-based object detection apparatus in accordance with one embodiment of the disclosure;
FIG. 7 is a schematic diagram of a model training apparatus for a multi-task AI large model according to an embodiment of the disclosure;
fig. 8 is a block diagram of an electronic device used to implement an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, a target detection system may use a single model or an integration of multiple models: frames are extracted from the real-time stream of a surveillance video, and the recognition result is output by serially calling the single model or the multiple models. The recognition result typically includes one or more category labels and corresponding confidences, and may also include target position coordinates. In practical applications, the system needs to set a confidence threshold and pushes an alarm for each recognized target whose confidence exceeds that threshold.
AI models are mathematical models or algorithms in artificial intelligence systems used to solve a particular task or problem. These models may take various forms, such as mathematical expressions, neural networks, or decision trees. Through learning and training they extract patterns and rules from data, enabling the system to make decisions such as prediction, classification, and generation when facing new inputs.
For AI models, false recognition is unavoidable. Recognition results can be evaluated by precision (Precision) and recall (Recall), two competing evaluation metrics; because they differ in how they are computed and what they emphasize, high precision and high recall are difficult to achieve simultaneously. With a higher threshold, precision is higher but recall is lower; with a lower threshold, recall is higher but precision is lower.
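To make this trade-off concrete, the following sketch (illustrative only; the scores and correctness flags are invented for the example and are not part of the disclosure) computes precision and recall at two thresholds:

```python
# Illustrative only: a higher threshold raises precision but lowers recall.
# `detections` pairs a confidence score with whether the detection is correct.
detections = [
    (0.95, True), (0.90, True), (0.85, False), (0.80, True),
    (0.70, True), (0.60, False), (0.55, True), (0.40, False),
]
num_positives = 5  # ground-truth targets actually present in the example

def precision_recall(threshold: float) -> tuple[float, float]:
    kept = [correct for score, correct in detections if score > threshold]
    true_pos = sum(kept)
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / num_positives
    return precision, recall

print(precision_recall(0.75))  # (0.75, 0.6): higher precision, lower recall
print(precision_recall(0.50))  # (~0.71, 1.0): lower precision, higher recall
```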
In practical applications, such a scheme can only strike a balance between precision and recall, which limits the overall effectiveness of the system. To at least partially solve one or more of the above problems and other potential problems, embodiments of the present disclosure provide a target detection method that can supplementarily recall missed targets while ensuring high detection accuracy, reducing the missed-detection rate.
Fig. 1 is a flowchart of a target detection method based on a multi-task AI large model according to an embodiment of the disclosure. As shown in fig. 1, the method includes at least the following steps:
S101, identifying a target object in an image to be detected to obtain a first recognition result.
In the embodiments of the present disclosure, the target object may be understood as the object to be recognized when performing target detection on the image to be detected; it may be a person or a thing. Different detection scenes generally correspond to different target objects: in a pedestrian detection scene the target object is a pedestrian, and in a vehicle detection scene it is a motor vehicle or a non-motor vehicle. In other detection scenes the target object may be a movable object such as a drone or a vessel, or a specific static object such as a trash can, a manhole cover, a road guardrail, or a rooftop billboard. It may also be a person or vehicle performing a particular action, such as a pedestrian or motor vehicle running a red light, a vehicle parked in a no-parking area, a person entering a restricted area, or a person smoking in a no-smoking area; or an object in a specified state, such as a wellhead with a missing manhole cover, a displaced road guardrail, or an overflowing trash can.
Depending on the detection scene, the first recognition result may be a category label of the target object, and may also include at least one of detection-box information, position information, and segmentation information. The first recognition result typically also includes a confidence. These recognition results and confidences can be obtained using existing classification, detection, and segmentation models; other commonly used models and techniques can also play a role in target detection, for example:
Keypoint detection model (Keypoint Detection): the keypoint detection model identifies specific key points or their locations in images, such as joint points (e.g., shoulders, elbows, knees) in human pose detection. The output of the model is the position coordinates of each keypoint.
Pose estimation model (Pose Estimation): the pose estimation model integrates the keypoint detection results, connecting the keypoints into an overall pose of the object in the image, making it convenient to detect whether a target object performs a specific behavior.
Instance segmentation model (Instance Segmentation): building on target detection, the instance segmentation model not only marks the bounding box of each target but also segments the pixels inside it, thereby precisely distinguishing the pixels of different targets.
Scene understanding model (Scene Understanding): scene understanding models attempt to extract richer semantic information from images, such as objects, scenes, and relationships, going beyond pure object detection.
The output of the above models varies depending on the particular task and model type (a sketch of a recognition record that can carry these outputs follows the list below):
the classification model outputs a class label of the object.
The detection model outputs position information (typically coordinates of a bounding box) of the object and a class label.
The segmentation model outputs a pixel-level object mask for distinguishing between different objects in the image.
The key point detection model outputs the position coordinates of the key points.
The pose estimation model outputs pose information of the complete object.
The instance segmentation model outputs a pixel-level segmentation mask for each object.
The scene understanding model outputs semantic information about scenes, objects, relationships, etc.
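As a minimal sketch (field names are assumptions for illustration, not part of the disclosure), one recognition record carrying the heterogeneous outputs listed above could look like this, with unused fields left as None:

```python
# Hypothetical record for one recognized object; unused fields stay None.
from dataclasses import dataclass
from typing import Optional

@dataclass
class RecognitionResult:
    label: str                                                # category label
    confidence: float                                         # model confidence
    box: Optional[tuple[float, float, float, float]] = None   # x1, y1, x2, y2
    mask: Optional[list[list[int]]] = None                    # pixel-level mask
    keypoints: Optional[list[tuple[float, float]]] = None     # keypoint coords
```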
S102, determining, according to the confidence of the first recognition result and the first threshold corresponding to the first accuracy rate, a first alarm object from the first recognition result, and taking the first alarm object as the detection result.
In the embodiments of the present disclosure, the first accuracy rate may be understood as a high accuracy rate, typically between 90% and 100%, for example 95%. Determining the first alarm object from the first recognition result means that a target object in the first recognition result whose confidence is greater than the first threshold is determined to be a first alarm object.
The first threshold corresponding to the first accuracy rate can be understood as follows: when the first threshold is used to screen first alarm objects, the proportion of correctly screened alarm objects reaches the first accuracy rate. The first threshold may be determined using the first accuracy rate and sample data.
Determining the first alarm object from the first recognition result can be realized through category-based filtering and threshold-based filtering, two different strategies for recognizing and handling violating targets.
Class-based filtering (Class-based Filtering) means that, during target detection, detected target objects are first grouped by the category they belong to; subsequent processing or analysis, such as alerting, logging, or other handling, is then applied to target objects of specific categories.
For example, to detect both person and vehicle targets in a surveillance video, category-based filtering first separates the detected targets into person targets and vehicle targets and processes each group separately.
Threshold-based filtering (Threshold-based Filtering) means screening out targets with specific attributes or characteristics by setting a threshold during target detection. The threshold may be a measure related to the target's confidence or score; targets above the threshold are accepted, and targets below it are rejected.
For example, if the target detection model outputs a confidence score for each detected object, a score threshold can be set so that only objects whose confidence exceeds it are considered valid targets.
In summary, category-based and threshold-based filtering are both screening methods built on target detection; they can be used selectively according to the specific scene and requirements to realize accurate recognition and handling of violating targets, as shown in the sketch below.
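The composition of the two filters can be sketched as follows (reusing the hypothetical RecognitionResult record above; the categories and threshold are illustrative):

```python
# Illustrative composition of class-based and threshold-based filtering.
def filter_alarms(results, controlled_classes, threshold):
    """Keep results whose label is under control and whose confidence
    exceeds the threshold."""
    return [
        r for r in results
        if r.label in controlled_classes and r.confidence > threshold
    ]

# e.g. keep pedestrian/motor-vehicle alarms above an assumed first threshold
first_alarms = filter_alarms(results, {"pedestrian", "motor_vehicle"}, 0.55)
```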
The first alarm object may be understood as a high-accuracy detection result based on a high confidence threshold. After the first alarm object is determined, it can be stored in an alarm database and a corresponding alarm operation performed.
S103, in a case where the trigger condition is met, performing target detection on the image pending supplementary recall corresponding to the first recognition result by using the multi-task AI large model, to obtain a second recognition result.
In the embodiments of the present disclosure, the trigger condition may be that a predetermined time is reached, that the number of cached first recognition results reaches a predetermined value, or the like; the detection may also be triggered manually.
A multi-task AI large model is a model that simultaneously handles multiple related but distinct tasks or objectives and has a very large number of parameters, from millions to billions or more, giving the large model learning and expressive capabilities that an ordinary AI model cannot achieve. The multi-task AI large model combines the training and inference of multiple tasks in one model to improve the model's overall performance. Typically, the tasks share the model's feature extraction part but have separate output layers (model heads) for the different tasks. The feature extraction part can adopt common schemes such as the deep residual network ResNet, the Transformer, and their variants. The image pending supplementary recall can be understood as an image on which target detection needs to be performed again in order to reduce missed targets. It may be an image whose corresponding first recognition result did not reach the first threshold, or one whose first recognition result is below the first threshold but above another, lower threshold, so that target objects with excessively low confidence are excluded and unnecessary computation is reduced.
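The selection of images pending supplementary recall can be sketched as a confidence-band check (a minimal sketch; both threshold values are assumptions, and RecognitionResult is the hypothetical record above):

```python
# Hypothetical sketch: queue an image for supplementary recall when its best
# first-pass confidence falls below the first threshold but above a floor
# that discards hopeless detections.
FIRST_THRESHOLD = 0.55   # corresponds to the first (high) accuracy rate
FLOOR_THRESHOLD = 0.20   # assumed lower bound to avoid wasted computation

def needs_supplementary_recall(first_results) -> bool:
    return any(
        FLOOR_THRESHOLD < r.confidence <= FIRST_THRESHOLD for r in first_results
    )
```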
Target detection on the image pending supplementary recall can reuse the detection means of the first-pass detection, or use other methods and detection means, for example a new detection model, so that the two models are integrated. Integration refers to pooling the predictions of a series of different models together to obtain better predictions.
S104, determining a second alarm object from the second recognition result according to the confidence of the second recognition result and the second threshold corresponding to the second accuracy rate, wherein the first accuracy rate is higher than the second accuracy rate.
In the embodiments of the present disclosure, a target object whose confidence is greater than the second threshold is screened from the second recognition result as the second alarm object. Because the first accuracy rate is higher than the second accuracy rate, missed targets can be supplementarily recalled, remedying the misses caused by the first-pass detection being based on a high confidence threshold (the first threshold corresponding to the first accuracy rate).
Note that the first threshold and the second threshold apply to different detection means; although the first accuracy rate is higher than the second, the first threshold is not necessarily greater than the second threshold, because the capabilities of the different detection means differ.
S105, updating the detection result according to the second alarm object.
In the embodiments of the present disclosure, the obtained second alarm object is added to the alarm database, thereby updating the detection result. Before the addition, a deduplication against the first alarm objects may be performed.
According to the scheme of the embodiments of the present disclosure, a first detection means performs high-accuracy target detection, and another detection means supplementarily recalls missed targets while high accuracy is maintained, reducing the missed-detection rate and improving the overall effect.
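The update with deduplication might look like the following sketch; the IoU-based duplicate test is an assumption (the text only states that deduplication may be performed), and the box layout follows the hypothetical record above:

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def update_alarms(alarm_db, second_alarms, iou_thresh=0.5):
    """Append supplementary alarms, skipping ones that duplicate an
    existing alarm of the same label with heavy box overlap."""
    for alarm in second_alarms:
        duplicate = any(
            old.label == alarm.label and iou(old.box, alarm.box) > iou_thresh
            for old in alarm_db
        )
        if not duplicate:
            alarm_db.append(alarm)
```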
In one possible implementation, the target detection method of the embodiments of the present disclosure further includes the steps of:
selecting a plurality of third thresholds;
determining, using a sample data set, a third accuracy rate of selecting alarm objects under each of the plurality of third thresholds, where the sample data set contains sample pictures annotated with alarm objects; and
determining, from the plurality of third thresholds and according to the third accuracy rates and a target accuracy rate, the minimum threshold satisfying the target accuracy rate, to obtain the first threshold or the second threshold:
when the target accuracy rate is the first accuracy rate, the minimum threshold satisfying the target accuracy rate is taken as the first threshold;
when the target accuracy rate is the second accuracy rate, the minimum threshold satisfying the target accuracy rate is taken as the second threshold.
In the embodiments of the present disclosure, target detection is performed with a given detection means on sample pictures annotated with real labels, a threshold search is performed based on the detection results and the real labels, and a target threshold meeting the target accuracy rate is determined. In one example, the threshold search may be a grid search, producing a plurality of third thresholds by setting a search interval and a step size. For each third threshold, the third accuracy rate of the screened alarm objects is computed against the real labels; the minimum among the third thresholds whose third accuracy rate exceeds the target accuracy rate is then taken as the target threshold (the first threshold or the second threshold). For example, let the third thresholds be 0.40, 0.45, 0.50, 0.55, and 0.60, and screen the detection results with each threshold to obtain alarm objects. Comparing the alarm objects with the real labels of the sample pictures gives the accuracy rate for each third threshold, say 67%, 76%, 88%, 96%, and 97% respectively. The third accuracy rates for 0.55 and 0.60, namely 96% and 97%, exceed the first accuracy rate of 95%, so the minimum of the two, 0.55, is selected as the first threshold. The second threshold corresponding to the second accuracy rate can be obtained in the same way.
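The grid search can be sketched as follows (a simplified sketch: the interval, step, and the accuracy computation against annotated data mirror the example above but are not a definitive implementation):

```python
def search_min_threshold(score_correct_pairs, target_accuracy,
                         lo=0.40, hi=0.60, step=0.05):
    """Return the smallest candidate threshold whose screening accuracy
    meets the target accuracy, or None if no candidate qualifies."""
    threshold = lo
    while threshold <= hi + 1e-9:
        kept = [ok for score, ok in score_correct_pairs if score > threshold]
        accuracy = sum(kept) / len(kept) if kept else 0.0
        if accuracy >= target_accuracy:
            return threshold  # candidates ascend, so the first hit is minimal
        threshold += step
    return None

# e.g. with accuracies as in the example above, a 95% target yields 0.55
first_threshold = search_min_threshold(pairs, target_accuracy=0.95)
```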
According to the scheme of the embodiments of the present disclosure, since in normal operation accuracy matters more than recall, asynchronous detection with thresholds corresponding to a high and a low accuracy rate is used: high-accuracy alarm objects are determined first, and missed targets are then supplemented at the lower accuracy rate, reducing frequent false-alarm pushes while improving recall.
In one possible implementation, step S101 of identifying a target object in the image to be detected to obtain a first recognition result further includes the steps of:
S1011, selecting a plurality of first models according to the detection scene corresponding to the image to be detected.
S1012, identifying, with each of the plurality of first models, target objects associated with the detection scene in the image to be detected, to obtain a plurality of third recognition results.
S1013, fusing the plurality of third recognition results to obtain the first recognition result.
In the embodiments of the present disclosure, the first model includes, but is not limited to, any model commonly used for target detection, such as the classification, detection, segmentation, keypoint detection, pose estimation, instance segmentation, and scene understanding models described under step S101. One or more first models can serve as the detection means for each detection scene; that is, one or more first models correspond to one detection scene. In other words, the first model is a small model trained specifically for one detection scene and can therefore reach higher accuracy. Meanwhile, since the first models of the respective detection scenes are independent of one another, the tasks of multiple detection scenes can be executed in parallel.
In one example, as shown in fig. 2, a pedestrian detection task (scene 1), a motor-vehicle red-light-running detection task (scene 2), and a motor-vehicle illegal-parking detection task (scene 3) are executed simultaneously for one monitoring device. An extracted frame (the image to be detected) is obtained from the video stream collected by the monitoring device and is input separately into a detection model for pedestrians (AI small model 1), a detection model for motor vehicles running red lights (AI small model 2), and a detection model for illegally parked motor vehicles (AI small model 3). The three different first models can execute in parallel; after the recognition results are obtained, the high-accuracy first alarm objects are screened out through category-based and threshold-based filtering. The detection system can thus support multiple parallel scenes, and each AI small model may be a classification, detection, or segmentation model, changing as the scene changes.
In another example, as shown in fig. 3, for one detection scene, the extracted frame may be input into 3 different detection models simultaneously for target detection, yielding 3 recognition results, which are then fused into the first recognition result. Fusion methods include common integration methods such as taking the intersection, taking the union, and weighted fusion of detection boxes (see the sketch below). The first alarm objects with high accuracy are then screened out through category-based and threshold-based filtering.
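A minimal union-style fusion sketch, reusing the iou helper above; keeping the higher-confidence box when same-label boxes overlap is one simple variant, and a confidence-weighted box average is another common choice (all details here are illustrative):

```python
def fuse_union(model_outputs, iou_thresh=0.55):
    """Union-style fusion: pool all detections from several models, then
    keep only the highest-confidence box in each overlapping group."""
    pooled = sorted(
        (r for out in model_outputs for r in out),
        key=lambda r: r.confidence, reverse=True,
    )
    fused = []
    for det in pooled:
        is_duplicate = any(
            kept.label == det.label and iou(kept.box, det.box) > iou_thresh
            for kept in fused
        )
        if not is_duplicate:
            fused.append(det)
    return fused

first_result = fuse_union([out_model_1, out_model_2, out_model_3])
```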
According to the scheme of the embodiments of the present disclosure, the detection system can support multiple scenes in parallel, improving the system's detection efficiency and capability, while a single scene can also support multi-model integration, improving accuracy. Meanwhile, using a small model specially trained for each scene for the first round of target detection more effectively guarantees the accuracy of the screened alarm objects.
In one possible implementation, step S101 of identifying a target object in an image to be detected to obtain a first recognition result further includes the steps of:
extracting video frames from the surveillance video stream at preset time intervals to obtain the image to be detected; and
identifying the target object in the image to be detected to obtain the first recognition result.
In the embodiments of the present disclosure, the surveillance video stream is collected by a monitoring device, and frames are extracted from the stream at preset time intervals to obtain the extracted frames, i.e., the images to be detected, as sketched below.
According to the scheme of the embodiments of the present disclosure, using frame extraction for target detection on surveillance video reduces computation cost, reduces redundant information, saves storage space, speeds up the model, and mitigates data imbalance. In some scenes, a target appears in the video only briefly while most of the time the video shows a static background; frame extraction alleviates this imbalance so that the model can learn the target's features more easily.
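Frame extraction at a fixed interval can be sketched with OpenCV (the stream URL and interval are placeholders; a minimal sketch, not the disclosed implementation):

```python
import cv2

def extract_frames(stream_url: str, interval_s: float = 1.0):
    """Yield one frame from the video stream every `interval_s` seconds."""
    cap = cv2.VideoCapture(stream_url)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS is unknown
    frame_step = max(1, int(fps * interval_s))
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % frame_step == 0:
            yield frame                        # the image to be detected
        index += 1
    cap.release()
```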
In one possible implementation, step S102 of determining, according to the confidence of the first recognition result and the first threshold corresponding to the first accuracy rate, the first alarm object from the first recognition result as the detection result further includes the steps of:
S1021, determining the confidences of a plurality of target objects in the first recognition result.
S1022, screening, from the plurality of target objects, the target objects whose confidence is higher than the first threshold, to obtain the first alarm objects as the detection result.
In the embodiments of the present disclosure, the first recognition result generally includes the category labels and confidences of a plurality of objects output by the first model. According to the category labels, the target objects associated with the detection scene can be screened out; the first alarm objects can then be screened from those target objects based on the confidence and the first threshold.
According to the scheme of the embodiments of the present disclosure, alarm objects are efficiently screened from the recognition results output by the model through category-based and threshold-based filtering.
In one possible implementation, the target detection method of the embodiments of the present disclosure further includes the step of:
caching, according to the first recognition result, the frame of video image corresponding to the first recognition result, or multiple frames of video images within a preset time window, as images pending supplementary recall.
In an embodiment of the present disclosure, the first recognition result may be stored in a recognition-result cache database, whose stored information includes: the timestamp of each recognition result, the extracted frame, the detection result (including but not limited to category labels, detection-box information, segmentation information, and confidence), the control category, and so on.
The recognition-result cache database can store data per monitoring device. It may cache only the recognition result of the current frame, or cache the recognition results of several frames within a window in the form of a sliding window, recording the control categories of all scene tasks corresponding to the current device. The sliding window may be centered on the current frame, with the time window determined by a preset duration.
According to the scheme of the embodiments of the present disclosure, caching the first recognition result facilitates supplementary recall.
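A per-device sliding-window cache can be sketched as follows (the window size and entry fields are assumptions for illustration):

```python
from collections import defaultdict, deque

WINDOW = 30  # assumed: keep the most recent 30 frame entries per device

# device id -> recent entries of (timestamp, frame, results, control categories)
result_cache = defaultdict(lambda: deque(maxlen=WINDOW))

def cache_result(device_id, timestamp, frame, results, control_categories):
    result_cache[device_id].append(
        {"timestamp": timestamp, "frame": frame,
         "results": results, "control": control_categories}
    )
```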
In one possible implementation, the multi-task AI large model is trained from sample images of a plurality of detection scenarios.
In the embodiments of the present disclosure, the multi-task AI large model is different from the first model; detecting with both realizes a model-integration effect. For several independently trained models, each observes the data from a slightly different angle and produces its own prediction. Each prediction is only part of the "truth", not all of it: each model tries to understand the training data under its own assumptions and from its own perspective, capturing part but not all of the truth in the data. Converging the views of the models therefore yields a more accurate representation.
It should be noted that even when two models are trained from the same training data and the same initial model, the resulting models may differ slightly due to randomness in the training process.
In the related art of target detection, the multi-task AI large model may be a multi-task detection model trained from sample images of various detection scenes and may cover the following common tasks:
object detection: an object in the image is identified and its location is located.
Object classification: it is determined to which category the detected object belongs.
Object segmentation: the object in the image is segmented into pixel-level regions.
And (3) key point detection: key points in the image, such as eyes, nose and other parts of the face, are identified.
Posture estimation: the pose of the human body or object, including the position of the joints, is estimated.
And (3) shielding detection: it is detected whether the object is occluded by another object or an occlusion.
Scene understanding: the scene in the image is understood at a high level, such as judging whether indoor, outdoor and the like.
Generating an image description: a natural language description is generated that is related to the image.
Because there are multiple detection scenes, each containing one or more different tasks, a multi-task detection model is used for supplementary recall. Its advantage is that multiple related tasks can be solved simultaneously, improving the model's overall performance: the shared feature extraction part helps the model learn general feature representations, which improves the performance of each task, saves computing resources, and reduces the number of models.
In one example, as shown in fig. 4, the detection system includes a main system and a subsystem. The main system realizes high-accuracy target detection based on the first models (the AI small models in the figure), and the subsystem realizes supplementary recall of targets missed by the main system based on the multi-task AI large model. A first model is usually used to execute one specific task and, compared with the multi-task AI large model, has fewer parameters and less training data, hence the name AI small model. The subsystem supports timed triggering: it searches per video device, obtains the historical extracted frames of a monitoring device, calls the multi-task AI large model for inference, filters the targets according to the control categories of all scene tasks of the current monitoring device, and filters them again with the second threshold corresponding to the second accuracy rate (for example, any value between 80% and 90%) to obtain the second alarm objects. The targets are then filtered against the historical alarm results to avoid repeated alarms, yielding the supplementary-recall alarm targets, which are stored in the alarm database with the historical timestamp of each recalled target as the alarm time.
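Putting the subsystem together, a timed pass for one device might look like the following sketch, reusing filter_alarms, update_alarms, and result_cache from above (all helper names, the detect call, and SECOND_THRESHOLD are hypothetical):

```python
SECOND_THRESHOLD = 0.45  # assumed value searched for the second accuracy rate

def supplementary_recall_pass(device_id, large_model, alarm_db):
    """Timed subsystem pass: re-detect cached frames with the multi-task
    AI large model and recall targets the main system missed."""
    for entry in list(result_cache[device_id]):
        results = large_model.detect(entry["frame"])           # second pass
        candidates = filter_alarms(results, entry["control"], SECOND_THRESHOLD)
        for alarm in candidates:
            alarm.timestamp = entry["timestamp"]  # keep historical alarm time
        update_alarms(alarm_db, candidates)       # dedup against history
```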
According to the scheme of the embodiments of the present disclosure, the detection system is divided into a main system and an asynchronously parallel subsystem; the targets missed by the main system because of its high confidence threshold are supplementarily recalled via timed triggering. The subsystem has no strict real-time requirement; to improve the end-to-end effect it is realized with the multi-task detection model, whose massive parameters improve recognition accuracy and provide the main system with more accurate supplementary-recall targets.
Fig. 5 is a flowchart of a model training method for a multi-task AI large model according to an embodiment of the disclosure. As shown in fig. 5, the method includes at least the following steps:
S501, acquiring sample images of a plurality of detection scenes, scene identifiers of the sample images, and real labels of the sample images.
S502, performing target detection on the sample images by using a plurality of output networks of the second model, to obtain a plurality of prediction labels, where the plurality of output networks correspond to the plurality of detection scenes.
S503, determining a first loss function according to the real labels, the scene identifiers, and the plurality of prediction labels of the sample images.
S504, updating parameters of the second model according to the first loss function, to obtain the trained multi-task detection model.
In the embodiments of the present disclosure, training the model with sample images of multiple detection scenes lets the model draw on the knowledge of each scene, improving the model's effect. Mixed multi-scene training, however, can cause missing training labels: for example, the data of scene 1 annotates only scene-1 targets and the data of scene 2 annotates only scene-2 targets, so scene-1 data lacks scene-2 annotations and vice versa. To solve this problem, a scene identifier can be added to each sample image so that the output network (task head) corresponding to the current task is trained only on images annotated for the current scene.
The second model includes a feature extractor network, a feature pyramid network, and a plurality of output networks corresponding to different scenes, where each scene may correspond to different tasks. The feature extractor can adopt common schemes such as the deep residual network ResNet, the Transformer, and their variants. The feature extractor may be pretrained by self-supervision using a common scheme such as Masked Autoencoders ("Masked Autoencoders Are Scalable Vision Learners", MAE) to obtain an initialization model. The feature pyramid can adopt common schemes such as the Feature Pyramid Network (FPN) or the Bi-directional Feature Pyramid Network (BiFPN).
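The structure of the second model can be written as a PyTorch-style sketch (the module choices are placeholders: the text names ResNet/Transformer backbones and FPN/BiFPN necks but does not fix an implementation):

```python
import torch.nn as nn

class MultiTaskDetector(nn.Module):
    """Shared backbone + feature pyramid, one output network (head) per scene."""
    def __init__(self, backbone: nn.Module, fpn: nn.Module,
                 heads: dict[str, nn.Module]):
        super().__init__()
        self.backbone = backbone           # e.g. ResNet or a Transformer variant
        self.fpn = fpn                     # e.g. FPN or BiFPN
        self.heads = nn.ModuleDict(heads)  # one head per detection scene

    def forward(self, images):
        features = self.fpn(self.backbone(images))
        # every head predicts; training masks out the heads of other scenes
        return {scene: head(features) for scene, head in self.heads.items()}
```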
For mixed multi-scene training, each sample image must be labeled with its scene index. In one example, scene 1 is person detection (containing n1 subclasses, with indices 0 to n1-1), scene 2 is motor vehicle detection (n2 subclasses, indices 0 to n2-1), scene 3 is non-motor vehicle detection (n3 subclasses, indices 0 to n3-1), and scene 4 is urban-management violation detection (n4 subclasses, indices 0 to n4-1). The j-th subclass label of the i-th scene is re-indexed into a unified label space, and the scene index i is recorded on the sample. One possible re-indexing scheme is sketched below.
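The exact re-indexing formula is not reproduced in this text. Purely as an assumption, one common scheme offsets each scene's subclass indices by the cumulative class counts of the preceding scenes:

```python
# Assumed re-indexing scheme (not taken from the patent text): the global
# label of subclass j in scene i is offset by the class counts of the
# preceding scenes 1..i-1.
def global_label(scene_index: int, subclass_index: int,
                 classes_per_scene: list[int]) -> int:
    return sum(classes_per_scene[:scene_index - 1]) + subclass_index

# e.g. with n1=3, n2=4: subclass 2 of scene 2 maps to 3 + 2 = 5
print(global_label(2, 2, [3, 4, 5, 6]))  # 5
```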
According to the scheme of the embodiments of the present disclosure, the multi-task head network gives the model knowledge of every scene, so the knowledge of the scenes can be mutually referenced and the model effect improved. Meanwhile, a single model supports multi-scene capability, which reduces deployment cost.
In one possible implementation, S503, determining a first loss function according to the real label, the scene identifier, and the prediction labels of the sample image, includes:
S5031, determining a target output network and other output networks from the plurality of output networks of the second model according to the scene identifier of the sample image, where the scene identifier is associated with the target output network.
S5032, determining a second loss function according to the real label of the sample image and the prediction label generated by the target output network.
S5033, determining a third loss function according to the real label of the sample image, the prediction labels generated by the other output networks, and a task mask.
S5034, determining the first loss function according to the second loss function and the third loss function.
In the embodiments of the present disclosure, the target output network is the output network corresponding to the scene of the sample image. The second loss function, determined from the real label of the sample image and the prediction label generated by the target output network, may be used to train the model. The third loss function, determined from the real label of the sample image and the prediction labels generated by the other output networks, must be discarded to prevent it from participating in the backpropagation update of the model parameters; the third loss function is therefore multiplied by the task mask, zeroing it out.
According to the scheme of the embodiments of the present disclosure, the output network corresponding to the current task is thus guaranteed to be trained only on images annotated for the current scene.
In one possible implementation, S5032 determines a second loss function according to the real label of the sample image and the predicted label generated by the target output network, including the steps of:
and determining task weights according to the scene identifications.
And determining a second loss function according to the real label of the sample image, the prediction label generated by the target output network and the task weight.
According to the scheme of the embodiments of the present disclosure, the task weights regulate the importance of each scene's task; different scenes can attach different importance to different tasks. By dynamically adjusting the task weights, the model can better adapt to scene changes, and the weight of each task can be flexibly tuned to specific needs, so that multi-task learning adapts better to different scenes and applications.
In one possible implementation, S5032 determines a second loss function according to the real label of the sample image and the predicted label generated by the target output network, including the steps of:
in the case where the target output network includes a plurality of subtasks, a subtask weight for each of the plurality of subtasks is determined.
For each of the plurality of subtasks, a fourth loss function is determined based on the real label of the sample image and the predictive label generated by the target output network.
And determining a second loss function according to the fourth loss function and the corresponding subtask weight.
In the embodiments of the present disclosure, since each scene may correspond to different tasks, a classification task needs only a classification loss function, such as the cross-entropy loss (CE Loss). A detection task additionally needs a bounding-box regression loss function, such as Smooth L1 Loss. A segmentation task additionally needs a segmentation loss, such as the binary cross-entropy loss (BCE Loss).
In one example, a scene is a detection task, and the loss function of the i-th detection task is calculated as follows:

$$\mathcal{L}_i = \mathrm{mask}_i \cdot \alpha_i \cdot \left(\mathcal{L}^{i}_{\mathrm{cls}} + \beta_i \, \mathcal{L}^{i}_{\mathrm{reg}}\right)$$

where $\mathcal{L}^{i}_{\mathrm{cls}}$ is the classification loss function, $\mathcal{L}^{i}_{\mathrm{reg}}$ is the box-regression (localization) loss function, $\alpha_i$ is the task weight adjusting the importance of each scene task, $\beta_i$ is the subtask weight adjusting the relative importance of the current scene's classification and regression tasks, and $\mathrm{mask}_i$ is the task-mask weight of the current scene: it is 1 for losses obtained from the prediction labels of the target output network and 0 for losses obtained from the prediction labels of other output networks, ensuring that the current task head is trained only on sample images of the current scene.
According to the scheme of the embodiments of the present disclosure, adjusting the task weights makes the model focus more on the tasks that matter most to the current scene or application, improving the model's performance on specific tasks.
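The masked, weighted loss can be sketched in code (a minimal sketch under the formula above; classification_loss and box_regression_loss are hypothetical stand-ins for, e.g., cross-entropy and Smooth L1, and the batch fields are assumptions):

```python
def multi_task_loss(batch, heads_outputs, alphas, betas):
    """Sum per-scene detection losses; heads of other scenes are masked out."""
    total = 0.0
    for i, scene in enumerate(heads_outputs):
        # hypothetical helpers, e.g. cross-entropy and Smooth L1 losses
        cls_loss = classification_loss(heads_outputs[scene], batch.labels)
        reg_loss = box_regression_loss(heads_outputs[scene], batch.boxes)
        mask = 1.0 if batch.scene_id == scene else 0.0   # task mask
        total = total + mask * alphas[i] * (cls_loss + betas[i] * reg_loss)
    return total
```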
In one possible implementation, S504, updating parameters of the second model according to the first loss function to obtain a trained multi-task detection model, includes:
updating, according to the first loss function, the parameters of the second model other than the other output networks, to obtain the trained multi-task detection model.
According to the scheme of the embodiments of the present disclosure, the first loss function contains the loss obtained from the target output network, while the losses of the other output networks are zeroed by the task mask; hence, when backpropagating the first loss function to update the model parameters, the parameters of the other output networks are not affected.
Fig. 6 is a schematic structural diagram of a target detection device 600 based on a multi-task AI large model according to an embodiment of the disclosure. As shown in fig. 6, the device includes:
the first recognition module 601 is configured to recognize a target object in an image to be detected, and obtain a first recognition result.
The first filtering module 602 is configured to determine, according to the confidence level of the first recognition result and a first threshold corresponding to the first accuracy rate, a first alarm object from the first recognition result, as a detection result.
And the second recognition module 603 is configured to perform, in a case where the trigger condition is met, target detection on the image pending supplementary recall corresponding to the first recognition result by using the multi-task detection model, to obtain a second recognition result.
The second filtering module 604 is configured to determine a second alarm object from the second recognition result according to the confidence level of the second recognition result and a second threshold corresponding to the second accuracy rate. Wherein the first accuracy is higher than the second accuracy.
And the updating module 605 is configured to update the detection result according to the second alarm object.
In one possible implementation manner, the apparatus further includes a threshold determining module configured to:
a plurality of third thresholds is selected.
And determining a third accuracy rate of selecting the alarm object according to a plurality of third thresholds by using a sample data set, wherein the sample data set contains sample pictures marked with the alarm object.
And determining a first threshold value from a plurality of third threshold values according to the third accuracy rate and the preset first accuracy rate. Or alternatively
And determining a second threshold value from a plurality of third threshold values according to the third accuracy rate and the preset second accuracy rate.
In one possible implementation, the first identification module 601 is configured to:
and selecting a plurality of first models according to the detection scenes corresponding to the images to be detected.
And identifying, with each of the plurality of first models, target objects associated with the detection scene in the image to be detected, to obtain a plurality of third recognition results.
And fusing the plurality of third recognition results to obtain a first recognition result.
In one possible implementation, the first identification module 601 is configured to:
and extracting video frames from the monitoring video stream at preset time intervals to obtain an image to be detected.
And identifying the target object in the image to be detected to obtain a first identification result.
In one possible implementation, the first screening module 602 is configured to:
confidence levels of a plurality of target objects in the first recognition result are determined.
And screening target objects with confidence coefficient higher than a first threshold value from the plurality of target objects to obtain a first alarm object as a detection result.
In one possible implementation, the apparatus further includes:
the buffer module is used for buffering one frame of video image corresponding to the first identification result or multiple frames of video images in a preset time window according to the first identification result to serve as images to be complemented.
In one possible implementation, the multi-task detection model is trained from sample images of multiple detection scenes.
Fig. 7 is a schematic structural diagram of a model training apparatus 700 for a multi-task AI large model according to an embodiment of the disclosure. As shown in fig. 7, the apparatus includes:
the acquiring module 701 is configured to acquire sample images of various detection scenes, scene identifiers of the sample images, and real labels of the sample images.
The prediction module 702 is configured to perform target detection on the sample image by using a plurality of output networks of the second model, so as to obtain a plurality of prediction labels, where the plurality of output networks correspond to the plurality of detection scenes.
The loss function determining module 703 is configured to determine a first loss function according to the real label of the sample image, the scene identifier, and the plurality of prediction labels. And
The training module 704 is configured to update parameters of the second model according to the first loss function, so as to obtain a trained multi-task detection model.
In one possible implementation, the loss function determination module 703 includes:
The scene determination submodule is configured to determine a target output network and other output networks from the plurality of output networks of the second model according to the scene identifier of the sample image, wherein the scene identifier is associated with the target output network.
The first determining submodule is configured to determine a second loss function according to the real label of the sample image and the prediction label generated by the target output network.
The second determining submodule is configured to determine a third loss function according to the real label of the sample image, the prediction labels generated by the other output networks, and a task mask.
The summarizing submodule is configured to determine the first loss function according to the second loss function and the third loss function.
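The composition of these submodules can be pictured as below, where the task mask is 1 for the target output network and 0 for the others; the helper names and mask convention are assumptions for this sketch.

```python
# Illustrative first-loss composition: a second loss on the target head plus
# a task-masked third loss that silences the other output networks.
def first_loss(labels, scene_id, predictions, detection_loss):
    total = None
    for scene, pred in predictions.items():
        mask = 1.0 if scene == scene_id else 0.0  # task mask per output network
        term = mask * detection_loss(pred, labels)
        total = term if total is None else total + term
    return total  # = second loss (mask 1) + third loss (masked-out terms)
```

Masking rather than skipping the other heads keeps the computation uniform across samples, which simplifies batching sample images from different scenes together.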
In one possible implementation, the first determining submodule is configured to:
Determining a task weight according to the scene identifier.
Determining the second loss function according to the real label of the sample image, the prediction label generated by the target output network, and the task weight.
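A short sketch of this task-weight variant, with illustrative weight values that do not come from the disclosure:

```python
# Assumed per-scene task weights; a rarer or harder scene can count more.
TASK_WEIGHTS = {"street": 1.0, "construction_site": 2.0}

def weighted_second_loss(scene_id, prediction, label, detection_loss):
    task_weight = TASK_WEIGHTS.get(scene_id, 1.0)  # task weight from scene id
    return task_weight * detection_loss(prediction, label)
```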
In one possible implementation, the first determining submodule is configured to:
In a case where the target output network includes a plurality of subtasks, determining a subtask weight for each of the plurality of subtasks.
For each of the plurality of subtasks, determining a fourth loss function according to the real label of the sample image and the prediction label generated by the target output network.
Determining the second loss function according to the fourth loss functions and the corresponding subtask weights.
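For a head with several subtasks (say, classification and box regression), the weighted combination might be sketched as follows; all names and weights are assumptions.

```python
# Illustrative subtask-weighted second loss.
def second_loss(subtask_preds: dict, subtask_labels: dict,
                subtask_loss_fns: dict, subtask_weights: dict):
    total = 0.0
    for name, pred in subtask_preds.items():
        fourth = subtask_loss_fns[name](pred, subtask_labels[name])  # fourth loss
        total = total + subtask_weights[name] * fourth               # weighted sum
    return total
```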
In one possible implementation, the training module 704 is configured to:
Updating the parameters of the second model other than those of the other output networks according to the first loss function, to obtain the trained multi-task detection model.
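One reading of this selective update, sketched with an optimizer that simply omits the other output networks so their parameters are never touched (zeroing their gradients would be an equivalent alternative); the helper is hypothetical.

```python
# Illustrative: only the shared backbone and the target output network are
# handed to the optimizer; the other heads stay frozen during this update.
def trainable_parameters(backbone, heads: dict, target_scene: str):
    params = list(backbone.parameters())
    params += list(heads[target_scene].parameters())
    return params

# e.g. optimizer = torch.optim.AdamW(trainable_parameters(backbone, heads, "street"))
```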
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
In the technical solution of the present disclosure, images captured by monitoring equipment and obtained by frame extraction are used to detect offending targets for the purpose of maintaining public security; the collected personal images are used only for maintaining public security and not for any other purpose.
In the technical solution of the present disclosure, the acquisition, storage and application of the personal information involved comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 8 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 8, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, for example, the target detection method or the training method of the multi-task detection model. For example, in some embodiments, the target detection method or the training method of the multi-task detection model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the above-described target detection method or training method of the multi-task detection model may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the target detection method or the training method of the multi-task detection model in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementation in one or more computer programs, where the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (27)

1. A target detection method based on a multi-task AI large model, comprising:
identifying a target object in an image to be detected to obtain a first recognition result;
determining a first alarm object from the first recognition result according to the confidence coefficient of the first recognition result and a first threshold corresponding to a first accuracy rate, and taking the first alarm object as a detection result;
in a case where a trigger condition is met, performing target detection on the image to be complemented corresponding to the first recognition result by using the multi-task AI large model, to obtain a second recognition result;
determining a second alarm object from the second recognition result according to the confidence coefficient of the second recognition result and a second threshold corresponding to a second accuracy rate, wherein the first accuracy rate is higher than the second accuracy rate; and
updating the detection result according to the second alarm object.
2. The method of claim 1, further comprising:
selecting a plurality of third thresholds;
determining a third accuracy rate of selecting the alarm object according to the plurality of third thresholds by using a sample data set, wherein the sample data set comprises sample pictures marked with the alarm object;
and determining a minimum threshold value meeting the target accuracy rate from the plurality of third threshold values according to the third accuracy rate and the target accuracy rate to obtain the first threshold value or the second threshold value.
3. The method of claim 1, wherein identifying the target object in the image to be detected to obtain the first recognition result comprises:
selecting a plurality of first models according to the detection scene corresponding to the image to be detected;
identifying a target object associated with the detection scene in the image to be detected by using each of the plurality of first models, so as to obtain a plurality of third recognition results;
and fusing the plurality of third recognition results to obtain the first recognition result.
4. The method of claim 1, wherein identifying the target object in the image to be detected to obtain the first recognition result comprises:
extracting video frames from the monitoring video stream at preset time intervals to obtain an image to be detected;
and identifying the target object in the image to be detected to obtain the first recognition result.
5. The method of claim 1, wherein determining a first alarm object from the first recognition result as a detection result according to the confidence coefficient of the first recognition result and a first threshold corresponding to a first accuracy rate comprises:
determining the confidence coefficients of a plurality of target objects in the first recognition result;
and screening, from the plurality of target objects, target objects with a confidence coefficient higher than the first threshold, to obtain the first alarm object as the detection result.
6. The method of claim 1, further comprising:
caching, according to the first recognition result, the video frame corresponding to the first recognition result or multiple video frames within a preset time window, as the image to be complemented.
7. The method of claim 1 or 6, wherein the multi-task AI large model is trained from sample images of multiple detection scenes.
8. A model training method for a multi-task AI large model, comprising:
acquiring sample images of various detection scenes, scene identifiers of the sample images and real labels of the sample images;
performing target detection on the sample image by using a plurality of output networks of the second model to obtain a plurality of prediction labels, wherein the plurality of output networks correspond to the plurality of detection scenes;
determining a first loss function according to the real label of the sample image, the scene identifier and the plurality of prediction labels; and
updating parameters of the second model according to the first loss function, to obtain a trained multi-task AI large model.
9. The method of claim 8, wherein determining a first loss function according to the real label of the sample image, the scene identification and the plurality of prediction labels comprises:
determining a target output network and other output networks from a plurality of output networks of the second model according to the scene identification of the sample image; wherein the scene identification is associated with the target output network;
determining a second loss function according to the real label of the sample image and the prediction label generated by the target output network;
determining a third loss function according to the real label of the sample image, the prediction labels generated by the other output networks and a task mask;
and determining a first loss function according to the second loss function and the third loss function.
10. The method of claim 9, wherein determining a second loss function according to the real label of the sample image and the prediction label generated by the target output network comprises:
determining task weights according to the scene identifications;
and determining a second loss function according to the real label of the sample image, the prediction label generated by the target output network and the task weight.
11. The method of claim 9, wherein determining a second loss function according to the real label of the sample image and the prediction label generated by the target output network comprises:
determining a subtask weight of each subtask of a plurality of subtasks in the case that the target output network comprises the subtasks;
determining a fourth loss function according to the real label of the sample image and the prediction label generated by the target output network for each of the plurality of subtasks;
and determining a second loss function according to the fourth loss functions and the corresponding subtask weights.
12. The method of claim 9, wherein updating parameters of the second model according to the first loss function to obtain a trained multi-task AI large model comprises:
and updating parameters of the part except the other output networks in the second model according to the first loss function to obtain a trained multi-task AI large model.
13. A target detection apparatus based on a multi-task AI large model, comprising:
the first identification module is used for identifying a target object in an image to be detected to obtain a first recognition result;
the first screening module is used for determining a first alarm object from the first recognition result according to the confidence coefficient of the first recognition result and a first threshold corresponding to a first accuracy rate, and taking the first alarm object as a detection result;
the second recognition module is used for performing target detection on the image to be complemented corresponding to the first recognition result by using the multi-task AI large model in a case where a trigger condition is met, so as to obtain a second recognition result;
the second screening module is used for determining a second alarm object from the second recognition result according to the confidence coefficient of the second recognition result and a second threshold corresponding to a second accuracy rate, wherein the first accuracy rate is higher than the second accuracy rate; and
the updating module is used for updating the detection result according to the second alarm object.
14. The apparatus of claim 13, further comprising a threshold determination module to:
selecting a plurality of third thresholds;
determining a third accuracy rate of selecting the alarm object according to the plurality of third thresholds by using a sample data set, wherein the sample data set comprises sample pictures marked with the alarm object;
and determining a minimum threshold value meeting the target accuracy rate from the plurality of third threshold values according to the third accuracy rate and the target accuracy rate to obtain the first threshold value or the second threshold value.
15. The apparatus of claim 13, wherein the first identification module is to:
selecting a plurality of first models according to the detection scene corresponding to the image to be detected;
identifying a target object associated with the detection scene in the image to be detected by using each of the plurality of first models, so as to obtain a plurality of third recognition results;
and fusing the plurality of third recognition results to obtain a first recognition result.
16. The apparatus of claim 13, wherein the first identification module is to:
extracting video frames from the monitoring video stream at preset time intervals to obtain an image to be detected;
and identifying the target object in the image to be detected to obtain the first recognition result.
17. The apparatus of claim 13, wherein the first screening module is to:
determining the confidence coefficients of a plurality of target objects in the first recognition result;
and screening, from the plurality of target objects, target objects with a confidence coefficient higher than the first threshold, to obtain the first alarm object as the detection result.
18. The apparatus of claim 13, further comprising:
and the caching module is used for caching, according to the first recognition result, the video frame corresponding to the first recognition result or multiple video frames within a preset time window, as the image to be complemented.
19. The apparatus of claim 13 or 18, wherein the multi-task AI large model is trained from sample images of multiple detection scenes.
20. A model training apparatus for a multi-task AI large model, comprising:
the acquisition module is used for acquiring sample images of various detection scenes, scene identifications of the sample images and real labels of the sample images;
the prediction module is used for performing target detection on the sample image by using a plurality of output networks of the second model to obtain a plurality of prediction labels, wherein the plurality of output networks correspond to the plurality of detection scenes;
the loss function determining module is used for determining a first loss function according to the real label of the sample image, the scene identifier and the plurality of prediction labels; and
the training module is used for updating the parameters of the second model according to the first loss function, so as to obtain a trained multi-task AI large model.
21. The apparatus of claim 20, wherein the loss function determination module comprises:
a scene determination submodule, configured to determine a target output network and other output networks from a plurality of output networks of the second model according to a scene identifier of the sample image; wherein the scene identification is associated with the target output network;
the first determining submodule is used for determining a second loss function according to the real label of the sample image and the prediction label generated by the target output network;
a second determining submodule, configured to determine a third loss function according to the real label of the sample image, the prediction labels generated by the other output networks, and the task mask;
and the summarizing sub-module is used for determining a first loss function according to the second loss function and the third loss function.
22. The apparatus of claim 21, wherein the first determination submodule is to:
determining task weights according to the scene identifications;
and determining a second loss function according to the real label of the sample image, the prediction label generated by the target output network and the task weight.
23. The apparatus of claim 21, wherein the first determination submodule is to:
determining a subtask weight of each subtask of a plurality of subtasks in the case that the target output network comprises the subtasks;
determining a fourth loss function according to the real label of the sample image and the prediction label generated by the target output network for each of the plurality of subtasks;
and determining a second loss function according to the fourth loss function and the corresponding subtask weight.
24. The apparatus of claim 21, wherein the training module is to:
and updating parameters of the part except the other output networks in the second model according to the first loss function to obtain a trained multi-task AI large model.
25. An electronic device, comprising:
at least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-12.
CN202311402511.1A 2023-10-26 2023-10-26 Target detection method and model training method based on multi-task AI large model Pending CN117611795A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311402511.1A CN117611795A (en) 2023-10-26 2023-10-26 Target detection method and model training method based on multi-task AI large model

Publications (1)

Publication Number Publication Date
CN117611795A true CN117611795A (en) 2024-02-27


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination