CN118334604A - Accident detection and data set construction method and equipment based on multi-mode large model - Google Patents

Accident detection and data set construction method and equipment based on multi-mode large model

Info

Publication number
CN118334604A
Authority
CN
China
Prior art keywords
target
model
region
accident
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410749810.0A
Other languages
Chinese (zh)
Inventor
刘微
赵长福
郑维学
张建安
鞠全永
赵越
陈维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hisense Group Holding Co Ltd
Original Assignee
Hisense Group Holding Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hisense Group Holding Co Ltd filed Critical Hisense Group Holding Co Ltd
Priority to CN202410749810.0A priority Critical patent/CN118334604A/en
Publication of CN118334604A publication Critical patent/CN118334604A/en
Pending legal-status Critical Current

Landscapes

  • Traffic Control Systems (AREA)

Abstract

The application relates to the technical field of image processing, in particular to an accident detection and data set construction method and device based on a multi-mode large model. In the embodiment of the application, the electronic device uses an accident detection small model to determine the target confidence coefficient of a traffic accident in the target image to be detected; if the target confidence coefficient belongs to a preset confidence coefficient range, it is determined that the accident detection small model cannot accurately determine whether a traffic accident exists in the target image, and the electronic device then detects and identifies the target image through the multi-mode large model. In the embodiment of the application, the detection of the traffic accident is realized by combining the large and small models, and the accident detection precision is improved.

Description

Accident detection and data set construction method and equipment based on multi-mode large model
Technical Field
The application relates to the technical field of image processing, in particular to an accident detection and data set construction method and device based on a multi-mode large model.
Background
With the development of the transportation industry, the number of motor vehicles in use continues to increase, and people's modes of travel have changed dramatically. Traffic safety has become a major public concern, and how to quickly detect vehicle-vehicle accidents, single-vehicle accidents and vehicle-pedestrian accidents on expressways and urban roads is a problem to be solved.
At the present stage, traffic accidents are mainly detected by capturing accident vehicles with road traffic cameras, using algorithms that follow a detection, tracking and post-processing pipeline. However, accident scenes are diverse and accident types vary; if the various accidents are defined only manually, detection accuracy is poor, the process is time-consuming and labor-intensive, and the requirements of wide-ranging scenes cannot be met.
Disclosure of Invention
The application provides an accident detection method and device based on a multi-mode large model, which are used to solve the problems in the prior art that accident detection accuracy is poor, detection is time-consuming and labor-intensive, and the requirements of wide-ranging scenes cannot be met.
In a first aspect, an embodiment of the present application provides an accident detection method based on a multi-mode large model, where the method includes:
Inputting a target image to be detected and a preset number of other images adjacent to the acquisition time of the target image into an accident detection small model, and acquiring the target confidence coefficient of the traffic accident in the target image output by the accident detection small model;
if the target confidence coefficient belongs to a preset confidence coefficient range, inputting the target image into a multi-mode large model for detecting traffic accidents, and acquiring a detection result of whether the traffic accidents exist in the target image output by the multi-mode large model.
In a second aspect, an embodiment of the present application further provides a data set construction method for fine-tuning training any one of the foregoing multi-modal large models, where the method includes:
Inputting a sample image into a first multi-mode large model, and acquiring an overall description corresponding to the sample image output by the first multi-mode large model;
Inputting the sample image into a second multi-mode large model, and obtaining the sample image which is output by the second multi-mode large model and is marked with a detection frame corresponding to each region and a region description corresponding to each region;
determining a target description of the sample image according to the overall description and the region description corresponding to each region;
And acquiring input accident question and answer information determined based on the target description, and correspondingly storing the sample image and the accident question and answer information.
In a third aspect, an embodiment of the present application further provides an electronic device, where the electronic device includes a processor configured, when executing a computer program stored in a memory, to implement the steps of the accident detection method based on a multi-mode large model as described above or of the data set construction method as described above.
In the embodiment of the application, the electronic equipment inputs a target image to be detected and a preset number of other images adjacent to the acquisition time of the target image into an accident detection small model, and obtains the target confidence coefficient of the traffic accident in the target image output by the accident detection small model; if the target confidence coefficient belongs to the preset confidence coefficient range, the target image is input into a multi-mode large model for detecting traffic accidents, and the detection result of whether a traffic accident exists in the target image output by the multi-mode large model is obtained. In the embodiment of the application, the detection of the traffic accident is realized by combining the large and small models, and the accident detection precision is improved.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an accident detection process based on a multi-mode large model according to an embodiment of the present application;
FIG. 2 is a schematic diagram comparing a conventional convolution and a hole convolution according to an embodiment of the present application;
FIG. 3 is a flowchart of an application of a visual encoder according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a data set construction process according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a sample image according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a sample image provided by an embodiment of the present application and output through a second multi-modal large model;
FIG. 7 is a schematic diagram of a data set construction flow provided in an embodiment of the present application;
FIG. 8 is a flow chart of a traffic accident detection method with joint large-and-small-model decision provided by an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an accident detection device based on a multi-mode large model according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a data set constructing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be described in further detail below with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
In order to improve the accident detection precision, the embodiment of the application provides an accident detection method and equipment based on a multi-mode large model.
In the embodiment of the application, a target image to be detected and a preset number of other images adjacent to the acquisition time of the target image are input into an accident detection small model, and the target confidence coefficient of the traffic accident in the target image output by the accident detection small model is obtained; if the target confidence coefficient belongs to the preset confidence coefficient range, inputting the target image into a multi-mode large model for detecting traffic accidents, and acquiring a detection result of whether the traffic accidents exist in the target image output by the multi-mode large model.
Fig. 1 is a schematic diagram of an accident detection process based on a multi-mode large model according to an embodiment of the present application, where the process includes:
S101: inputting a target image to be detected and a preset number of other images adjacent to the acquisition time of the target image into an accident detection small model, and obtaining the target confidence coefficient of the traffic accident in the target image output by the accident detection small model.
The accident detection method based on the multi-mode large model is applied to electronic equipment, and the electronic equipment can be a PC or a server.
In traffic accident detection methods based on small models, vehicles in the images are first detected; the position of each vehicle in the next frame is then predicted with a Kalman filtering algorithm, and the optimal matching of target vehicles is obtained through the Hungarian matching algorithm, so that the vehicles are tracked. When a vehicle is involved in a traffic accident, the accident vehicle stops moving; whether a traffic accident has occurred is then judged by combining the positions and states of the vehicles with manual inspection. However, because traffic accidents are various and their circumstances differ, accident discrimination accuracy is poor, and the detection, tracking and business-logic approach is time-consuming and labor-intensive.
Because the multi-mode large model has a strong ability to understand images, whether a traffic accident occurs in an image can be obtained directly in an instruction question-answering manner with high detection accuracy; however, the multi-mode large model has a large number of parameters, occupies substantial resources and takes a long time to run, and therefore cannot process every frame of image in real time.
Based on the above, the embodiment of the application provides a method for detecting traffic accidents based on the combination of a large and a small model. The electronic device first uses the small model to determine the target confidence coefficient of a traffic accident in the target image to be detected, and then, according to the target confidence coefficient, further determines whether accident detection with the multi-mode large model is required.
Specifically, in the embodiment of the application, the electronic equipment collects the video stream, decodes the video stream, and performs picture frame extraction according to the set algorithm running frame rate to obtain a plurality of images. The electronic device may determine, from the plurality of video frames, a target image to be detected according to the input, and determine a preset number of other images of the plurality of video frames adjacent to the acquisition time of the target image. The other images are used for tracking the targets in the target image and determining the motion trail of the targets in the target image.
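For illustration only, a minimal sketch of this frame-extraction step is given below, assuming an OpenCV-style capture interface; the algorithm frame rate and function names are illustrative assumptions, not the implementation of the application.

```python
import cv2

def sample_frames(video_path: str, algo_fps: float = 5.0):
    """Decode a video stream and keep frames at the set algorithm running frame rate."""
    cap = cv2.VideoCapture(video_path)
    src_fps = cap.get(cv2.CAP_PROP_FPS) or algo_fps  # fall back if FPS is unknown
    step = max(int(round(src_fps / algo_fps)), 1)    # keep every `step`-th frame
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames

# The target image is chosen from `frames`; its temporal neighbours serve as
# the preset number of "other images" used for tracking.
```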
The electronic equipment inputs the target image and a preset number of other images into the accident detection small model; the accident detection small model performs target detection on the target image, determines each target contained in the target image, determines the motion trail corresponding to each target according to the other images, and determines the target confidence coefficient of the traffic accident in the target image according to the motion trail of each target.
In the embodiment of the present application, each target included in the target image may be a person, a car, or the like.
S102: if the target confidence coefficient belongs to a preset confidence coefficient range, inputting the target image into a multi-mode large model for detecting traffic accidents, and acquiring a detection result of whether the traffic accidents exist in the target image output by the multi-mode large model.
In the embodiment of the application, the electronic equipment can determine whether the traffic accident exists in the target image according to the target confidence level output by the accident detection small model. Specifically, a confidence coefficient range is stored in the electronic device, if the electronic device determines that the target confidence coefficient is smaller than the minimum value of the confidence coefficient range, the electronic device determines that no traffic accident exists in the target image, and if the electronic device determines that the target confidence coefficient is larger than the maximum value of the confidence coefficient range, the electronic device determines that the traffic accident exists in the target image.
If the electronic device determines that the target confidence coefficient output by the accident detection small model belongs to the preset confidence coefficient range, the electronic device determines that the accident detection small model cannot accurately identify whether a traffic accident exists in the target image. Based on this, the electronic device further detects the target image with the multi-mode large model for detecting traffic accidents.
Specifically, in the embodiment of the application, if the electronic device determines that the target confidence coefficient output by the accident detection small model is in the preset confidence coefficient range, the electronic device inputs the target image into the multi-mode large model for detecting the traffic accident, the multi-mode large model identifies the target image, determines whether the traffic accident exists in the target image, and the multi-mode large model outputs the detection result of whether the traffic accident exists in the target image. And the electronic equipment acquires a detection result output by the multi-mode large model.
In the embodiment of the application, the electronic equipment inputs a target image to be detected and a preset number of other images adjacent to the acquisition time of the target image into an accident detection small model, and obtains the target confidence coefficient of the traffic accident in the target image output by the accident detection small model; if the target confidence coefficient belongs to the preset confidence coefficient range, the target image is input into a multi-mode large model for detecting traffic accidents, and the detection result of whether a traffic accident exists in the target image output by the multi-mode large model is obtained. In the embodiment of the application, the detection of the traffic accident is realized by combining the large and small models, and the accuracy of traffic accident detection is improved.
In the embodiment of the application, for pictures of the same traffic accident scene, accident events can be detected even in different running environments, which conforms to the reproducibility characteristic of trustworthiness; in addition, in the embodiment of the application, the user can further control the strictness of accident detection by setting the hyper-parameter (the confidence coefficient range), so that the user can influence how much importance is attached to the event detection result and how it is adopted, which conforms to the controllability characteristic of trustworthiness.
In order to realize the detection of traffic accidents and improve the accuracy of the detection of traffic accidents, in the embodiment of the application, the multi-modal large model comprises a visual encoder, a position-aware visual language adapter and a language large model;
the step of obtaining the detection result of whether the traffic accident exists in the target image output by the multi-mode large model comprises the following steps:
The visual encoder encodes the target image and determines a feature code corresponding to the target image;
the position-aware visual language adapter compresses the feature codes and determines the compressed feature codes;
And the language big model determines and outputs a detection result of whether the traffic accident exists in the target image according to the compressed feature codes.
In an embodiment of the application, the multimodal big model includes a visual encoder, a location aware visual language adapter, and a language big model. The visual encoder is used for cutting and feature coding an input target image, the position-aware visual language adapter is used for feature alignment, compressing the feature coding output by the visual encoder, and the language big model is used for outputting a detection result of whether traffic accidents exist in the target image according to the compressed feature coding.
For example, the multi-mode large model may be the Qwen-VL model, a Transformer-based multi-mode large model whose network structure consists of three parts: the first part is a visual encoder, which uses ViT-bigG as its pre-trained initialization and is used for cropping the input image and performing feature encoding; the second part is a feature alignment module called the position-aware visual language adapter, comprising a single-layer cross-attention module, which is mainly used for compressing the encoded image features; the third part is a language big model, which uses Qwen-7B as its pre-trained initialization to provide language generation capability for the multi-mode large model.
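A minimal PyTorch sketch of this three-part structure is shown below; every module, dimension and name here is an illustrative stand-in under stated assumptions, not the actual Qwen-VL implementation.

```python
import torch
import torch.nn as nn

class MultiModalDetectorSketch(nn.Module):
    """Three parts: visual encoder -> position-aware visual language adapter
    (a single cross-attention layer with learnable queries) -> language model."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vis_dim: int = 1664, llm_dim: int = 4096, n_query: int = 256):
        super().__init__()
        self.vision_encoder = vision_encoder               # e.g. a ViT-bigG-like backbone
        self.proj = nn.Linear(vis_dim, llm_dim)
        self.queries = nn.Parameter(torch.randn(n_query, llm_dim))  # learnable queries
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
        self.llm = llm                                     # e.g. a Qwen-7B-like decoder

    def forward(self, image: torch.Tensor, prompt_embeds: torch.Tensor):
        feats = self.proj(self.vision_encoder(image))      # (B, N_patches, llm_dim)
        q = self.queries.unsqueeze(0).expand(image.size(0), -1, -1)
        compressed, _ = self.cross_attn(q, feats, feats)   # compress to n_query tokens
        return self.llm(torch.cat([compressed, prompt_embeds], dim=1))
```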
In order to realize the detection of the traffic accident and improve the accuracy of the traffic accident detection, in the embodiments of the present application, the visual encoder includes a convolution layer, a linear projection layer and a Transformer layer;
the encoding the target image, and determining the feature codes corresponding to the target image comprises the following steps:
The convolution layer performs feature extraction on the target image with hole convolution, determines a feature map corresponding to the target image, and segments the feature map to obtain each sub-feature map; wherein the parameter of the hole convolution is a parameter learned in the model training process;
The linear projection layer sorts each sub-feature map according to the segmentation sequence of the feature map, and carries out linear coding on each sorted sub-feature map to determine the image features corresponding to the feature map;
and the Transformer layer performs feature extraction on the image features and determines the feature codes.
In an embodiment of the application, the visual encoder includes a convolution layer, a linear projection layer and a Transformer layer. A conventional convolution layer extracts features from the input image with ordinary convolution and then slices the image into small sub-images. This approach performs only a simple segmentation on a single image layer; yet the features in an image are rich and the details of targets at different positions differ, so such simple segmentation can break up the position information in the image.
Based on this, in the embodiment of the present application, feature extraction is performed on the target image with hole convolution; since hole convolution can enlarge the receptive field over the image, more image features can be fed into the linear projection layer.
Fig. 2 is a schematic diagram comparing conventional convolution with hole convolution. As shown in fig. 2, the gray part in (a) is an ordinary convolution kernel, while (b) and (c) are hole convolutions with different hole values; it can be seen that (a) captures the fewest image features, (b) captures more, and (c) captures the most.
In the embodiment of the application, the parameter of the hole convolution is its hole value. The conventional way to determine the hole value is to assign different probabilities to values of d manually, for example a probability of 0.24 for d equal to 1 and 0.17 for d equal to 2, with the probability gradually decreasing as d increases; however, determining the size of d manually involves a certain subjective factor, so the model cannot be optimized well. Based on this, the embodiment of the present application proposes to make the hole value of the hole convolution a learnable parameter, so that the network model automatically corrects the size of the hole value according to the error loss function and the optimizer.
Wherein the cross entropy loss function is as follows:

$L = -\left[y\log p + (1-y)\log(1-p)\right]$

wherein $y$ is the real detection result of the sample image, and $p$ is the predicted detection result.
The Adam optimizer is as follows:

$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$, $\quad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$

$\hat{m}_t = m_t/(1-\beta_1^t)$, $\quad \hat{v}_t = v_t/(1-\beta_2^t)$, $\quad \Delta\theta_t = -\eta\,\hat{m}_t/(\sqrt{\hat{v}_t}+\epsilon)$

wherein $m_t$ is the first-order momentum, $v_t$ is the second-order momentum, $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected averages ($g_t$ denotes the gradient at step $t$), and $\Delta\theta_t$ is the weight update.
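The sketch below illustrates how a learnable hole value could be trained with the cross-entropy loss and Adam optimizer above. Standard convolution requires an integer dilation, so the sketch rounds a continuous parameter and re-scales the output so that gradients still reach it; this is an illustrative approximation, not necessarily the mechanism of the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableHoleConv(nn.Module):
    """Hole convolution whose hole value d is a learnable parameter. conv2d
    needs an integer dilation, so the continuous d is rounded at forward time
    and the output is re-scaled by d / round(d) so that the loss gradient can
    still flow back to d (a straight-through-style approximation)."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.01)
        self.d = nn.Parameter(torch.tensor(2.0))   # learnable hole value

    def forward(self, x):
        d_round = self.d.detach().round().clamp(1, 8)
        pad = int(d_round) * (self.weight.shape[-1] // 2)   # keep spatial size
        out = F.conv2d(x, self.weight, dilation=int(d_round), padding=pad)
        return out * (self.d / d_round)            # value ~1, carries grad to d

model = LearnableHoleConv(3, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam as above
x = torch.randn(2, 3, 64, 64)
y = torch.randint(0, 2, (2,)).float()              # real detection results
logit = model(x).mean(dim=(1, 2, 3))               # toy per-image prediction
loss = F.binary_cross_entropy_with_logits(logit, y)  # cross-entropy as above
optimizer.zero_grad()
loss.backward()
optimizer.step()
```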
Based on the above, in the embodiment of the application, after the training of the visual encoder is completed, in the actual application process the convolution layer of the visual encoder performs feature extraction on the target image with hole convolution, determines the feature map corresponding to the target image, and segments the feature map to obtain the sub-feature maps; the parameter of the hole convolution is the hole value learned in the model training process. The linear projection layer of the visual encoder sorts the sub-feature maps according to the segmentation order of the feature map and linearly encodes the sorted sub-feature maps to determine the image features corresponding to the feature map; the Transformer layer of the visual encoder performs feature extraction on the image features to determine the feature codes.
Fig. 3 is an application flowchart of a visual encoder according to an embodiment of the present application, where, as shown in fig. 3, the process includes:
Step 1: a target image, a 1920×1080, 3-channel color image, is input, and the hole value d of the hole convolution is initialized.
Step 2: a hole convolution operation is performed on the target image to obtain a feature map of size 224×224; the feature map is divided along its length and width into 196 image blocks of size 16×16, which are arranged into a sequence in which each image block is a 768-dimensional vector and input into the linear projection layer to obtain the features of each image block.
Step 3: a learnable class token of size 1×768 is introduced; meanwhile, a learnable position code of size 1×768 is added element-wise to the linearly encoded image features, which are then concatenated with the class token and input into the Transformer layer.
Step 4: the Transformer layer consists of multiple encoding layers (L layers in total), each composed of a multi-head attention layer and a multi-layer perceptron. The 197×768-dimensional data obtained in step 3 is normalized and then input into the multi-head attention for calculation, yielding feature codes of dimension 197×768. These feature codes are normalized and input into the multi-layer perceptron layer, which first expands them 4 times to 197×3072 and then reduces them through a linear projection layer back to dimension 197×768; the resulting feature codes are output.
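Steps 2-4 can be sketched in PyTorch as follows; for simplicity the sketch adds a single learnable position code to all 197 tokens, and any hyper-parameter beyond the dimensions stated above (layer count, head count) is an assumption.

```python
import torch
import torch.nn as nn

class VisualEncoderSketch(nn.Module):
    """224x224 image -> 196 patches of 16x16 -> 768-d tokens, plus a class
    token and position code -> Transformer encoder (steps 2-4 above)."""

    def __init__(self, dim: int = 768, n_layers: int = 12, n_heads: int = 12):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # linear projection
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # class token
        self.pos_embed = nn.Parameter(torch.zeros(1, 197, dim))          # position code
        layer = nn.TransformerEncoderLayer(dim, n_heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                                        # x: (B, 3, 224, 224)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, 196, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # (B, 197, 768)
        return self.encoder(tokens)                              # feature codes

feature_codes = VisualEncoderSketch()(torch.randn(1, 3, 224, 224))
print(feature_codes.shape)   # torch.Size([1, 197, 768])
```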
In order to realize the detection of traffic accidents and improve the accuracy of the detection of the traffic accidents, in the embodiment of the application, the small accident detection model comprises a target detection sub-model, a target tracking sub-model and an accident detection sub-model;
the step of obtaining the target confidence coefficient of the traffic accident in the target image output by the accident detection small model comprises the following steps:
the target detection sub-model respectively carries out target detection on the target image and the other images, determines each target contained in the target image and the other images, and marks each target in the target image and the other images; wherein the target is a vehicle or a pedestrian;
The target tracking sub-model determines the motion trail of each target according to the marked target image and other images;
And the accident detection sub-model carries out accident detection according to the motion track of each target, and determines and outputs the target confidence coefficient of the traffic accident in the target image.
In the embodiment of the application, the accident detection small model is integrated by a plurality of sub-models, and at least comprises a target detection sub-model, a target tracking sub-model and an accident detection sub-model. The object detection sub-model is used for identifying each object contained in the object image and other images, the object tracking sub-model is used for determining the motion track of each object, and the accident detection sub-model is used for determining the object confidence level of the traffic accident in the object image according to the motion track.
Specifically, in the embodiment of the application, the object detection sub-model respectively detects objects of the object image and other images, determines each object contained in the object image and other images, and marks each object in the object image and other images. And the target tracking sub-model determines the motion trail of each target according to the marked target image and other images. And the accident detection sub-model carries out accident detection according to the motion track of each target, and determines and outputs the target confidence coefficient of the traffic accident in the target image.
The object detection sub-model may be a model using the YOLOv5s algorithm, and the object tracking sub-model may be a model using the DeepSORT tracking algorithm.
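A minimal sketch of this detect-track-score pipeline is shown below; `detector`, `tracker` and `accident_rules` are hypothetical stand-ins for the YOLOv5s, DeepSORT and rule-scoring components, not real library APIs.

```python
from typing import Dict, List

def small_model_confidence(target_image, other_images,
                           detector, tracker, accident_rules) -> float:
    """Detect targets in each frame, track them across frames, then score the
    trajectories for accident evidence (see conditions 1-5 below)."""
    frames = other_images + [target_image]        # ordered by acquisition time
    tracks: Dict[int, List] = {}                  # track id -> trajectory of boxes
    for frame in frames:
        detections = detector(frame)              # boxes for vehicles and pedestrians
        for track_id, box in tracker.update(detections):
            tracks.setdefault(track_id, []).append(box)
    return accident_rules(tracks)                 # target confidence in [0, 1]
```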
In the embodiment of the application, the accident detection sub-model can determine the target confidence coefficient of the traffic accident in the target image according to the following conditions:
Condition 1: vehicles occlude each other, and the occlusion duration exceeds a 3-minute time threshold;
Condition 2: the rear vehicle is blocked and remains stationary for more than 120 seconds;
Condition 3: vehicles behind detour slowly around from the rear;
Condition 4: the doors of the two accident vehicles are open and people get off;
Condition 5: a person stands near the vehicle for more than 60 seconds.
The confidence coefficient corresponding to each condition is 0.2. The accident detection sub-model determines, according to the motion trail of each target, the number of conditions satisfied by the target image, and determines the target confidence coefficient of the traffic accident in the target image according to the number of satisfied conditions and the confidence coefficient corresponding to each condition.
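For illustration, the per-condition scoring could look like the sketch below, where each flag is assumed to have been computed from the motion trails; the flag names are hypothetical.

```python
def accident_confidence(flags: dict) -> float:
    """Each satisfied condition contributes 0.2 to the target confidence."""
    condition_keys = [
        "mutual_occlusion_over_3min",       # condition 1
        "rear_vehicle_still_over_120s",     # condition 2
        "rear_vehicles_bypass_slowly",      # condition 3
        "doors_open_people_exit",           # condition 4
        "person_near_vehicle_over_60s",     # condition 5
    ]
    return 0.2 * sum(bool(flags.get(key)) for key in condition_keys)

print(accident_confidence({"mutual_occlusion_over_3min": True,
                           "rear_vehicle_still_over_120s": True}))  # 0.4
```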
In order to realize the detection of the traffic accident and improve the accuracy of the traffic accident detection, on the basis of the above embodiments, in the embodiments of the present application, the method further includes:
if the target confidence coefficient exceeds the maximum value of the confidence coefficient range, determining that a traffic accident exists in the target image, and carrying out accident alarm;
if the target confidence coefficient is smaller than the minimum value of the confidence coefficient range, determining that no traffic accident exists in the target image, and not carrying out accident alarming.
In the embodiment of the application, a confidence coefficient range is pre-stored in the electronic device. If the confidence coefficient output by the accident detection small model falls within the confidence coefficient range, the electronic device determines that the small model cannot accurately identify whether a traffic accident exists in the target image, and the electronic device inputs the target image into the multi-mode large model for further prediction.
On the basis, if the target confidence coefficient output by the accident detection small model exceeds the maximum value of the confidence coefficient range, the electronic equipment determines that the traffic accident exists in the target image, and the electronic equipment alarms the accident; if the target confidence coefficient output by the accident detection small model is smaller than the minimum value of the confidence coefficient range, the electronic equipment determines that no traffic accident exists in the target image, and the electronic equipment does not alarm the accident.
Generally, the confidence coefficient range may be 0.3-0.7: if the confidence coefficient is greater than 0.7, a traffic accident can be directly determined; if it is greater than 0.3 and less than 0.7, the target image is input into the multi-mode large model for decision; and if it is less than 0.3, it is considered that no traffic accident has occurred.
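Putting the thresholds together, the decision routing between the small and the large model might be sketched as follows; the 0.3/0.7 bounds follow the example above, and `small_model` / `large_model` are placeholders.

```python
def detect_accident(target_image, other_images, small_model, large_model,
                    low: float = 0.3, high: float = 0.7) -> bool:
    confidence = small_model(target_image, other_images)
    if confidence > high:
        return True        # traffic accident: alarm directly
    if confidence < low:
        return False       # no traffic accident
    # ambiguous confidence: defer to the multi-mode large model
    answer = large_model(target_image,
                         prompt="Is there a traffic accident in the image? Answer yes or no.")
    return answer.strip().lower().startswith("yes")
```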
In order to realize the detection of traffic accidents and improve the accuracy of the detection of the traffic accidents, the embodiment of the application also provides a data set construction method based on the above embodiments, which is used for training the multi-mode large model in the above embodiments. Fig. 4 is a schematic diagram of a data set construction process according to an embodiment of the present application, where the process includes:
s401: and inputting the sample image into a first multi-mode large model, and acquiring the integral description corresponding to the sample image output by the first multi-mode large model.
The data set construction method provided by the embodiment of the application is applied to electronic equipment, and the electronic equipment can be a PC or a server.
In the related art, two main approaches are used to construct data sets: manual labeling and large-model labeling. In manual labeling, answers are given by hand; but traffic accidents are too various to enumerate, and when several workers label, their different understanding of the accidents in the images leads to poor consistency of expression, so manual labeling is time-consuming, labor-intensive and inconsistent. In data annotation based on large models, answers are given directly by large models such as GPT-4; however, during fine-tuning training, the model to be fine-tuned cannot understand why a traffic accident has occurred.
Based on this, in order to better label the images, obtain a data set with more accurate labels and perform fine-tuning training of the multi-mode large model for accident detection with this data set, the embodiment of the application combines the image-understanding capability of the multi-mode large model with small-model detection: the images are labeled at multiple levels and across multiple images, descriptions with a consistent style of expression are generated, a chain of thought is introduced into the data labels so that the large model understands human intention through step-by-step prompting and answering, and the results are finally verified manually or spot-checked.
Specifically, in the embodiment of the application, the electronic device inputs the sample image into the first multi-mode large model, so that the first multi-mode large model recognizes the sample image as a whole and outputs the overall description. The electronic device obtains the overall description corresponding to the sample image output by the first multi-mode large model.
Fig. 5 is a schematic diagram of a sample image provided by an embodiment of the present application. As shown in fig. 5, the electronic device inputs the sample image into the first multi-mode large model, and the overall description output by the first multi-mode large model is "many vehicles are on the roads of a city".
S402: and inputting the sample image into a second multi-mode large model, and acquiring the sample image which is output by the second multi-mode large model and is marked with a detection frame corresponding to each region and the region description corresponding to each region.
In the embodiment of the application, the electronic equipment inputs the sample image into a second multi-mode large model, and the second multi-mode large model detects each region in the sample image and generates a corresponding region description for each region.
In the embodiment of the application, the regions in the sample image can be automatically identified by the second multi-mode large model; alternatively, each region in the sample image can be identified and marked by a small model, the marked sample image is input into the second multi-mode large model, and the second multi-mode large model outputs the region description corresponding to each region.
Fig. 6 is a schematic diagram of a sample image provided by an embodiment of the present application and the output obtained through the second multi-mode large model. After the sample image is input into the second multi-mode large model, the second multi-mode large model determines each region contained in the sample image, and the region description corresponding to each region is marked in the sample image. The region descriptions include: "the vehicle is white", "a man walks on the road", "a silvery vehicle stops on the road" and "a person stands on the road", and the positions of the region boxes are: [806,599,961,796], [759,644,828,848], [872,762,1242,1032] and [1089,867,1182,1121].
S403: determining a target description of the sample image according to the overall description and the region description corresponding to each region; and acquiring input accident question and answer information determined based on the target description, and correspondingly storing the sample image and the accident question and answer information.
In the embodiment of the application, the electronic equipment determines the target description of the sample image according to the integral description of the sample image, the sample image of the detection frame corresponding to each marked area in the sample image and the area description corresponding to each area.
In the embodiment of the application, the electronic equipment can determine the target description corresponding to the sample image through the language big model. Specifically, the electronic device inputs the overall description, the region description corresponding to each region, and a prompt word for prompting the large model to generate detailed description based on the overall description and the region description into a language large model, and the language large model generates and outputs a target description for the sample image according to the overall description and the region description corresponding to each region.
In the embodiment of the application, after the electronic device determines the target description of the sample image, the electronic device displays the target description; technicians then determine accident question-answer information in the form of a chain of thought based on the target description, and the sample image and the accident question-answer information are stored correspondingly.
On the basis of fig. 6, the electronic device obtains accident question-answer information as follows:
1. Is any vehicle damaged?
Answer: In the figure, the silvery-white sedan is close to the gray minibus, but no obvious damage can be seen.
2. Has anyone fallen down?
Answer: No one is found to have fallen.
3. Did a driver or passenger get off?
Answer: Two people in the figure are looking around the vehicle, apparently checking it for damage.
4. Based on the above description, analyze whether a traffic accident exists in the image.
Answer: Because the two vehicles are close together on a road covered by snow, people are looking around the vehicles, and other vehicles on the road tend to avoid them, it is considered that a traffic accident exists.
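A sketch of how the sample image and such chain-of-thought question-answer information might be stored together as one instruction record is given below; the JSON schema, file names and wording are illustrative assumptions.

```python
import json

record = {
    "image": "samples/accident_0001.jpg",   # hypothetical path to the sample image
    "target_description": "Snow-covered two-way city road; two vehicles in a "
                          "rear-end collision with people observing nearby.",
    "qa": [
        {"q": "Is any vehicle damaged?",
         "a": "The silvery-white sedan is close to the gray minibus, but no obvious damage can be seen."},
        {"q": "Has anyone fallen down?",
         "a": "No one is found to have fallen."},
        {"q": "Did a driver or passenger get off?",
         "a": "Two people are looking around the vehicle, apparently checking it for damage."},
        {"q": "Based on the description, is there a traffic accident?",
         "a": "The two vehicles are close together on a snow-covered road and people "
              "stand around them, so a traffic accident is considered to exist."},
    ],
}

with open("accident_instruction_set.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```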
In addition, in the embodiment of the application, for the data set containing the labeled sample images, part of the sample images and the corresponding target descriptions are checked by manual verification or spot checking, and the accuracy is determined; if the accuracy reaches the preset requirement, the data set is used for fine-tuning training of the multi-mode large model, and if it does not, the sample images are labeled again.
On this basis, the electronic device can also check part of the sample images and the corresponding descriptions by manual verification or spot checking; if a spot-checked target description is inconsistent with the content of the corresponding sample image, the sample image is labeled again.
In the embodiment of the application, based on multi-scale feature fusion, a chain of thought and instruction labeling, multiple images are labeled at multiple levels, descriptions with a consistent style of expression are generated, and the chain of thought is introduced into the data labels so that the multi-mode large model understands human intention through step-by-step prompting and answering.
In order to improve the accuracy of labeling of a dataset, in the embodiments of the present application, before determining the target description of the sample image according to the overall description and the region description corresponding to each region, the method further includes:
Inputting the sample image marked with the detection frame corresponding to each region and the region description corresponding to each region into the first multi-mode large model, and obtaining the similarity score of each region output by the first multi-mode large model and the corresponding region description;
And deleting the region with the similarity score lower than the preset threshold value and the corresponding region description.
In the embodiment of the application, the electronic equipment can also screen the determined region description corresponding to each region through the first multi-mode large model, and reject the region description with low matching degree.
Specifically, the electronic device inputs the sample image marked with the detection frame corresponding to each region and the region description corresponding to each region into the first multi-mode large model, obtains the similarity score of each region and the corresponding region description output by the first multi-mode large model, and deletes the region with the similarity score lower than the preset threshold and the corresponding region description.
On the basis of fig. 6, the first multi-mode large model determines that the similarity score of the region description "the vehicle is white" is 0.95, the similarity score of the region description "a man walks on the road" is 0.7, the similarity score of the region description "a silvery vehicle stops on the road" is 0.8, and the similarity score of the region description "a person stands on the road" is 0.4. The preset threshold stored in the electronic device is 0.6, so the electronic device deletes the region description "a person stands on the road".
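This filtering step can be sketched as follows; `score_similarity` stands in for the similarity-scoring call of the first multi-mode large model.

```python
def filter_regions(regions, score_similarity, threshold: float = 0.6):
    """Keep only (box, description) pairs whose similarity score reaches the threshold."""
    kept = []
    for box, description in regions:
        if score_similarity(box, description) >= threshold:
            kept.append((box, description))
    return kept

# With the scores from the example above (0.95, 0.7, 0.8, 0.4) and threshold
# 0.6, "a person stands on the road" would be deleted.
```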
In order to improve the accuracy of labeling of a dataset, in the embodiments of the present application, determining, according to the overall description and the region description corresponding to each region, the target description of the sample image includes:
And inputting the integral description and the region description corresponding to each region into a language big model, and acquiring the target description of the sample image output by the language big model.
In the embodiment of the application, the electronic equipment can determine the target description corresponding to the sample image through the language big model. Specifically, the electronic device inputs the overall description, the region description corresponding to each region, and a prompt word for prompting the large model to generate detailed description based on the overall description and the region description into a language large model, and the language large model generates and outputs a target description for the sample image according to the overall description and the region description corresponding to each region.
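For illustration, assembling the input to the language big model might look like the sketch below; the prompt wording is an assumption, not the exact prompt word of the application.

```python
def build_prompt(overall: str, regions) -> str:
    """Merge the overall description and region descriptions into one prompt
    that asks the language model for a detailed target description."""
    lines = [f"Overall description: {overall}", "Region descriptions:"]
    for box, description in regions:
        lines.append(f"- {description} at {box}")
    lines.append("Based on the overall and region descriptions above, "
                 "generate one detailed description of the image.")
    return "\n".join(lines)

prompt = build_prompt(
    "Many vehicles travel on urban roads.",
    [([806, 599, 961, 796], "the vehicle is white"),
     ([872, 762, 1242, 1032], "a silvery vehicle stops on the road")],
)
```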
In order to improve the accuracy of the labeling of the dataset, in the embodiments of the present application, the first multi-modal large model is a BLIP2 model, the second multi-modal large model is a GRIP model, and the language large model is a GPT-3.5 model.
In the embodiment of the application, the first multi-mode large model is a BLIP2 model, the second multi-mode large model is a GRIP model, and the language large model is a GPT-3.5 model.
On the basis of the foregoing embodiments, fig. 7 is a schematic diagram of a data set construction flow provided by an embodiment of the present application, where, as shown in fig. 7, the process includes:
step 1: and for a traffic accident image, carrying out global image description on the whole image by using a multi-mode large model BLIP2 to obtain overall description. For example: many vehicles travel on urban roads.
Step 2: each region in the image is detected with the multi-mode large model GRIP, and a region image description is generated for each region to obtain the region descriptions. For example, the region descriptions include: "the vehicle is white", "a man walks on the road", "a silvery vehicle stops on the road" and "a person stands on the road", and the positions of the region boxes are: [806,599,961,796], [759,644,828,848], [872,762,1242,1032] and [1089,867,1182,1121].
Step 3: BLIP2 is used to calculate the matching degree between each region box and its region description from step 2 to obtain a similarity score, and detection boxes and region descriptions with low confidence are removed via a manually set score threshold.
For example, the remaining region descriptions are "the vehicle is white" and "a man walks on the road".
Step 4: GPT-3.5 comprehensively considers the overall description from step 1 together with the filtered detection-box images and corresponding descriptions from steps 2 and 3, and generates a detailed image description (the target description) and chain-of-thought image questions and answers (the accident question-answer information) for the whole image.
For example, the target description is "a snow-covered two-way city road, on one lane of which two vehicles have had a rear-end collision, with people observing around the accident vehicles".
Step 5: after step 4, the GPT-3.5 model has understood the meaning of the scene in the image; the instruction data set is then constructed in the form of a chain of thought, including the detailed image description from steps 1-4 and accident question-answer information, for example: "Combining the surrounding environment, the positions of the vehicles, the situation between vehicles, vehicle damage, road congestion, the driver's condition and everything you see, answer the following questions:"
1. Is any vehicle damaged?
Answer: In the figure, the silvery-white sedan is close to the gray minibus, but no obvious damage can be seen.
2. Has anyone fallen down?
Answer: No one is found to have fallen.
3. Did a driver or passenger get off?
Answer: Two people in the figure are looking around the vehicle, apparently checking it for damage.
4. According to the description, is there a traffic accident in the image?
Answer: Because the two vehicles are close together on a road covered by snow, people are looking around the vehicles, and other vehicles on the road tend to avoid them, it is considered that a traffic accident exists.
Step 6: for the labeled data set, part of the sample images and the corresponding target descriptions are checked by manual verification or spot checking; if the labeled target descriptions are consistent with the content of the corresponding sample images, the data set is used for subsequent fine-tuning training of the multi-mode large model; if the requirements are not met, return to step 5 and label again.
On the basis of the above embodiments, fig. 8 is a flowchart of a traffic accident detection method with joint large-and-small-model decision according to an embodiment of the present application, where the process includes:
step 1: and obtaining a video stream, performing video decoding on the video stream, performing video frame extraction according to a set algorithm operation frame rate, and determining a target image and other images.
Step 2: the target image and the other images are input into the small traffic accident detection model, in which YOLOv5s is used for target detection and DeepSORT is used for target tracking, and the target confidence coefficient of a traffic accident in the target image is determined according to the traffic accident judgment criteria.
If the target confidence coefficient exceeds the maximum value of the confidence coefficient range, determining that a traffic accident exists in the target image, and carrying out traffic accident alarming; if the target confidence coefficient is smaller than the minimum value of the confidence coefficient range, determining that no accident exists in the target image, and not alarming the accident.
Step 3: if the target confidence coefficient belongs to the preset confidence coefficient range, it is not yet determined whether a traffic accident exists in the target image; the electronic device inputs the target image into the fine-tuned multi-mode large model and prompts it with a prefabricated question, such as: "Is a traffic accident occurring in the figure? Answer yes or no." The large model makes inferences based on the image and the prompt, and finally gives an answer such as: "Yes, a traffic accident occurs in the figure" or "No, no traffic accident occurs in the figure".
Step 4: and if the multi-mode large model detects the traffic accident, outputting an alarm event, and if the multi-mode large model does not detect the traffic accident, repeating the step 2 to continue the traffic accident monitoring.
Step 5: the model structure is optimized with a multi-instruction data set to obtain the fine-tuned multi-mode large model.
On the basis of the above embodiments, fig. 9 is a schematic structural diagram of an accident detection apparatus based on a multi-mode large model according to an embodiment of the present application, where the apparatus includes:
The small model detection module 901 is configured to input a target image to be detected and a preset number of other images adjacent to the acquisition time of the target image into an accident detection small model, and obtain a target confidence coefficient of a traffic accident in the target image output by the accident detection small model;
The large model detection module 902 is configured to input the target image into a multi-mode large model for detecting a traffic accident if the target confidence coefficient belongs to a preset confidence coefficient range, and obtain a detection result of whether the traffic accident exists in the target image output by the multi-mode large model.
In one possible implementation, the multi-mode large model includes a visual encoder, a position-aware visual language adapter, and a language big model;
The large model detection module 902 is specifically configured to encode the target image by using the visual encoder, and determine a feature code corresponding to the target image; the position-aware visual language adapter compresses the feature codes and determines the compressed feature codes; and the language big model determines and outputs a detection result of whether the traffic accident exists in the target image according to the compressed feature codes.
In one possible implementation, the visual encoder includes a convolution layer, a linear projection layer, and a Transformer layer;
The large model detection module 902 is specifically configured such that the convolution layer performs feature extraction on the target image with hole convolution, determines the feature map corresponding to the target image, and segments the feature map to obtain each sub-feature map, wherein the parameter of the hole convolution is a parameter learned in the model training process; the linear projection layer sorts each sub-feature map according to the segmentation order of the feature map and linearly encodes the sorted sub-feature maps to determine the image features corresponding to the feature map; and the Transformer layer performs feature extraction on the image features and determines the feature codes.
In one possible implementation, the accident detection small model includes a target detection sub-model, a target tracking sub-model, and an accident detection sub-model;
The small model detection module 901 is specifically configured to perform object detection on the object image and the other images by using the object detection sub-model, determine each object included in the object image and the other images, and mark each object in the object image and the other images; wherein the target is a vehicle or a pedestrian; the target tracking sub-model determines the motion trail of each target according to the marked target image and other images; and the accident detection sub-model carries out accident detection according to the motion track of each target, and determines and outputs the target confidence coefficient of the traffic accident in the target image.
In a possible implementation manner, the small model detection module 901 is further configured to determine that a traffic accident exists in the target image and perform an accident alarm if the target confidence coefficient exceeds a maximum value of the confidence coefficient range; if the target confidence coefficient is smaller than the minimum value of the confidence coefficient range, determining that no traffic accident exists in the target image, and not carrying out accident alarming.
On the basis of the foregoing embodiments, fig. 10 is a schematic structural diagram of a data set construction apparatus according to an embodiment of the present application, where the apparatus includes:
A processing module 1001, configured to input a sample image into a first multi-mode large model, and obtain an overall description corresponding to the sample image output by the first multi-mode large model; inputting the sample image into a second multi-mode large model, and obtaining the sample image which is output by the second multi-mode large model and is marked with a detection frame corresponding to each region and a region description corresponding to each region; determining a target description of the sample image according to the overall description and the region description corresponding to each region;
and a construction module 1002, configured to obtain input accident question and answer information determined based on the target description, and store the sample image and the accident question and answer information correspondingly.
In a possible implementation manner, the processing module 1001 is further configured to input the sample image marked with the detection frame corresponding to each region and the region description corresponding to each region into the first multi-mode large model, and obtain a similarity score of each region output by the first multi-mode large model and the corresponding region description; and deleting the region with the similarity score lower than the preset threshold value and the corresponding region description.
In a possible implementation manner, the processing module 1001 is specifically configured to input the sample image marked with the detection frame corresponding to each region, the overall description, and the region description corresponding to each region into a language big model, and obtain a target description of the sample image output by the language big model.
In one possible implementation, the first multi-modal large model is a BLIP2 model, the second multi-modal large model is a GRIP model, and the language large model is a GPT-3.5 model.
On the basis of the foregoing embodiments, the embodiment of the present application further provides an electronic device, and fig. 11 is a schematic structural diagram of the electronic device provided by the embodiment of the present application, as shown in fig. 11, including: the device comprises a processor 1101, a communication interface 1102, a memory 1103 and a communication bus 1104, wherein the processor 1101, the communication interface 1102 and the memory 1103 are in communication with each other through the communication bus 1104;
The memory 1103 stores a computer program which, when executed by the processor 1101, causes the processor 1101 to execute the steps of the accident detection method based on the multi-mode large model or of the data set construction method provided by the above embodiments.
Because the principle of solving the problem of the electronic device is similar to that of the accident detection method or the data set construction method based on the multi-mode large model, the implementation of the electronic device can refer to the embodiment of the method, and the repetition is not repeated.

Claims (10)

1. An accident detection method based on a multi-mode large model, which is characterized by comprising the following steps:
inputting, into an accident detection small model, a target image to be detected and a preset number of other images whose acquisition times are adjacent to that of the target image, and obtaining a target confidence, output by the accident detection small model, that a traffic accident exists in the target image;
if the target confidence falls within a preset confidence range, inputting the target image into a multi-modal large model for traffic accident detection, and obtaining a detection result, output by the multi-modal large model, of whether a traffic accident exists in the target image.
2. The method of claim 1, wherein the multi-modal large model comprises a visual encoder, a position-aware visual language adapter, and a language large model;
wherein obtaining the detection result, output by the multi-modal large model, of whether a traffic accident exists in the target image comprises:
the visual encoder encodes the target image and determines feature codes corresponding to the target image;
the position-aware visual language adapter compresses the feature codes to obtain compressed feature codes;
and the language large model determines and outputs, according to the compressed feature codes, the detection result of whether a traffic accident exists in the target image.
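By way of illustration (not part of the claims), the three-stage forward pass of claim 2 can be sketched with hypothetical callables for the three components:

    # Visual encoder -> position-aware visual language adapter -> language model.
    def large_model_detect(target_image, visual_encoder, adapter, language_model):
        feature_codes = visual_encoder(target_image)   # per-patch feature codes
        compressed = adapter(feature_codes)            # fewer, position-aware tokens
        # The language large model answers from the compressed feature codes.
        return language_model(compressed,
                              question="Is there a traffic accident in this image?")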
3. The method of claim 2, wherein the visual encoder comprises a convolution layer, a linear projection layer, and a Transformer layer;
wherein encoding the target image and determining the feature codes corresponding to the target image comprises:
the convolution layer performs feature extraction on the target image by dilated (atrous) convolution, determines a feature map corresponding to the target image, and segments the feature map into sub-feature maps, wherein the parameters of the dilated convolution are learned during model training;
the linear projection layer orders the sub-feature maps according to the order in which the feature map was segmented, and linearly encodes the ordered sub-feature maps to determine the image features corresponding to the feature map;
and the Transformer layer performs feature extraction on the image features to determine the feature codes.
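A minimal PyTorch sketch of the encoder in claim 3; the channel dimension, dilation rate, patch size, and layer counts are illustrative assumptions, not values fixed by the application:

    import torch
    from torch import nn

    class VisualEncoder(nn.Module):
        def __init__(self, dim=256, patch=4, heads=8):
            super().__init__()
            # Dilated (atrous) convolution; its parameters are learned in training.
            self.conv = nn.Conv2d(3, dim, kernel_size=3, padding=2, dilation=2)
            self.proj = nn.Linear(dim * patch * patch, dim)  # linear projection layer
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                               batch_first=True)
            self.transformer = nn.TransformerEncoder(layer, num_layers=2)
            self.patch = patch

        def forward(self, image):                    # image: (B, 3, H, W)
            fmap = self.conv(image)                  # feature map, same H x W
            b, c, h, w = fmap.shape
            p = self.patch
            # Segment the feature map into sub-feature maps, keeping their order.
            tiles = fmap.unfold(2, p, p).unfold(3, p, p)
            tiles = tiles.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
            tokens = self.proj(tiles)                # image features, one per patch
            return self.transformer(tokens)          # feature codes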
4. The method of claim 1, wherein the accident detection small model comprises a target detection sub-model, a target tracking sub-model, and an accident detection sub-model;
wherein obtaining the target confidence, output by the accident detection small model, that a traffic accident exists in the target image comprises:
the target detection sub-model performs target detection on the target image and each of the other images, determines the targets contained in the target image and the other images, and marks each target in them, wherein each target is a vehicle or a pedestrian;
the target tracking sub-model determines a motion trajectory of each target according to the marked target image and the marked other images;
and the accident detection sub-model performs accident detection according to the motion trajectory of each target, and determines and outputs the target confidence that a traffic accident exists in the target image.
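Sketched with hypothetical sub-model interfaces, the small model of claim 4 chains three stages:

    # Target detection -> target tracking -> accident detection.
    def small_model_confidence(target_image, other_images,
                               detector, tracker, accident_scorer):
        frames = [target_image] + list(other_images)
        detections = [detector(f) for f in frames]  # vehicles and pedestrians per frame
        tracks = tracker(detections)                # one motion trajectory per target
        # Score the trajectories (sudden stops, box overlaps, ...) as the
        # target confidence that the target image shows a traffic accident.
        return accident_scorer(tracks)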
5. The method according to claim 1, wherein the method further comprises:
if the target confidence exceeds the maximum value of the confidence range, determining that a traffic accident exists in the target image and raising an accident alarm;
if the target confidence is below the minimum value of the confidence range, determining that no traffic accident exists in the target image and raising no alarm.
6. A data set construction method for fine-tuning the multi-modal large model according to any one of claims 1-5, the method comprising:
inputting a sample image into a first multi-modal large model, and obtaining an overall description, output by the first multi-modal large model, corresponding to the sample image;
inputting the sample image into a second multi-modal large model, and obtaining, as output by the second multi-modal large model, the sample image marked with a detection frame for each region and a region description corresponding to each region;
determining a target description of the sample image according to the overall description and the region description corresponding to each region;
and obtaining input accident question-and-answer information determined based on the target description, and storing the sample image in correspondence with the accident question-and-answer information.
7. The method of claim 6, wherein before determining the target description of the sample image according to the overall description and the region description corresponding to each region, the method further comprises:
inputting the sample image marked with the detection frame corresponding to each region, together with the region description corresponding to each region, into the first multi-modal large model, and obtaining a similarity score, output by the first multi-modal large model, between each region and its corresponding region description;
and deleting each region whose similarity score is lower than a preset threshold, together with its corresponding region description.
8. The method according to claim 6 or 7, wherein determining the target description of the sample image according to the overall description and the region description corresponding to each region comprises:
inputting the sample image marked with the detection frame corresponding to each region, the overall description, and the region description corresponding to each region into a language large model, and obtaining the target description of the sample image output by the language large model.
9. The method of claim 8, wherein the first multi-modal large model is a BLIP2 model, the second multi-modal large model is a GRIP model, and the language large model is a GPT-3.5 model.
10. An electronic device, comprising a processor configured to implement, when executing a computer program stored in a memory, the steps of the accident detection method based on a multi-modal large model according to any one of claims 1-5 or of the data set construction method according to any one of claims 6-9.
CN202410749810.0A 2024-06-12 2024-06-12 Accident detection and data set construction method and equipment based on multi-mode large model Pending CN118334604A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410749810.0A CN118334604A (en) 2024-06-12 2024-06-12 Accident detection and data set construction method and equipment based on multi-mode large model

Publications (1)

Publication Number Publication Date
CN118334604A true CN118334604A (en) 2024-07-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination