CN116778438B

CN116778438B - Illegal forklift detection method and system based on large language model

Info

Publication number: CN116778438B
Application number: CN202311034299.8A
Authority: CN
Inventors: 徐晓康; 沈钰峰; 华绿绿; 黄健鹏
Original assignee: Suzhou Zhedoshan Technology Co ltd; Shengze Town People's Government Of Wujiang District Suzhou City
Current assignee: Suzhou Zhedoshan Technology Co ltd; Shengze Town People's Government Of Wujiang District Suzhou City
Priority date: 2023-08-17
Filing date: 2023-08-17
Publication date: 2023-11-14
Anticipated expiration: 2043-08-17
Also published as: CN116778438A

Abstract

The invention provides a large language model-based illegal forklift detection method and system. The detection method comprises the following steps: S1, acquiring history pictures, and performing data processing and classification on the history pictures to obtain a forklift data set; the forklift data set comprises forklift pictures and non-forklift pictures, and each picture corresponds to one text label. S2, acquiring a training data set, and inputting the training data set into a pre-constructed multi-modal feature alignment model and a large language model to obtain a pre-trained multi-modal feature alignment model and a pre-trained large language model. S3, combining the pre-trained multi-modal feature alignment model and the pre-trained large language model, and inputting the forklift data set to fine-tune the combination to obtain a joint model. S4, quantizing the joint model, and inputting the picture to be detected into the quantized joint model to obtain forklift road-violation information. According to the invention, pictures are identified by the joint model to obtain the forklift road-violation information, so that efficiency and detection accuracy are high and the error rate is low.

Description

Illegal forklift detection method and system based on large language model

Technical Field

The invention relates to the technical field of forklift detection, in particular to a large language model-based illegal forklift detection method and system.

Background

A forklift is not a motor vehicle; it is special equipment and, according to regulations, may not be driven on roads. Because of its particular lighting, structure, braking and mechanical performance, a forklift on the road is highly dangerous to others, poses a great potential safety hazard, and is extremely likely to cause traffic accidents or other accidents.

Therefore, if a forklift is needed for work outside the factory, it must be carried to the destination by another transport vehicle. In practice, however, operators often drive the forklift on the road to save trouble. Discovering and handling forklifts on the road in time is laborious. Regular manual patrols not only incur high labor costs but also cover limited time periods and are difficult to sustain. Monitoring by camera and manually reviewing the video reduces labor costs, but the large number of roads means a huge number of surveillance videos, so manual review remains burdensome; moreover, staring at surveillance video for long periods easily leads to misjudgments and omissions, so review accuracy is difficult to guarantee.

Disclosure of Invention

In view of this, and aiming at the problem that existing detection approaches have difficulty discovering forklifts illegally on the road in a timely and effective manner, it is necessary to provide a large language model-based illegal forklift detection method and system.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

A large language model-based illegal forklift detection method comprises the following steps:

S1, acquiring history pictures, and performing data processing and classification on the history pictures to obtain a forklift data set; the forklift data set comprises forklift pictures and non-forklift pictures, and each picture corresponds to one text label;

S2, acquiring a training data set, and inputting the training data set into a pre-constructed multi-modal feature alignment model and a large language model to obtain a pre-trained multi-modal feature alignment model and a pre-trained large language model;

S3, combining the pre-trained multi-modal feature alignment model and the pre-trained large language model, and inputting the forklift data set to fine-tune the combination to obtain a joint model; the specific steps of fine-tuning the combined pre-trained multi-modal feature alignment model and pre-trained large language model are as follows:

S31, inputting the forklift data set into the pre-trained multi-modal feature alignment model to obtain 1536-dimensional picture semantic features;

S32, setting an instruction question with the text labels contained in the forklift data set as answers, and encoding the instruction question into 512-dimensional question features;

S33, concatenating the 1536-dimensional picture semantic features and the 512-dimensional question features, inputting the concatenated features into the large language model, and outputting a simulated text label;

S34, judging whether the simulated text label is consistent with the text label corresponding to the forklift data set; if not, fine-tuning the instruction and repeating step S32 until the simulated text label is consistent with the text label corresponding to the forklift data set;

S4, quantizing the joint model, inputting the picture to be detected into the quantized joint model, outputting a forklift text label, and thereby obtaining the forklift road-violation information.

Further, the multi-modal feature alignment model is constructed with two Encoder structures that extract features from the picture and the text respectively; feature alignment is trained by contrastive learning, and the picture features and the text features are both output with 1536 dimensions.

Further, the large language model adopts a Decoder structure, and its input feature dimension is 2048.

Further, the specific steps of constructing the pre-trained multi-modal feature alignment model are as follows:

inputting the training data set into the multi-modal feature alignment model for pre-training until the picture features output for the pictures in the training data set are aligned with the text features corresponding to those pictures.

Further, the construction of the pre-trained large language model specifically comprises the following steps:

inputting 2048-dimensional text features from the training data set into a large language model with 7B parameters, and training in an autoregressive manner until text labels in a preset format are generated; wherein the total amount of text in the training data set is greater than 25 million.

Further, the specific step of quantizing the joint model is to convert the parameter storage mode of the joint model from float32 to int8.

A large language model-based illegal forklift detection system comprises a video acquisition module, a forklift identification module, an illegal video acquisition module, an early warning module and a processing module.

The video acquisition module is used for acquiring video stream information of the road to be detected in real time.

The forklift identification module is used for carrying out forklift identification on the video stream information frame by frame.

The illegal video acquisition module is used for acquiring a forklift identification result, identifying a picture with a forklift as a starting frame, intercepting a video with a preset time period from video stream information as an illegal video, and storing the illegal video.

The early warning module is used for generating an early warning report according to the illegal video and the corresponding forklift identification result; the early warning report comprises time and position information of the illegal forklift and corresponding illegal videos.

The processing module is used for judging whether the early warning information in the early warning report is accurate; if so, the early warning report is sent to the relevant personnel; otherwise, the starting frame of the illegal video in the early warning report is marked as non-forklift and the early warning report is deleted.

Further, the video acquisition module comprises a vehicle detection unit for extracting videos containing vehicle information in the video stream information.

Further, the forklift identification module comprises a forklift detection unit; the forklift detection unit consists of the joint model obtained by combining and fine-tuning the pre-trained multi-modal feature alignment model and the pre-trained large language model, and is used for identifying forklifts and forklift position information in each frame picture.

Further, the processing module further comprises a sample library unit for storing the starting frame pictures of marked illegal videos.

Compared with the prior art, the invention has the beneficial effects that:

1. The detection method of the invention identifies pictures with the joint model to obtain forklift road-violation information; compared with manual detection, it offers higher efficiency and detection accuracy, a lower error rate and lower labor costs.

2. The detection system of the invention can not only detect forklift violations and acquire the time and position of each violation, but also send early warnings of the detected results to the relevant personnel so that they can handle them in time.

Drawings

The disclosure of the present invention is described with reference to the accompanying drawings. It is to be understood that the drawings are designed solely for the purposes of illustration and not as a definition of the limits of the invention. Wherein:

FIG. 1 is a flowchart of a large language model-based illegal forklift detection method according to embodiment 1 of the present invention;

FIG. 2 is a block diagram of a large language model-based illegal forklift detection system according to embodiment 2 of the present invention.

Detailed Description

It is to be understood that, according to the technical solution of the present invention, those skilled in the art may propose various alternative structural modes and implementation modes without changing the true spirit of the present invention. Accordingly, the following detailed description and drawings are merely illustrative of the invention and are not intended to be exhaustive or to limit the invention to the precise form disclosed.

Example 1

Referring to FIG. 1, this embodiment describes a large language model-based illegal forklift detection method, which includes the following steps:

Step 1, acquiring history pictures, and performing data processing and classification on the history pictures to obtain a forklift data set; the forklift data set comprises forklift pictures and non-forklift pictures, and each picture corresponds to one text label.

Both forklift pictures and non-forklift pictures in the forklift data set have a text label. For example, the text label takes the form "whether a forklift is included, forklift position list". For a forklift picture, the text label is "True, [x1, y1, x2, y2 …]", where True indicates that a forklift is included and [x1, y1, x2, y2 …] gives the forklift positions. For a non-forklift picture, the text label is "False, []", where False indicates that no forklift is included and [] indicates that there is no position information.
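The label format can be illustrated with a short sketch. It assumes that each forklift contributes four integer coordinates (x1, y1, x2, y2) to the position list and that these are box corners; the function names `format_label` and `parse_label` are hypothetical and do not come from the patent.

```python
import ast
from typing import List, Tuple

# Assumption: each forklift position is a box given by two corner points.
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


def format_label(boxes: List[Box]) -> str:
    """Build a label of the form "<contains forklift>, <flat position list>"."""
    flat = [coord for box in boxes for coord in box]
    return f"{bool(boxes)}, {flat}"


def parse_label(label: str) -> Tuple[bool, List[Box]]:
    """Parse a label back into a flag and a list of boxes."""
    flag_text, list_text = label.split(",", 1)
    has_forklift = flag_text.strip() == "True"
    flat = ast.literal_eval(list_text.strip())
    if len(flat) % 4 != 0:
        raise ValueError("position list must hold four coordinates per forklift")
    boxes = [tuple(flat[i:i + 4]) for i in range(0, len(flat), 4)]
    return has_forklift, boxes


if __name__ == "__main__":
    print(format_label([(10, 20, 110, 220)]))
    print(parse_label("False, []"))
```

A flat list keeps the label a single short line of text, which suits a model that emits the label as generated text.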

Step 2, acquiring a training data set, and inputting the training data set into a pre-constructed multi-modal feature alignment model and a large language model to obtain a pre-trained multi-modal feature alignment model and a pre-trained large language model.

The training dataset contains pictures for training the multimodal feature alignment model and text for training the large language model.

The multi-modal feature alignment model uses two Encoder structures to extract features from the picture and the text respectively. Both Encoders are built from Transformer components, can extract features from pictures and text of arbitrary size and length, and output features whose length is fixed at 1536 dimensions; the Transformer structure thus aligns the two modalities, which lie in different feature spaces. Feature alignment is trained by contrastive learning, with the loss function $L_q = -\log \frac{\exp(q \cdot k_+ / \tau)}{\sum_{i} \exp(q \cdot k_i / \tau)}$, where $q$ is the feature vector of the query sample, $k_+$ is the feature vector of the positive sample, $\tau$ is the temperature hyperparameter (a scalar, 0.7 by default), $i$ is the index of a comparison sample, and $k_i$ is the feature vector of the $i$-th comparison sample.
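As a rough illustration of the contrastive objective, the sketch below assumes the loss takes the common InfoNCE form, that is, the negative log of exp(q·k+/τ) divided by the sum of exp(q·k_i/τ) over the comparison samples, with the positive sample included among them. The function names are hypothetical, and plain Python lists stand in for the 1536-dimensional feature vectors.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(
    q: Sequence[float],
    k_pos: Sequence[float],
    k_all: Sequence[Sequence[float]],
    tau: float = 0.7,
) -> float:
    """InfoNCE-style loss for one query.

    Assumption: k_all holds every comparison sample, including k_pos.
    """
    logits = [dot(q, k) / tau for k in k_all]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - dot(q, k_pos) / tau


if __name__ == "__main__":
    picture = [1.0, 0.0]
    matching_text = [1.0, 0.0]
    other_text = [0.0, 1.0]
    print(contrastive_loss(picture, matching_text, [matching_text, other_text]))
```

The loss is small when the picture feature is closer to its own text feature than to the other comparison samples, which is the alignment behaviour the pre-training aims for.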

The large language model adopts a Decoder structure, and its input feature dimension is 2048, that is, the context length is 2048. The Decoder structure is similar to that of GPT-3.

The multi-modal feature alignment model is pre-trained on the picture-text pairs in the training data set so that picture semantics and text semantics are aligned, that is, the picture features output for a picture are aligned with the text features of the text corresponding to that picture.

The large language model, which has 7B parameters, is pre-trained on a public Chinese and English text training set with a total amount of text of 25 million. Training is carried out in an autoregressive manner: given the input text, the model generates the next text, so that the large language model acquires text understanding and generation capability. Note that the input text feature dimension is 2048.

Step 3, combining the pre-trained multi-modal feature alignment model and the pre-trained large language model, and inputting the forklift data set to fine-tune the combination to obtain a joint model; the specific steps of fine-tuning the combined pre-trained multi-modal feature alignment model and pre-trained large language model are as follows:

Step 31, inputting the forklift data set into the pre-trained multi-modal feature alignment model to obtain 1536-dimensional picture semantic features;

Step 32, setting an instruction question with the text labels contained in the forklift data set as answers, and encoding the instruction question into 512-dimensional question features;

Step 33, concatenating the 1536-dimensional picture semantic features and the 512-dimensional question features, inputting the concatenated features into the large language model, and outputting a simulated text label;

Step 34, judging whether the simulated text label is consistent with the text label corresponding to the forklift data set; if not, fine-tuning the instruction and repeating step 32 until the simulated text label is consistent with the text label corresponding to the forklift data set.

The pre-trained multi-modal feature alignment model is combined with the large language model, and the joint model is instruction fine-tuned on the forklift data set. First, a picture is input into the multi-modal feature alignment model to obtain 1536-dimensional picture semantic features. Second, the instruction is designed as the required question, for example "Is there a forklift in the picture?", and the question is encoded to 512 dimensions, with any unused positions padded with 0. Because the input to the large language model is 2048-dimensional, the 1536-dimensional picture semantic features are concatenated with the 512-dimensional question encoding and input into the large language model. The expected output is the text label of the picture in the forklift data set: "The picture contains a forklift; forklift 1 is at x1, y1, x2, y2; forklift 2 is at x3, y3, x4, y4 …" or "The picture does not contain a forklift". The loss is calculated from the predicted result (whether the picture contains a forklift, and the forklift positions) and the text label of the picture in the forklift data set; if the loss exceeds the allowable error range, the instruction is fine-tuned until the loss falls within that range.
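The dimension bookkeeping of this step (1536 + 512 = 2048) can be sketched as follows. The sketch assumes the question has already been turned into a list of integer codes by some encoder that the patent does not specify; `pad_question` and `build_llm_input` are hypothetical helper names.

```python
from typing import List

IMAGE_DIM = 1536      # picture semantic feature length
QUESTION_DIM = 512    # question encoding length
LLM_INPUT_DIM = 2048  # input dimension of the large language model


def pad_question(codes: List[int], length: int = QUESTION_DIM) -> List[int]:
    """Right-pad the question codes with 0 up to the fixed length."""
    if len(codes) > length:
        raise ValueError("question encoding is longer than the fixed length")
    return codes + [0] * (length - len(codes))


def build_llm_input(image_features: List[float], question_codes: List[int]) -> List[float]:
    """Concatenate picture features and the padded question encoding."""
    if len(image_features) != IMAGE_DIM:
        raise ValueError("picture features must be 1536-dimensional")
    joined = list(image_features) + [float(c) for c in pad_question(question_codes)]
    assert len(joined) == LLM_INPUT_DIM
    return joined


if __name__ == "__main__":
    vector = build_llm_input([0.5] * IMAGE_DIM, [7, 8, 9])
    print(len(vector))
```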

Step 4, quantizing the joint model, inputting the picture to be detected into the quantized joint model, outputting a forklift text label, and thereby obtaining the forklift road-violation information.

The joint model is then quantized. Model parameters are stored as float32 during training; after quantization they are stored as int8, which reduces the number of bytes occupied by 3/4 (from 4 bytes to 1 byte per parameter) and can also greatly increase the model inference speed.

The picture to be detected is obtained from road surveillance video. Target detection is used to find the video segments that contain vehicle information; segments that contain no vehicle information are removed, and only the video stream containing vehicle information is retained. Frames are then extracted one by one from the video stream containing vehicle information to obtain the pictures to be detected, which reduces the amount of processing.
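The patent does not say which quantization scheme is used. As a minimal sketch, the code below assumes symmetric per-tensor quantization with a single max-abs scale, which is only one of several common choices; `quantize_int8` and `dequantize` are hypothetical names. The standard-library `array` module is used to make the 4-byte versus 1-byte storage difference visible.

```python
from array import array
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[array, float]:
    """Map float weights to int8 using one scale (assumed symmetric scheme)."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range used here.
    quantized = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return quantized, scale


def dequantize(quantized: array, scale: float) -> List[float]:
    """Recover approximate float weights from the int8 values."""
    return [value * scale for value in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_int8(weights)
    float_bytes = len(weights) * array("f").itemsize
    int8_bytes = len(q) * q.itemsize
    print(float_bytes, int8_bytes)
    print(dequantize(q, scale))
```

Each parameter shrinks from four bytes to one, which is where the 3/4 reduction in storage comes from; the price is a rounding error of at most half a quantization step per weight.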
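The pre-filtering idea can be sketched as a generator that forwards only frames in which a vehicle was detected. The vehicle detector itself is not described in the patent, so it is represented here by a caller-supplied function `contains_vehicle`, which is a hypothetical placeholder.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")


def frames_to_detect(
    frames: Iterable[Frame],
    contains_vehicle: Callable[[Frame], bool],
) -> Iterator[Frame]:
    """Yield only the frames that the vehicle detector flags.

    Frames without vehicles are dropped before the (more expensive)
    joint model is ever called.
    """
    for frame in frames:
        if contains_vehicle(frame):
            yield frame


if __name__ == "__main__":
    # Toy frames: a dict with a precomputed vehicle count stands in for an image.
    stream = [{"id": 0, "vehicles": 0}, {"id": 1, "vehicles": 2}]
    for frame in frames_to_detect(stream, lambda f: f["vehicles"] > 0):
        print(frame["id"])
```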

In practical application, the feature alignment model is trained as described above, and the large language model is trained in an autoregressive manner. Take the text "I had a meal today and met a friend" as an example; a massive amount of similar long text is used in training. During training, when "I had a meal today and met a" is input, the model is required to predict that the next word is "friend". At the beginning of training the model cannot predict "friend" correctly, so there is an error between its prediction and the true value "friend"; this error is called the loss. The model parameters are updated iteratively by back propagation, so that the prediction error becomes smaller and smaller and the predictions become more and more accurate. This is the training process of the large model.
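The autoregressive setup can be made concrete with a toy sketch. It assumes whitespace-separated words as tokens and a cross-entropy loss on the next token, which is the usual choice but is not stated in the patent; the example sentence is an English rendering of the one in the text, and both function names are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Turn one text into (context, next token) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def next_token_loss(predicted: Dict[str, float], target: str) -> float:
    """Cross-entropy for one prediction: -log of the probability given to the target."""
    return -math.log(predicted[target])


if __name__ == "__main__":
    tokens = "I had a meal today and met a friend".split()
    context, target = next_token_pairs(tokens)[-1]
    print(" ".join(context), "->", target)
    # An untrained model spreads probability widely, so the loss is high.
    print(next_token_loss({"friend": 0.01, "car": 0.99}, "friend"))
    # After training the target receives more probability and the loss drops.
    print(next_token_loss({"friend": 0.9, "car": 0.1}, "friend"))
```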

The model is fine-tuned by SFT (supervised fine-tuning), which trains the large language model to answer questions in a conversational manner rather than simply continuing the text as before. The input is the concatenation of a picture feature vector and a question encoding. The picture feature vector is generated by the picture Encoder of the feature alignment model, which converts the picture into a 1536-dimensional feature. The question is a fixed text, "Does the picture contain a forklift? If so, where is the forklift?", which is encoded with an encoding length of 512, with the unused positions uniformly encoded as 0. The output is a text corresponding to the data set label constructed in step 1: "The picture contains a forklift; forklift 1 is at x1, y1, x2, y2; forklift 2 is at x3, y3, x4, y4 …" or "The picture does not contain a forklift".
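One way to picture a fine-tuning sample is as a fixed question paired with an answer sentence generated from the forklift boxes. The sketch below is illustrative only: the exact wording of the question and answer is an English paraphrase, and `answer_text` and `sft_sample` are hypothetical names.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), assumed corner coordinates

# English paraphrase of the fixed question; the original wording is not reproduced here.
QUESTION = "Does the picture contain a forklift? If so, where is the forklift?"


def answer_text(boxes: List[Box]) -> str:
    """Render the expected conversational answer for one picture."""
    if not boxes:
        return "The picture does not contain a forklift."
    parts = [
        f"forklift {i} is at {x1}, {y1}, {x2}, {y2}"
        for i, (x1, y1, x2, y2) in enumerate(boxes, start=1)
    ]
    return "The picture contains a forklift; " + "; ".join(parts) + "."


def sft_sample(boxes: List[Box]) -> Dict[str, str]:
    """Pair the fixed question with the answer derived from the label."""
    return {"question": QUESTION, "answer": answer_text(boxes)}


if __name__ == "__main__":
    print(sft_sample([(1, 2, 3, 4), (5, 6, 7, 8)])["answer"])
    print(sft_sample([])["answer"])
```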

Finally, at inference time, a picture is input. The picture Encoder of the feature alignment model first generates 1536-dimensional picture features, which are then concatenated with the 512-dimensional encoding of the fixed question; the resulting 2048-dimensional vector is input into the large language model, which outputs the label text. The label text indicates whether there is a forklift in the picture and, if so, gives the forklift positions.

Based on this, the overall flow of the method of this embodiment is as follows: S1, preparing a forklift data set; S2, constructing a multi-modal feature alignment model; S3, constructing a large language model; S4, pre-training the multi-modal feature alignment model; S5, pre-training the large language model; S6, combining the pre-trained multi-modal feature alignment model with the large language model, and fine-tuning the joint model on the forklift data set; S7, quantizing the joint model; S8, for a picture to be identified, invoking the quantized joint model to obtain the output result.

The detection method identifies pictures with the joint model to obtain forklift road-violation information; compared with manual detection, it offers higher efficiency and detection accuracy, a lower error rate and lower labor costs.

Example 2

Referring to fig. 2, the embodiment introduces a large language model-based illegal forklift detection system, which includes a video acquisition module, a forklift identification module, an illegal video acquisition module, an early warning module and a processing module.

The video acquisition module is used for acquiring video stream information of the road to be detected in real time. The video acquisition module comprises a vehicle detection unit and is used for extracting videos containing vehicle information from the video stream information and automatically removing video streams which do not contain the vehicle information, so that the times of calling the forklift identification module can be greatly reduced. The video acquisition module can utilize a camera, and an integrated micro vehicle detection unit is embedded in the camera chip.

The forklift identification module is used for carrying out forklift identification on the video stream information frame by frame. The forklift identification module comprises a forklift detection unit, and the forklift detection unit adopts the quantized joint model constructed in embodiment 1. The video stream information acquired by the video acquisition module is identified frame by frame by the joint model.

The illegal video acquisition module is used for acquiring a forklift identification result, identifying a picture with a forklift as a starting frame, intercepting a video with a preset time period from video stream information as an illegal video, and storing the illegal video. The offending video may be saved to a special problem video database for later recall.
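The clipping step can be sketched as simple frame arithmetic. The patent only says that a video of a preset time period is intercepted starting from the frame in which a forklift is identified; the frame rate, the clip duration and the function name `clip_frame_range` below are assumptions made for illustration.

```python
from typing import Tuple


def clip_frame_range(
    start_frame: int,
    fps: float,
    duration_s: float,
    total_frames: int,
) -> Tuple[int, int]:
    """Return the inclusive (first, last) frame indices of the clip.

    The clip begins at the frame where a forklift was identified and lasts
    for the preset duration, truncated at the end of the available stream.
    """
    if not 0 <= start_frame < total_frames:
        raise ValueError("start_frame is outside the stream")
    frame_count = max(1, int(round(fps * duration_s)))
    last_frame = min(total_frames - 1, start_frame + frame_count - 1)
    return start_frame, last_frame


if __name__ == "__main__":
    # Hypothetical values: 25 frames per second, 10-second clip.
    print(clip_frame_range(start_frame=100, fps=25, duration_s=10, total_frames=1000))
```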

The processing module is used for judging whether the early warning information in the early warning report is accurate; if so, the early warning report is sent to the relevant personnel; otherwise, the starting frame of the illegal video in the early warning report is marked as non-forklift and the early warning report is deleted. In addition, when the early warning report is deleted, the marked illegal video is deleted from the problem video database, and the starting frame of the marked illegal video is stored in a sample library unit in the processing module to accumulate sample material, which is mainly used for continuously optimizing the quantized joint model.
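The branching handled by the processing module can be sketched with in-memory containers. This assumes the accuracy judgment arrives as a boolean from a reviewer; the class and field names (`WarningReport`, `ProcessingModule`, `sample_library` and so on) are hypothetical, and real storage would be a database rather than Python collections.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class WarningReport:
    report_id: str
    time: str
    location: str
    video_id: str
    start_frame: str  # identifier of the starting frame picture


@dataclass
class ProcessingModule:
    reports: Dict[str, WarningReport] = field(default_factory=dict)
    problem_videos: Set[str] = field(default_factory=set)
    sample_library: List[Tuple[str, str]] = field(default_factory=list)
    sent: List[str] = field(default_factory=list)

    def handle(self, report_id: str, is_accurate: bool) -> None:
        """Send an accurate report; otherwise relabel, archive and delete."""
        report = self.reports[report_id]
        if is_accurate:
            self.sent.append(report_id)
            return
        # False alarm: keep the starting frame as a non-forklift sample.
        self.sample_library.append((report.start_frame, "non-forklift"))
        self.problem_videos.discard(report.video_id)
        del self.reports[report_id]


if __name__ == "__main__":
    module = ProcessingModule()
    module.reports["r1"] = WarningReport("r1", "10:00", "intersection A", "v1", "frame-100")
    module.problem_videos.add("v1")
    module.handle("r1", is_accurate=False)
    print(module.sample_library)
```

Keeping the mislabeled starting frames is what lets false alarms feed back into later retraining of the joint model.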

The actual application process of the system of this embodiment is as follows. Forklift detection cameras are deployed: high-definition cameras are installed at important traffic intersections to capture road conditions, and the vehicle detection unit is integrated inside each camera. If the road contains no vehicle, the frame is handled directly on the device side without calling the forklift identification module, which greatly reduces the load on that module. If the road contains a vehicle, the camera captures real-time frame pictures and sends them to the forklift identification module. The forklift identification module uses the quantized joint model: the picture to be detected is input into the model, which returns whether the picture contains a forklift together with the forklift positions. The result has the form (whether a forklift is contained, forklift position list), where True and False indicate whether a forklift is contained: True indicates that a forklift is contained, in which case the forklift position list is [x1, y1, x2, y2 …]; False indicates that no forklift is contained, in which case the forklift position list is []. If the forklift identification module detects a forklift, the illegal video acquisition module clips a video segment around the problem frame and stores it in the problem video database. Meanwhile, the early warning module generates a corresponding forklift warning, and the processing module sends the warning, together with the corresponding video, to the relevant management personnel. If the warning is correct, the management personnel can handle the violation in time; if the warning is incorrect, the management personnel mark the video in the back end, and the problem frame is labeled as non-forklift and used for iterative training of the forklift detection model.

The detection system of the invention not only can detect the illegal condition of the forklift and acquire the illegal time and position information, but also can early warn the detected result to related personnel so as to facilitate the timely processing of the related personnel.

The technical scope of the present invention is not limited to the above description, and those skilled in the art may make various changes and modifications to the above-described embodiments without departing from the technical spirit of the present invention, and these changes and modifications should be included in the scope of the present invention.

Claims

1. A large language model-based illegal forklift detection method for acquiring forklift road-violation information, characterized by comprising the following steps:

S4, quantizing the parameter storage mode of the joint model from float32 to int8, inputting a picture to be detected into the quantized joint model, outputting a forklift text label, and judging according to the forklift text label to obtain the forklift road-violation information; the forklift text label comprises information on whether a forklift exists and, when a forklift exists, the position information of the forklift.

2. The large language model-based illegal forklift detection method according to claim 1, wherein the multi-modal feature alignment model is constructed with two Encoder structures that extract features from the picture and the text respectively, and feature alignment is trained by contrastive learning, so that the picture features and the text features are both output with 1536 dimensions.

3. The large language model-based illegal forklift detection method according to claim 1, wherein the large language model is constructed by adopting a Decoder structure, and the input feature dimension is 2048.

4. The large language model-based illegal forklift detection method according to claim 1, wherein the construction of the pre-training multi-modal feature alignment model comprises the following specific steps:

5. The large language model-based illegal forklift detection method as claimed in claim 1, wherein the construction of the pre-trained large language model comprises the following specific steps:

6. The method for detecting the illegal forklift based on the large language model as claimed in claim 1, wherein the method for acquiring the picture to be detected comprises the following specific steps:

acquiring a road monitoring video and removing videos which do not contain vehicle information to obtain a video stream containing the vehicle information; and extracting the video stream containing the vehicle information frame by frame to obtain the picture to be detected.

7. A large language model-based illegal forklift detection system, characterized in that it comprises:

the video acquisition module is used for acquiring video stream information of the road to be detected in real time;

the forklift identification module is used for carrying out forklift identification on the video stream information frame by frame;

the illegal video acquisition module is used for acquiring a forklift identification result, identifying a picture with a forklift as a starting frame, intercepting a video with a preset time period from video stream information as an illegal video, and storing the illegal video;

the early warning module is used for generating an early warning report according to the violation videos and the corresponding forklift identification results; the early warning report comprises time and position information of the illegal forklift and corresponding illegal videos;

the processing module is used for judging whether the early warning information in the early warning report is accurate; if so, the early warning report is sent to the relevant personnel; otherwise, the starting frame of the illegal video in the early warning report is marked as non-forklift and the early warning report is deleted;

wherein, when the forklift identification module carries out forklift identification, the steps of the method for detecting the illegal forklift based on the large language model according to any one of claims 1 to 6 are adopted.

8. The large language model-based illegal forklift detection system of claim 7, wherein the video acquisition module includes a vehicle detection unit for extracting video containing vehicle information from the video stream information.

9. The large language model-based illegal forklift detection system of claim 7, wherein the forklift identification module comprises a forklift detection unit; the forklift detection unit consists of the joint model obtained by combining and fine-tuning the pre-trained multi-modal feature alignment model and the pre-trained large language model, and is used for identifying forklifts and forklift position information in each frame picture.

10. The large language model-based illegal forklift detection system of claim 7, wherein the processing module further comprises a sample library unit for storing the starting frame pictures of marked illegal videos.