CN117648929A - Target misrecognition correction method based on a human-like generalized perception mechanism - Google Patents


Info

Publication number
CN117648929A
CN117648929A (application number CN202311389678.9A)
Authority
CN
China
Prior art keywords
scene
triplet
information
text
image description
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311389678.9A
Other languages
Chinese (zh)
Inventor
周劲草
宁本燏
傅卫平
李睿
杨世强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology filed Critical Xian University of Technology
Priority claimed from application CN202311389678.9A
Publication of CN117648929A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target misrecognition correction method based on a human-like generalized perception mechanism. First, image information of traffic scenes at different moments is perceived, an image description model is constructed, image description texts for the corresponding moments are generated, and the semantic similarity between the texts of consecutive moments is computed. Second, a triplet extraction model extracts triplet information from the image description texts of the two moments; the triplets are matched for similarity, and the differing triplets are verified. Finally, a standardized program question is generated and reasoned over, and the output result is sent to the downstream planning and decision task. The method solves the problem of object misrecognition when perceiving complex scenes in the prior art, and improves the robustness and accuracy of object recognition during perception.

Description

Target misrecognition correction method based on a human-like generalized perception mechanism
Technical Field
The invention belongs to the technical field of automatic driving perception methods, and relates to a target misrecognition correction method based on a human-like generalized perception mechanism.
Background
In the automatic driving field, target recognition tasks are carried out effectively thanks to the development of deep learning. However, influenced by factors such as algorithm accuracy, input data quality, and illumination changes, the problem of target misrecognition has still not been effectively solved. Target misrecognition means that a system or algorithm mistakenly recognizes a target object as another object, which reduces the reliability and safety of automatic driving decisions. For example, an animal picture on a billboard may be misrecognized as a real animal, causing an autonomous vehicle to perform unnecessary avoidance maneuvers that in turn affect the safe passage of other traffic participants. As another example, a shadow cast by illumination may be misrecognized as a lane line, causing a dangerous lateral shift of the ego vehicle. Solving the target misrecognition problem is therefore significant for promoting and safeguarding the large-scale commercial deployment and safe operation of automatic driving systems.
From a theoretical perspective, dedicated models could be built for each type of misrecognized object to improve recognition accuracy. In practice, however, urban road scenes contain an uncountable number of unknown objects, and the one-model-per-object approach of traditional target recognition is too costly to popularize at scale. Moreover, compared with target recognition in other computer vision fields, target recognition in an automatic driving system has stricter real-time requirements: it must be processed rapidly, and the misrecognition problem cannot be handled by offline retraining after the fact, as it can in other fields. How to enable an automatic driving system to autonomously discover misrecognition in real time, complete correction online automatically, and provide accurate perception input for subsequent decision and planning tasks remains unsolved, leaving many uncertainties in the safety and reliability of automatic driving decisions.
To address the potential safety hazards that the misrecognition problem may cause, a target misrecognition correction method based on a human-like generalized perception mechanism is constructed. Using image semantic information and knowledge-based cognition as its means, the method builds a human-like generalized perception mechanism for automatic driving systems, solving the target misrecognition problem in the automatic driving field and providing reliable and effective perception input for automatic driving. This comprehensive perception model aims to improve the perception capability of autonomous vehicles in complex scenes and provides new ideas and solutions for the further development of automatic driving technology.
Disclosure of Invention
The invention aims to provide a target misrecognition correction method based on a human-like generalized perception mechanism, which solves the problem of object misrecognition when perceiving complex scenes in the prior art and improves the robustness and accuracy of object recognition during perception.
The technical scheme adopted by the invention is a target misrecognition correction method based on a human-like generalized perception mechanism: first, perceive image information of traffic scenes at different moments, construct an image description model, generate image description texts for the corresponding moments, and compute the semantic similarity between the texts of consecutive moments; second, use a triplet extraction model to extract triplet information from the image description texts of the two moments, match the triplets for similarity, and verify the differing triplets; finally, generate a standardized program question, reason over it, and send the output result to the downstream planning and decision task.
The invention is also characterized in that:
the target misrecognition correction method based on a human-like generalized perception mechanism is implemented according to the following steps:
step 1, sensing image information of traffic scenes at different times;
step 2, constructing an image description model;
step 3, generating an image description text corresponding to the moment based on the image description model in the step 2;
step 4, compute the semantic similarity of the image description texts of the two consecutive moments obtained in step 3; if the similarity is above a threshold, return to step 1 and continue describing the scene at the next moment; if it is below the threshold, execute the next step;
step 5, extracting the triplet information of the image description text before and after the moment in the step 4 by using a triplet extraction model;
step 6, perform similarity matching on the triplet information obtained in step 5 to obtain the triplet information in which the later scene has changed compared with the earlier scene;
step 7, verify the differing triplets obtained in step 6 using the Deep Path model; if they conform to the prediction, return to step 1 and continue describing the scene at the next moment; if not, execute the next step;
step 8, generating a standardized program question;
step 9, reasoning the question generated in the step 8 by using a large language model;
and step 10, sending the output result obtained in the step 9 to a downstream planning decision task, and providing safer and more reliable data for a subsequent task.
The image description model is constructed as follows: a convolutional neural network extracts and encodes features of the scene image, and a long short-term memory (LSTM) network decodes them to capture text information about traffic scene elements; meanwhile, YOLO v8 identifies traffic signs and signal lamps, and the motion state information of scene elements is obtained through binocular vision and a tracking algorithm; GPS and map software are used to perceive the GPS information of the ego vehicle in the scene and obtain the required macroscopic position and time information.
The specific process of generating the image description text is as follows: the image description model produces a description text of the traffic scene elements, a motion-state description text of the scene elements, and a position-and-time description text of the ego vehicle, and a multi-subject fusion algorithm combines them into the complete description text for the corresponding moment.
The triplet extraction model is constructed as follows: text analysis is performed on a traffic scene text information library to obtain a traffic scene text corpus; a labeling platform is then used to complete text annotation, yielding a triplet dataset containing entities and relations; the dataset is used to train, validate, and test the model, giving the final triplet extraction model.
The specific process of verifying the differing triplets is as follows: assuming the number of differing triplets obtained is P, the P head-tail entity pairs of the differing triplets (H1, H2, …, HP) are input into the reasoning model to obtain P inferred relations, which are compared with the relations of the corresponding head-tail entities in the knowledge graph constructed from prior experience.
The standardized program question is generated as follows: if verification does not conform to the prediction, text preprocessing is performed on the differing triplets to obtain a standardized program question Q conforming to the large language model.
The beneficial effects of the invention are as follows:
(1) By fusing knowledge graphs, deep learning, and a large language model, the method constructs a target misrecognition correction method based on a human-like generalized perception mechanism. It abandons target recognition based solely on image visual features and instead starts jointly from the image semantic level and the human knowledge-cognition level, performing initial recognition, re-verification, and reverse inference of objects; compared with target perception based only on road scene image information, it has better accuracy and robustness;
(2) In the scene perception process, the method solves the target misrecognition problem in the automatic driving field through a human-like generalized perception mechanism, providing reliable and effective perception input for automatic driving technology;
(3) Based on a human-like way of thinking, the method extracts the semantic level of scene images to obtain multi-dimensional scene semantic information for identifying scene changes, mining the intrinsic characteristics of the scene more deeply from the semantic level; compared with judging scene changes only from image features of the road scene, it rises from an objective, concrete level to a subjective, summarizing level, improving the accuracy and transferability of scene change detection;
(4) The method provides a target re-verification link and a criterion for judging target misrecognition, avoiding the "seeing is not necessarily believing" problem of traditional perception methods; by comparing the triplet relations actually extracted from the perception result with the triplet relations stored in prior knowledge, target misrecognition detection is realized indirectly, effectively improving the reliability of the perception result;
(5) Based on the image description model, the method better overcomes the problems of traditional image descriptions being one-dimensional, lacking scene object attributes, and lacking dynamic description, so the current scene can be perceived more completely and accurately; based on the large language model, the method can better infer what the recognized object actually is, providing more reliable and safer data support for subsequent planning, decision, and other tasks.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a process block diagram of the method of the present invention;
FIG. 3 is a schematic diagram of a text description generated by a target primary recognition module in a scene of setting a recognition area in the method of the present invention;
FIG. 4 is a schematic overview of knowledge graph generated by a dataset in the method of the present invention;
FIG. 5 is an enlarged schematic view of a portion of the knowledge-graph generated by the dataset of FIG. 4;
FIG. 6 is a schematic diagram of a partial triplet generated by an embodiment of the invention;
FIG. 7 is a schematic diagram of the reverse inference of the large language model in the method of the present invention.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention discloses a target misrecognition correction method based on a human-like generalized perception mechanism, which aims to solve the misrecognition problem of autonomous-vehicle targets in the automatic driving field and increases the comprehensiveness and reliability of target perception in complex traffic scenes. First, the initial recognition link of the mechanism establishes an image description model and, for the scenes at consecutive moments, generates multi-dimensional semantic description texts; their similarity is then computed to judge, at the scene semantic level, whether the scene has changed significantly between the two moments. Second, the re-verification link applies triplet extraction and knowledge-graph reasoning: from the semantic description texts generated for the driving scene, changed triplets are obtained through triplet extraction and matching; the head and tail entities of a changed triplet are input into the verification model to predict the relation between them, and the predicted relation is compared with the corresponding triplet relation stored in prior knowledge, thereby indirectly detecting target misrecognition. Third, the reverse inference link is based on a large language model: the triplet information produced by the verification model is used as input to the large language model to reason about what the object actually is, completing the human-like generalized perception mechanism.
The method establishes a multi-dimensional scene image description model, extracts features of the current traffic scene, and obtains the description text of scene elements through encoding and decoding; YOLO v8 identifies traffic signs, signal lamps, and the like, and the motion state information of scene elements is obtained through binocular vision and a tracking algorithm, generating a motion-state description text; the macroscopic position and time of the ego vehicle are obtained using GPS and the Baidu API, generating a time-and-position description text. Finally, all the texts are fused to generate a comprehensive traffic scene description, so that the semantic similarity of the texts generated at consecutive moments can be computed.
A description text library of traffic scenes is acquired; through preprocessing and annotation on a labeling platform, a traffic-scene triplet dataset is constructed, and the triplet extraction model is trained, validated, and tested on this dataset. Based on the triplet extraction model and the complete traffic scene description text generated in the initial link, the generated text is input into the model to obtain the triplets describing the current scene. The head-tail entity data of the differing triplets are used as input for Deep Path model reasoning and compared with the original relations to verify whether the triplets are correct; if verification fails, the triplet data are further input into the large language model for reasoning, and the obtained result provides data for subsequent planning, decision, and other tasks.
The method of the invention proceeds through the following stages: obtaining traffic scene image information, constructing the image description model, computing semantic similarity, extracting triplet information with the triplet extraction model, matching the triplet information, Deep Path model verification, and large language model reasoning.
The invention discloses a target misrecognition correction method based on a human-like generalized perception mechanism, implemented as shown in figures 1 and 2, specifically comprising the following steps:
Step 1, perceive image information of traffic scenes at different moments (t1, …, tn).
As shown in the scene-information text description module in fig. 2, the vehicle-mounted camera perceives real-time scene photos at a certain time-sequence interval, obtaining the required image information.
And 2, constructing an image description model.
As shown in the scene-information text description module in fig. 2, the description text of the traffic scene elements, the motion-state description text of the scene elements, and the time-and-position description text of the ego vehicle are acquired separately to construct the image description model. The specific construction is as follows: a convolutional neural network extracts and encodes features of the scene image, and a decoder such as a long short-term memory (LSTM) network captures the text information of the traffic scene elements; meanwhile, YOLO v8 identifies traffic signs and signal lamps in the road scene, realizing recognition of road scene elements; distance, speed, and similar information of vehicles and pedestrians ahead is obtained according to the binocular vision principle and a tracking algorithm, and the obtained speeds and distances are discretized into several grades (e.g., slow, faster, fast; near, far) so that the motion-state text of the scene elements is expressed more clearly and directly; the vehicle-mounted GPS perceives the GPS information of the scene, and the Baidu API processes the real-time scene photo and GPS information to obtain the position-and-time text of the ego vehicle. Finally, all the texts are fused into a complete description of the traffic scene, forming the image description model.
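The discretization of measured speeds and distances into qualitative grades can be sketched as below. The threshold values and the phrasing of the output fragment are illustrative assumptions, not values given in the patent:

```python
def speed_grade(speed_mps: float) -> str:
    """Map a measured speed (m/s) to a qualitative grade (thresholds assumed)."""
    if speed_mps < 5.0:
        return "slow"
    if speed_mps < 12.0:
        return "faster"
    return "fast"


def distance_grade(distance_m: float) -> str:
    """Map a measured distance (m) to a qualitative grade (threshold assumed)."""
    return "near" if distance_m < 20.0 else "far"


def motion_state_text(element: str, speed_mps: float, distance_m: float) -> str:
    """Compose a motion-state description fragment for one scene element."""
    return f"{element} is {distance_grade(distance_m)}, speed is {speed_grade(speed_mps)}"
```

Expressing the numeric measurements as a small, fixed vocabulary keeps the later triplet extraction and knowledge-graph matching over a closed set of relation tails.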
Step 3, generate the image description texts (L1, …, Ln) for the corresponding moments.
As shown in the scene-information text description module in fig. 2, based on the image description model of step 2, the text information at moment t1 is first acquired, including: the description text La of the traffic scene elements, such as the traffic participants in the current scene and the type and depth information of traffic facilities; the ego-vehicle time-and-position description text Lb, such as the position and time of the vehicle in the current scene; and the motion-state description text Lc of the scene elements, such as the motion speed of traffic participants (vehicles, pedestrians, etc.) in the current scene. The obtained texts La, Lb, and Lc undergo subject fusion through a multi-subject fusion algorithm, yielding the complete description text L1 containing the multi-dimensional scene information. The same method is applied to the scene at moment tn, obtaining the complete description text Ln of the next moment.
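The patent names a "multi-subject fusion algorithm" without specifying it; the minimal sketch below simply orders and concatenates the three sub-texts (position/time, scene elements, motion states) into one complete description, purely to illustrate the input/output shape:

```python
def fuse_description_texts(la: str, lb: str, lc: str) -> str:
    """Assumed minimal fusion of the three sub-texts into one complete
    description L: Lb (ego time/position) first, then La (scene elements),
    then Lc (motion states). Empty sub-texts are skipped."""
    parts = [lb.strip(), la.strip(), lc.strip()]
    return " ".join(p for p in parts if p)
```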
And 4, carrying out semantic similarity solving on the image description texts of the front and rear moments obtained in the step 3.
As shown in the similarity matching module in fig. 2, similarity matching is performed on the complete description texts L1 and Ln of the two moments obtained in step 3. If the similarity is above the threshold, return to step 1 and continue describing the scene at the next moment; if it is below the threshold, execute the next step.
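The patent does not fix a similarity metric; a bag-of-words cosine similarity is sketched here as an illustrative stand-in (a sentence-embedding model could be substituted), together with the threshold test that decides whether to proceed to triplet extraction:

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Bag-of-words cosine similarity between two description texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def scene_changed(l1: str, ln: str, threshold: float = 0.8) -> bool:
    """Below-threshold similarity is treated as a scene change (go to step 5);
    otherwise the loop returns to step 1. The threshold value is assumed."""
    return cosine_similarity(l1, ln) < threshold
```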
Step 5, use the triplet extraction model to extract the triplet information of the complete description texts L1 and Ln from step 4.
As shown in the dataset module in fig. 2, the triplet extraction model is constructed first: text analysis is performed on the traffic scene text information library to obtain a traffic scene text corpus; text annotation is completed on a labeling platform to obtain a triplet dataset containing entities and relations, generating the knowledge graph shown in figs. 4 and 5, i.e., the traffic scene knowledge graph; the triplet dataset is used to train, validate, and test the model, giving the final triplet extraction model. Based on this model, structured triplet extraction is performed on the input texts L1 and Ln to obtain their triplets.
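The trained triplet extraction model cannot be reproduced here; the sketch below substitutes a few hand-written regular-expression patterns (illustrative assumptions, not the patent's model) purely to show the input/output shape of structured (head, relation, tail) extraction from a description text:

```python
import re

# Placeholder patterns standing in for the trained extraction model.
PATTERNS = [
    (re.compile(r"(\w+) is on (\w+)"), "on"),
    (re.compile(r"(\w+) speed is (\w+)"), "speed"),
]


def extract_triplets(text: str) -> list[tuple[str, str, str]]:
    """Extract (head, relation, tail) triplets from a description text."""
    triplets = []
    for pattern, relation in PATTERNS:
        for head, tail in pattern.findall(text):
            triplets.append((head, relation, tail))
    return triplets
```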
And 6, performing similarity matching on the triplet information obtained in the step 5.
As shown in the verification module in fig. 2, similarity matching is performed between the K triplets (G1, G2, …, GK) extracted from text L1 and the X triplets (D1, D2, …, DX) extracted from text Ln in step 5, where K, X ≥ 1. Among (D1, D2, …, DX), the P triplets (H1, H2, …, HP) that differ from (G1, G2, …, GK) are obtained, i.e., the triplet information in which the later scene has changed compared with the earlier scene.
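With exact-match comparison assumed as the matching criterion (the patent says only "similarity matching"), the difference triplets H1, …, HP can be sketched as a set difference:

```python
def difference_triplets(prev: list[tuple], curr: list[tuple]) -> list[tuple]:
    """Return the triplets of the later scene (D1..DX) that do not appear
    among the earlier scene's triplets (G1..GK) - the changed triplets H1..HP."""
    prev_set = set(prev)
    return [t for t in curr if t not in prev_set]
```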
And 7, verifying the differential triples obtained in the step 6 by using a Deep Path model.
Because the automatic driving knowledge graph constructed from prior experience can essentially cover the relations of different head-tail entities across driving scenes, the head entity H and tail entity T of a triplet generated from visual information are input into the Deep Path model to infer the relation R; the relation R* of the corresponding head entity H* and tail entity T* is then looked up in the originally constructed knowledge graph, and R is compared with R*. If they are inconsistent, misrecognition is considered present; if they are consistent, step 1 is executed again. As shown in the verification module in fig. 2, the P head-tail entity pairs of the P triplets (H1, H2, …, HP) are used to obtain P inferred relations. The inferred relations are compared with the relations of the corresponding head-tail entities in the knowledge graph constructed from prior experience; if they conform to the prediction, return to step 1 and continue describing the scene at the next moment; if not, execute the next step.
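The comparison logic of this verification step can be sketched as follows. The relation predictor is passed in as a function standing in for the Deep Path model (reproducing that model is out of scope here), and the prior knowledge graph is reduced to a dictionary keyed by (head, tail) pairs; both are illustrative simplifications:

```python
from typing import Callable


def verify_triplet(
    triplet: tuple[str, str, str],
    prior_kg: dict[tuple[str, str], str],
    predict_relation: Callable[[str, str], str],
) -> bool:
    """Infer the relation for (head, tail) via the predictor (stand-in for
    Deep Path) and compare it with the relation R* stored in the prior
    knowledge graph. True means the triplet conforms (no misrecognition
    suspected); False triggers the reverse inference link."""
    head, _, tail = triplet
    inferred = predict_relation(head, tail)
    expected = prior_kg.get((head, tail))
    return expected is not None and inferred == expected
```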
And 8, generating a standardized program question.
As shown in the reverse inference module in fig. 2, if verification in step 7 does not conform to the prediction, text preprocessing, including screening and grammar adjustment, is performed on the P differing triplets (H1, H2, …, HP) to obtain a standardized program question Q conforming to the large language model format.
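The question template below follows the style of the elephant example given later in the description ("something like an elephant, …, what is this something?"); the template wording and the triplet phrasing are assumptions for illustration:

```python
def build_question(triplets: list[tuple[str, str, str]]) -> str:
    """Assemble the differing triplets about one object into a standardized
    question Q for the large language model."""
    clauses = [f"something {relation} {tail}" for _head, relation, tail in triplets]
    return ", ".join(clauses) + ". What is this something?"
```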
And 9, reasoning by using a large language model.
As shown in the reverse inference module in fig. 2, the standardized program question Q obtained in step 8 is input into the large language model, and the output result is obtained.
And step 10, planning decision.
As shown in the reverse inference module in fig. 2, the output result obtained in step 9 is sent to downstream tasks such as planning and decision, providing safer and more reliable data for subsequent tasks.
Example 1
The specific implementation method of the embodiment is as follows:
and step 1, obtaining real-time sequence scene photos, wherein the scene photos are image frames extracted at certain intervals according to real-time videos generated by an image pickup sensor on an autonomous vehicle.
Step 2: elements within the set recognition area of the traffic scene are identified based on the image description model. As shown in fig. 3, a white truck, cars, traffic lights, pedestrians, and so on are identified in the image, and the model feeds the target recognition result into the LSTM decoding operation to generate the description text of the traffic scene elements. Meanwhile, the model uses the YOLO v8 network to identify traffic signs and signal lamps, and combines a tracking algorithm with the binocular vision ranging principle to track and measure the speed of targets ahead, outputting the motion-state description text of the elements, e.g., recognizing that the traffic light in fig. 3 is green and that a vehicle is traveling at a faster speed. The Baidu Maps API and the vehicle-mounted GPS are called to produce the position-and-time description text of the ego vehicle. The description text of the traffic scene elements, their motion-state description text, and the ego-vehicle position-and-time description text are fused to generate the complete traffic scene description text; the displayed result is the Chinese text shown in fig. 3.
Steps 3 to 4: based on the image frames extracted at the predetermined interval in step 2, the description texts (L1 and Ln) for the moments (t1 and tn) are generated, their semantic similarity is computed, and it is judged whether the state of the objects recognized at the two moments has changed. If the similarity is above the threshold, the scene is unchanged; step 1 is executed again and the scene at the next moment continues to be described. If the similarity is below the threshold, the next step is executed.
Step 5: text information is collected from a text library containing traffic scene descriptions and processed to generate a large number of unstructured texts, building a traffic scene text corpus. The texts are then annotated, with the help of a labeling platform, according to the definitions of entities and relations in traffic scenes, creating a triplet dataset suitable for the triplet extraction model; figs. 4 and 5 show the knowledge graph formed by the dataset in this embodiment. The dataset is used for training, validation, and testing, and the resulting triplet extraction model extracts the triplet information from the texts L1 and Ln respectively.
Steps 6-7: the triplet information extracted from the texts L1 and Ln is matched and compared; the head and tail entities of the triplets extracted from text Ln that differ from those extracted from text L1 are input into the Deep Path model, which predicts a relation from each head-tail pair. The predicted relation is verified against the relation extracted from the corresponding text; if it conforms to the prediction, return to step 1 and continue describing the scene at the next moment; if not, execute the next step. As shown in fig. 6, based on the generated text information and the triplet extraction model, triplet information about pedestrians and traffic signs is obtained, such as "human - speed - faster" and "traffic sign - is - STOP".
Steps 8-10: the triplet information in the scene graph at moment tn is taken as input; through text preprocessing, a standardized program question conforming to the large language model is obtained, and the large language model reasons about what the object is from the head-tail entities and relations. Suppose the object is a picture of an elephant on a bus; as shown in fig. 7, in automatic driving this picture might be misrecognized as a real elephant, which could cause danger. In the method of the present invention, however, if the depth information does not conform to the Deep Path model in step 7, the reverse inference module performs reasoning: the triplet information of the elephant at moment tn is converted into a standardized program question: "something looks like an elephant, something is fast, something has a depth of 0, something is on a bus — what is this something?" The reasoning result is that the something is likely an elephant pattern or decoration painted on the bus body, which is the answer that better matches expectations. After the reasoning result is obtained, it is input into subsequent tasks such as planning and decision, and the downstream tasks are executed more safely and reliably.
Example 2
First, perceive image information of traffic scenes at different moments, construct an image description model, generate image description texts for the corresponding moments, and compute the semantic similarity between the texts of consecutive moments; second, use a triplet extraction model to extract triplet information from the image description texts of the two moments, match the triplets for similarity, and verify the differing triplets; finally, generate a standardized program question, reason over it, and send the output result to the downstream planning and decision task.
Example 3
The target false recognition correction method based on the similar humanized generalized perception mechanism is implemented according to the following steps:
step 1, sensing image information of traffic scenes at different times;
step 2, constructing an image description model;
step 3, generating an image description text corresponding to the moment based on the image description model in the step 2;
step 4, computing the semantic similarity of the image description texts at the preceding and following moments obtained in step 3; if the similarity is above a threshold, returning to step 1 and continuing to describe the scene at the next moment; if it is below the threshold, executing the next step;
step 5, extracting triplet information from the image description texts at the preceding and following moments in step 4 by using a triplet extraction model;
step 6, performing similarity matching on the triplet information obtained in step 5 to obtain the triples in which the later scene differs from the earlier scene;
step 7, verifying the differing triples obtained in step 6 with the Deep Path model; if they conform to the prediction, returning to step 1 and continuing to describe the scene at the next moment; if not, executing the next step;
step 8, generating a standardized program question;
step 9, reasoning the question generated in the step 8 by using a large language model;
and step 10, sending the output result obtained in the step 9 to a downstream planning decision task, and providing safer and more reliable data for a subsequent task.
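As one possible reading of step 4, the semantic similarity could be computed with a bag-of-words cosine measure. This is only a stand-in (the patent does not fix a particular similarity function in this passage), but it illustrates the threshold test that decides whether to loop back to step 1:

```python
# Sketch of step 4: compare the description texts of two consecutive moments.
# A bag-of-words cosine similarity stands in for the patent's (unspecified)
# semantic similarity computation; texts and threshold are illustrative.
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Bag-of-words cosine similarity between two description texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

t1 = "a pedestrian walks slowly near a stop sign"
t2 = "a pedestrian walks quickly near a stop sign"
THRESHOLD = 0.95

sim = cosine_similarity(t1, t2)
# If sim >= THRESHOLD the scene is considered unchanged (return to step 1);
# otherwise the texts proceed to triplet extraction (step 5).
scene_changed = sim < THRESHOLD
```

Here the single changed word ("slowly" vs "quickly") pulls the similarity below the threshold, so the pipeline would continue to the triplet-extraction steps.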

Claims (7)

1. The target false recognition correction method based on the similar humanized generalized perception mechanism is characterized in that: firstly, image information of traffic scenes at different moments is perceived, an image description model is constructed, image description texts at the corresponding moments are generated, and the semantic similarity of the image description texts at the preceding and following moments is computed; secondly, a triplet extraction model is used to extract triplet information from the image description texts at the two moments, similarity matching is performed on the triplet information, and the triples that differ are verified; finally, a standardized program question is generated, reasoning is performed on the question, and the output result is sent to the downstream planning and decision-making task.
2. The target misrecognition correction method based on a humanized generalized perception mechanism according to claim 1, which is characterized by comprising the following steps:
step 1, sensing image information of traffic scenes at different times;
step 2, constructing an image description model;
step 3, generating an image description text corresponding to the moment based on the image description model in the step 2;
step 4, computing the semantic similarity of the image description texts at the preceding and following moments obtained in step 3; if the similarity is above a threshold, returning to step 1 and continuing to describe the scene at the next moment; if it is below the threshold, executing the next step;
step 5, extracting triplet information from the image description texts at the preceding and following moments in step 4 by using a triplet extraction model;
step 6, performing similarity matching on the triplet information obtained in step 5 to obtain the triples in which the later scene differs from the earlier scene;
step 7, verifying the differing triples obtained in step 6 with the Deep Path model; if they conform to the prediction, returning to step 1 and continuing to describe the scene at the next moment; if not, executing the next step;
step 8, generating a standardized program question;
step 9, reasoning the question generated in the step 8 by using a large language model;
and step 10, sending the output result obtained in the step 9 to a downstream planning decision task, and providing safer and more reliable data for a subsequent task.
3. The target misrecognition correction method based on a humanized generalized perception mechanism according to claim 1 or 2, wherein the image description model is constructed as follows: a convolutional neural network performs feature extraction and encoding on the scene graph, and a long short-term memory (LSTM) network decodes the features to capture text information about the traffic scene elements; meanwhile, YOLO v8 is used to recognize traffic signs and signal lamps, and the motion state information of the scene elements is obtained through binocular vision and a tracking algorithm; GPS and map software are used to perceive the GPS information of the ego vehicle in the scene and obtain the required macroscopic position and time information.
4. The target misrecognition correction method based on a humanized generalized perception mechanism according to claim 1 or 2, wherein the specific process of generating the image description text is as follows: and obtaining a description text of the traffic scene element, a motion state description text of the scene element and a position and time description text of the vehicle by using the image description model, and obtaining a complete description text at a corresponding moment by using a multi-subject fusion algorithm.
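The multi-subject fusion of claim 4 can be pictured as merging the three description texts (scene elements, motion state, ego-vehicle position and time) into one complete description. The join below is only a placeholder for the fusion algorithm, and all sentence content is invented for illustration:

```python
# Sketch of the "multi-subject fusion" step in claim 4: merge per-subject
# description fragments into one scene description text. A simple ordered join
# stands in for the patent's (unspecified) fusion algorithm.

def fuse_descriptions(element_text, motion_text, position_time_text):
    """Concatenate the three subject texts, normalizing trailing periods."""
    parts = [element_text, motion_text, position_time_text]
    return " ".join(p.strip().rstrip(".") + "." for p in parts if p)

desc = fuse_descriptions(
    "A pedestrian and a STOP sign are ahead",
    "The pedestrian is moving faster",
    "The ego vehicle is at an urban intersection at 10:25",
)
print(desc)
# A pedestrian and a STOP sign are ahead. The pedestrian is moving faster. The ego vehicle is at an urban intersection at 10:25.
```

The fused text is what the later steps treat as the complete image description for that moment.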
5. The target misrecognition correction method based on a humanized generalized perception mechanism according to claim 1 or 2, wherein the process of constructing the triplet extraction model is as follows: and carrying out text analysis processing on the traffic scene text information library to obtain a traffic scene text corpus, then using a labeling platform to finish text labeling to obtain a triplet data set containing entities and relations, and training, verifying and testing the triplet extraction model by using the data set to obtain the triplet extraction model.
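The dataset construction of claim 5 can be sketched as (sentence, triples) pairs that are then split for training, validation, and testing. The sentences, triples, and the 8:1:1 split ratio here are assumptions for illustration; the claim does not specify them:

```python
# Sketch of claim 5's labeled corpus: each entry pairs a traffic-scene sentence
# with its annotated (head, relation, tail) triples, then the corpus is split
# into train/validation/test sets. Ratios and examples are illustrative.
import random

dataset = [
    ("the pedestrian moves faster", [("pedestrian", "speed", "faster")]),
    ("the traffic sign is STOP", [("traffic sign", "is", "STOP")]),
    ("the bus carries an elephant pattern", [("bus", "carries", "elephant pattern")]),
] * 10  # pretend corpus of 30 labeled sentences

random.seed(0)
random.shuffle(dataset)
n = len(dataset)
train = dataset[: int(0.8 * n)]
valid = dataset[int(0.8 * n): int(0.9 * n)]
test_set = dataset[int(0.9 * n):]
print(len(train), len(valid), len(test_set))  # 24 3 3
```

A triplet extraction model would then be trained on `train`, tuned on `valid`, and evaluated on `test_set`.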
6. The target misrecognition correction method based on a humanized generalized perception mechanism according to claim 1 or 2, wherein the specific process of verifying the differing triples is as follows: assuming that P differing triples (H1, H2, ..., HP) are obtained, their P head-tail entity pairs are input into the Deep Path model to infer P relations, and the inferred relations are compared with the relations of the corresponding head-tail entities in the knowledge graph constructed from prior experience.
7. The target misrecognition correction method based on a humanized generalized perception mechanism according to claim 1 or 2, wherein the generation process of the standardized program question is as follows: if the verification does not conform to the prediction, text preprocessing is performed on the differing triples to obtain a standardized program question Q suitable for the large language model.
CN202311389678.9A 2023-10-25 2023-10-25 Target false recognition correction method based on similar humanized generalized perception mechanism Pending CN117648929A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311389678.9A CN117648929A (en) 2023-10-25 2023-10-25 Target false recognition correction method based on similar humanized generalized perception mechanism


Publications (1)

Publication Number Publication Date
CN117648929A 2024-03-05

Family

ID=90046744


Country Status (1)

Country Link
CN (1) CN117648929A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112567374A (en) * 2020-10-21 2021-03-26 华为技术有限公司 Simulated traffic scene file generation method and device
CN114332519A (en) * 2021-12-29 2022-04-12 杭州电子科技大学 Image description generation method based on external triple and abstract relation
US20220215175A1 (en) * 2020-12-24 2022-07-07 Southeast University Place recognition method based on knowledge graph inference


Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
D. WANG, W. FU, J. ZHOU AND Q. SONG: "Occlusion-Aware Motion Planning for Autonomous Driving", IEEE ACCESS, 18 April 2023 (2023-04-18), pages 42809 - 42823 *
W. LI, Z. QU, H. SONG, P. WANG AND B. XUE: "The Traffic Scene Understanding and Prediction Based on Image Captioning", IEEE ACCESS, 24 December 2020 (2020-12-24), pages 1420 - 1427, XP011829487, DOI: 10.1109/ACCESS.2020.3047091 *
LIU WENHUA; LI YIDONG; WANG TAO; WU JUN; JIN YI: "Traffic Scene Recognition Based on High-Dimensional Feature Representation", CHINESE JOURNAL OF INTELLIGENT SCIENCE AND TECHNOLOGY, no. 04, 15 December 2019 (2019-12-15) *
QU SHIRU: "Semantic Description of Traffic Scenes Based on Deep Learning", vol. 36, no. 3, 15 June 2018 (2018-06-15), pages 522 - 527 *
LI SHENGBO: "Key Technologies of Brain-Inspired Learning-Based Decision and Control Systems for Autonomous Driving", AUTOMOTIVE ENGINEERING, vol. 45, no. 09, 5 May 2023 (2023-05-05), pages 1499 - 1515 *
WANG DENGGUI, FU WEIPING, ZHOU JINCAO: "Potential Risk Assessment of View-Occluded Scenarios for Autonomous Vehicles", CONTROL THEORY & APPLICATIONS, vol. 40, no. 6, 30 June 2023 (2023-06-30), pages 1023 - 1033 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination