CN114756700A - Scene library establishing method and device, vehicle, storage medium and chip

Info

Publication number: CN114756700A (application); CN114756700B (granted publication)
Authority: CN (China)
Application number: CN202210686373.3A
Applicant/Assignee: Xiaomi Automobile Technology Co Ltd
Inventors: 张琼 (Zhang Qiong), 杨奎元 (Yang Kuiyuan)
Original language: Chinese (zh)
Legal status: Granted, Active
Prior art keywords: target, scene library, image, scene

Classifications

    • G06F16/51 - Information retrieval of still image data; indexing; data structures therefor; storage structures
    • G06F16/535 - Querying still image data; filtering based on additional data, e.g. user or group profiles
    • G06F16/5846 - Retrieval characterised by using metadata automatically derived from the content, using extracted text


Abstract

The disclosure relates to a scene library establishing method, a scene library establishing device, a vehicle, a storage medium and a chip, and relates to the technical field of automatic driving. The method comprises the following steps: performing language description on multiple frames of images of a video to be processed to obtain a plurality of description texts of the multi-frame images; matching the labels of a scene library with the plurality of description texts of the multi-frame images respectively to obtain target description texts matched with the labels; and storing a first target image corresponding to the target description text into the scene library to obtain a target scene library with the first target image. With this scene library establishing method, the first target image can be stored in the scene library automatically to obtain the target scene library, without manual labeling and storage, which improves the efficiency of establishing the target scene library.

Description

Scene library establishing method and device, vehicle, storage medium and chip
Technical Field
The disclosure relates to the technical field of automatic driving, and in particular relates to a scene library establishing method and device, a vehicle, a storage medium and a chip.
Background
At present, before an unmanned vehicle goes into mass production, its driving scenes need to be fully tested; by testing each driving scene, the unmanned vehicle can be adapted to different driving scenes and achieve safe driving.
Therefore, in order to fully test the driving scenes of the unmanned vehicle, different scene libraries need to be established, and the driving scenes in these scene libraries are used to test the unmanned vehicle. In the related art, when establishing the scene libraries, workers collect videos on the road with a road-collection vehicle, label each video clip in the videos manually to mark the driving scene to which the clip belongs, and finally store the video clips with different labels into different scene libraries, thereby establishing the different scene libraries.
However, since there are many driving scenes on the road, there are many video clips that need to be labeled manually, and the establishment of the scene libraries is slow.
Disclosure of Invention
In order to overcome the problems in the related art, the present disclosure provides a method and an apparatus for establishing a scene library, a vehicle, a storage medium, and a chip.
According to a first aspect of the embodiments of the present disclosure, a method for establishing a scene library is provided, including:
performing language description on multi-frame images of a video to be processed to obtain a plurality of description texts of the multi-frame images;
matching the labels of the scene library with the plurality of description texts of the multi-frame images respectively to obtain target description texts matched with the labels;
and storing a first target image corresponding to the target description text into the scene library to obtain a target scene library with the first target image.
Optionally, matching the tags of the scene library with the multiple description texts of the multiple frames of images respectively to obtain a target description text matched with the tags, including:
determining a target model according to the label of the scene library;
and matching the labels of the scene library with the plurality of description texts of the multi-frame images respectively through the target model to obtain the target description texts matched with the labels.
Optionally, storing a first target image corresponding to the target description text in the scene library to obtain a target scene library with the first target image, where the method includes:
under the condition that the number of the acquired first target images is larger than a first preset number, obtaining a first target video clip according to the timestamps of a plurality of frames of the first target images;
and storing the first target video clip into the scene library to obtain a target scene library with the first target video clip.
Optionally, storing the first target image corresponding to the target description text in the scene library to obtain a target scene library with the first target image, including:
under the condition that the number of the acquired first target images is smaller than a second preset number, obtaining a second target video clip according to the time stamps of a plurality of frames of second target images adjacent to the first target image, or according to the time stamp of the first target image and the time stamps of the plurality of frames of second target images;
and storing the second target video clip into the scene library to obtain a target scene library with the second target video clip.
Optionally, the second target image is obtained by:
respectively carrying out feature comparison on the multi-frame associated images adjacent to the first target image and the first target image; when any one of the associated images is matched with the features of the first target image, continuing to perform feature comparison on the next frame of associated image in the multiple frames of associated images with the first target image until an associated image which is not matched with the features of the first target image appears, and obtaining one or more associated images matched with the features of the first target image;
and taking the one or more associated images as the second target image.
Optionally, performing language description on a multi-frame image of a video to be processed to obtain a plurality of description texts of the multi-frame image, where the language description includes:
performing language description on the multi-frame image in the video to be processed and the multiple sections of voice in the video to be processed to obtain multiple description texts of the multi-frame image;
wherein, the multi-frame images used for describing the same picture correspond to the same voice.
Optionally, the target model is obtained by training through the following steps:
and training the model by using the training description text and the label corresponding to the training description text to obtain the target model.
According to a second aspect of the embodiments of the present disclosure, there is provided a scene library creating apparatus, including:
the description module is configured to perform language description on multi-frame images of a video to be processed to obtain a plurality of description texts of the multi-frame images;
the matching module is configured to match the labels of the scene library with the plurality of description texts of the multi-frame images respectively to obtain target description texts matched with the labels;
and the storage module is configured to store the first target image corresponding to the target description text into the scene library to obtain a target scene library with the first target image.
According to a third aspect of the embodiments of the present disclosure, there is provided a vehicle including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
the steps of the scene library establishment method provided by the first aspect of the present disclosure are implemented.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium on which computer program instructions are stored, which program instructions, when executed by a processor, implement the steps of the scene library establishment method provided by the first aspect of the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a chip comprising a processor and an interface; the processor is configured to read instructions to execute the steps of the scene library establishment method provided by the first aspect of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
By the scene library establishing method, language description can be performed on the multi-frame images of the video to be processed obtained by road collection to obtain a plurality of description texts of the multi-frame images; the description texts are matched one by one with the labels of different scene libraries to determine the target description texts matched with the labels of the different scene libraries; and finally the first target images corresponding to the target description texts can be combined into video clips and stored in the scene libraries to obtain the target scene libraries with the first target images.
In this process, on one hand, the multi-frame images can be automatically matched with a plurality of different scene libraries, and after matching, the video clips formed by the multi-frame images are sorted into different scene libraries to complete the establishment of the target scene library, so that workers do not need to label the video clips or distribute different video clips to different scene libraries, which improves the efficiency of establishing the target scene library; on the other hand, the video clips composed of the images are distributed to different scene libraries by the program, which avoids storing wrong images in the target scene library due to manual distribution.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
FIG. 1 is a flowchart illustrating a scene library establishing method in accordance with an exemplary embodiment.
Fig. 2 is a diagram illustrating one frame of an image in a video to be processed according to an exemplary embodiment.
Fig. 3 is a block diagram illustrating a scene library creation apparatus according to an example embodiment.
Fig. 4 is a schematic functional block diagram of a vehicle (general structure of the vehicle) according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating an apparatus (a general structure of a mobile terminal) according to an example embodiment.
Fig. 6 is a block diagram of an apparatus (general structure of a server) shown according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
It should be noted that all actions of acquiring signals, information or data in the present application are performed in compliance with the data protection laws and policies of the country where the device is located, and with authorization from the owner of the corresponding device.
Fig. 1 is a flowchart illustrating a scene library establishing method according to an exemplary embodiment, where as shown in fig. 1, the scene library establishing method is used in a terminal and includes the following steps.
In step S11, a multi-frame image of the video to be processed is subjected to language description, and multiple description texts of the multi-frame image are obtained.
In this step, a worker can drive a road-collection vehicle and shoot each scene on the road by controlling the movement of the road-collection vehicle, so as to obtain the video to be processed; the video to be processed is then divided into multiple frames of images, the multi-frame images in the video to be processed are identified by using natural language processing (NLP) technology, and the multi-frame images are described in natural language to obtain a plurality of description texts of the multi-frame images.
The video to be processed comprises a plurality of frames of images, and the plurality of frames of images can be each frame of image of the video to be processed, or a plurality of frames of images acquired every other frame or frames in the video to be processed. And each frame of the multiple frames of images respectively comprises one or more driving scenes.
For example, please refer to fig. 2, in which one frame of image of the video to be processed includes a "T-shaped intersection" driving scene, a "sparse people flow" driving scene, a "single lane" driving scene, a "sunny" driving scene, and so on. As can be seen, each frame of image in the video to be processed includes one or more driving scenes.
One frame of image corresponds to one description text, and one description text can describe various different driving scenes.
For example, please refer to a frame of image shown in fig. 2, the frame of image may be recognized by NLP technology to obtain a description text of "there is a t-shaped intersection in the frame of image, there are flower beds and trees on two sides of the intersection, the t-shaped intersection is a single lane, the flow of people on the single lane is sparse, the vehicles are sparse, the frame of image is shot in daytime and the weather is clear", and the description text has a plurality of driving scenes such as "the t-shaped intersection", "the single lane", "the weather is clear".
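For illustration only, a minimal Python sketch of this step is given below. The describe_frame callable and all names are hypothetical placeholders: the disclosure does not specify a concrete captioning or NLP model, only that each sampled frame of the video to be processed is turned into a description text.

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class FrameDescription:
        frame_index: int   # position of the frame among the sampled frames
        timestamp_ms: int  # timestamp of the frame within the video to be processed
        text: str          # description text produced for the frame

    def describe_video(frames: List[object],
                       describe_frame: Callable[[object], str],
                       frame_interval_ms: int = 40) -> List[FrameDescription]:
        """Produce one description text per sampled frame (a sketch of step S11).

        describe_frame stands in for whatever NLP / captioning model is used; it
        should return a sentence such as "there is a t-shaped intersection in the
        frame of image, the weather is clear".
        """
        return [
            FrameDescription(frame_index=i,
                             timestamp_ms=i * frame_interval_ms,
                             text=describe_frame(frame))
            for i, frame in enumerate(frames)
        ]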
In step S12, matching the labels of the scene library with the multiple description texts of the multiple frame images, respectively, to obtain target description texts matched with the labels.
In this step, scene libraries of the same type may include a plurality of scene libraries for different scenes, and the scene libraries for different scenes have different labels. Therefore, scene libraries for a plurality of different scenes can be established in advance, and labels can be assigned to the scene libraries of the different scenes.
The label of the scene library is used for identifying the scene of one or more frames of images stored in the scene library.
Exemplarily, the scene libraries of the intersection type include a t-shaped intersection scene library, a crossroad scene library, a circular intersection scene library, and the like; the scene libraries of the vehicle-density type include a sparse-vehicle scene library, a dense-vehicle scene library, and the like; the scene libraries of the weather type include a daytime scene library, a night scene library, a raining scene library, a snowing scene library, a sunny scene library, and the like; and the scene libraries of the lane type include a single-lane scene library, a multi-lane scene library, and the like.
Then, when the tags of different scene libraries are established, a tag "t-junction" can be established for the t-junction scene library, and a tag "ring-junction" can be established for the ring-junction scene library.
When the labels of the scene library are matched with the description texts, each label is respectively matched with the description texts to obtain the target description text corresponding to each label. The target description text corresponding to one label can be one or more.
Illustratively, the plurality of descriptive texts includes: description text 1, description text 2, description text 3, description text 4, and description text 5.
When the labels "intersection" of the "intersection scene library" are respectively matched with the description texts 1 to 5, descriptions having "intersection" in the description texts 1, 2 and 3 are determined, and therefore the description texts 1, 2 and 3 are used as target description texts corresponding to the "intersection".
When the labels "sunny day" of the "sunny day scene library" are respectively matched with the description texts 1 to 5, it is determined that the description texts 3, 4 and 5 have descriptions of "sunny day", and therefore the description texts 3, 4 and 5 are used as target description texts corresponding to the "sunny day".
A target description text corresponding to a label is a description text in which the label itself, or a description of the label, explicitly appears.
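A minimal sketch of this matching step is shown below, assuming for simplicity that a description text matches a label when the label appears in it verbatim; in the disclosure the matching is performed by the trained target model described later, so the substring test is only a stand-in.

    from typing import Dict, List

    def match_label(label: str, descriptions: List[str]) -> List[int]:
        """Return the indices of description texts in which the label explicitly appears."""
        return [i for i, text in enumerate(descriptions) if label in text]

    descriptions = [
        "there is a crossroad in the frame, traffic is sparse",   # description text 1
        "crossroad with pedestrians, cloudy weather",             # description text 2
        "crossroad on a sunny day, single lane",                  # description text 3
        "multi-lane road on a sunny day",                         # description text 4
        "ring-shaped intersection on a sunny day",                # description text 5
    ]

    targets: Dict[str, List[int]] = {
        "crossroad": match_label("crossroad", descriptions),  # -> texts 1, 2, 3
        "sunny day": match_label("sunny day", descriptions),  # -> texts 3, 4, 5
    }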
In step S13, the first target image corresponding to the target description text is stored in the scene library, so as to obtain a target scene library with the first target image.
In this step, each target description text is used to describe the corresponding first target image, so after the target description text corresponding to the tag is determined, the first target image corresponding to the target description text may be determined according to the corresponding relationship between the target description text and the first target image, and then the first target image is generated into a video clip to be stored in the target scene library corresponding to the tag.
The number of the first target images corresponding to the tags may be one or multiple, and when the number of the first target images is one, the video clips stored in the target scene library are one frame of image; when the number of the first target images is multiple, the video clips stored in the target scene library are multi-frame images.
The target scene library refers to a scene library storing a first target image or a scene library storing a plurality of video clips describing the same scene, so that the target scene library can provide support for automatically identifying the driving scene of the unmanned vehicle.
The one or more frames of images in the same target scene library are all used for showing the same scene; and since one frame of image may contain several different scenes, the same frame of image can be stored in target scene libraries of different scenes.
Illustratively, after the tags of the day scene library, the night scene library and the raining scene library are respectively "day", "night" and "raining", and the three tags are respectively matched with the multi-frame images, it is determined that the target description text corresponding to "day" has description texts 1, 2, 3, 4, 5, 6, 7, the target description text corresponding to "night" has description texts 3, 4, 5, 6, 7, and the target description text corresponding to "raining" has description texts 2, 3, 4, 5. Therefore, 7 frames of images described in the description texts 1, 2, 3, 4, 5, 6, 7 are stored in the daytime scene library, 5 frames of images described in the description texts 3, 4, 5, 6, 7 are stored in the night scene library, and 4 frames of images described in the description texts 2, 3, 4, 5 are stored in the rainy scene library.
Therefore, multiple frames of different images of the same scene are stored in the same scene library, and the multiple frames of different images can form a video clip.
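The grouping described in the daytime / night / raining example can be sketched as follows; frame_descriptions and the substring test are illustrative placeholders for the description texts and the target-model matching of step S12.

    from collections import defaultdict
    from typing import Dict, Iterable, List, Tuple

    def build_scene_libraries(frame_descriptions: Iterable[Tuple[int, str]],
                              labels: Iterable[str]) -> Dict[str, List[int]]:
        """Store every first target image into the scene library of each matching label.

        frame_descriptions: (frame_id, description_text) pairs, one per frame.
        labels: scene-library labels such as "daytime", "night" or "raining".
        A frame whose description matches several labels is stored in every
        matching scene library, as in the example above.
        """
        libraries: Dict[str, List[int]] = defaultdict(list)
        for frame_id, text in frame_descriptions:
            for label in labels:
                if label in text:  # stand-in for the target-model match of step S12
                    libraries[label].append(frame_id)
        return libraries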
By the scene library establishing method, language description can be performed on the multi-frame images of the video to be processed obtained by road collection to obtain a plurality of description texts of the multi-frame images; the description texts are matched one by one with the labels of different scene libraries to determine the target description texts matched with the labels of the different scene libraries; and finally the first target images corresponding to the target description texts can be combined into video clips and stored in the scene libraries to obtain the target scene library with the first target images.
In this process, on one hand, the multi-frame images can be automatically matched with a plurality of different scene libraries, and after matching, the video clips formed by the multi-frame images are sorted into different scene libraries to complete the establishment of the target scene library, so that workers do not need to label the video clips or distribute different video clips to different scene libraries, which improves the efficiency of establishing the target scene library; on the other hand, the video clips composed of the images are distributed to different scene libraries by the program, which avoids storing wrong images in the target scene library due to manual distribution.
In a possible implementation manner, the tags of the scene library can be respectively matched with a plurality of description texts of a plurality of frames of images through the target model, and the method specifically comprises the following steps:
in step S21, a target model is obtained according to the training description text and the label corresponding to the training description text.
In this step, different target models correspond to different types of scene libraries, and the same target model may match tags of multiple scene libraries of different scenes in the same type of scene library with multiple description texts to obtain a target description text corresponding to the tag.
Exemplarily, labels of scene libraries such as a T-shaped intersection scene library, an annular intersection scene library and the like in the scene library of the type of the intersection can be matched with a plurality of description texts through an intersection identification model; tags of scene libraries such as a day scene library, a night scene library, a rain scene library, a snow scene library, a sunny scene library and the like in the scene library of the type of weather can be matched with a plurality of description texts through a weather identification model.
Then, when training the model, for different types of models, different training description texts and labels may be used to train the models, so as to obtain different types of target models.
Specifically, when training a model of the same type, the model may be trained using training description texts of a plurality of different scenarios of the same type.
Illustratively, the training description text 'there is a t-shaped intersection in the frame image' and the label 't-shaped intersection' can be used to train the intersection recognition model; the training description text 'there is a ring-shaped intersection in the frame image' and the label 'ring-shaped intersection' can also be used to train the intersection recognition model; and the training description text 'there is a crossroad in the frame image' and the label 'crossroad' can likewise be used to train the intersection recognition model.
When the model is trained, a description text containing the same vocabulary as the label may be used as the training description text, or a description text that merely describes the label may be used as the training description text.
For example, for the label "sunny day", the description text "the frame image shows a sunny day", which uses the same vocabulary as the label, may be used to train the model; the description text "in the frame image the weather is relatively clear", which only describes the label "sunny day", may also be used to train the model. The disclosure is not limited herein.
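As a sketch only, the target model can be thought of as a small text classifier trained on (training description text, label) pairs of one scene-library type; the disclosure does not fix the model architecture, so the scikit-learn pipeline and the training pairs below are assumptions made for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Hypothetical training pairs for the intersection-type target model.
    train_texts = [
        "there is a t-shaped intersection in the frame image",
        "there is a ring-shaped intersection in the frame image",
        "there is a crossroad in the frame image",
    ]
    train_labels = ["t-shaped intersection", "ring-shaped intersection", "crossroad"]

    intersection_model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    intersection_model.fit(train_texts, train_labels)

    # A separate model of the same form would be trained for the weather type,
    # the lane type, and so on.
    print(intersection_model.predict(["a t-shaped intersection with trees on both sides"]))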
In step S22, a target model is determined according to the labels of the scene library.
In this step, the broader scene-library type to which a scene library belongs may be determined according to the labels of the scene libraries of different scenes, and then the target model may be determined according to the correspondence between that scene-library type and the target model.
For example, when the label of the scene library is "crossroad", it may be determined that the crossroad scene library belongs to the intersection-type scene library, and then the target model is determined to be the intersection recognition model according to the correspondence between the intersection-type scene library and the intersection recognition model.
In step S23, matching the labels of the scene library with the description texts of the multi-frame images respectively through the target model, so as to obtain target description texts matched with the labels.
In this step, for the tags of different scene libraries, different target models may be used to match the tags with the description text, so as to obtain the target description text matched with the tags.
Specifically, when the label of the scene library is a cross intersection, an intersection identification model can be adopted for matching; when the label of the scene library is 'sunny', a weather identification model can be adopted for matching.
Through different types of target models, the labels of the scene libraries in different types of scene libraries are matched, and the target description texts corresponding to the labels can be determined in a targeted manner, so that the matching process is more accurate, and the obtained target description texts are more accurate.
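A possible sketch of steps S22 and S23 is given below; the label-to-type table and the model objects are hypothetical, and only illustrate that each label resolves to the target model trained for its broader scene-library type before matching.

    from typing import Dict

    # Hypothetical mapping from scene-library label to scene-library type.
    LABEL_TO_TYPE: Dict[str, str] = {
        "t-shaped intersection": "intersection", "crossroad": "intersection",
        "ring-shaped intersection": "intersection",
        "daytime": "weather", "night": "weather", "raining": "weather", "sunny day": "weather",
        "single lane": "lane", "multi-lane": "lane",
    }

    def select_target_model(label: str, models: Dict[str, object]) -> object:
        """Step S22: determine the target model from the label of the scene library."""
        scene_type = LABEL_TO_TYPE[label]  # e.g. "crossroad" -> "intersection"
        return models[scene_type]          # e.g. the intersection recognition model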
In a possible embodiment, when the NLP technology makes identification errors, the number of identified description texts is reduced. When the number of description texts is reduced, the number of target description texts obtained by matching with the labels is also reduced, which in turn results in a smaller number of first target images obtained from the smaller number of target description texts, and a video clip composed of this smaller number of first target images may be inaccurate. In order to ensure the accuracy of the video clips stored in the target scene library, the present disclosure further includes the following steps:
in step S31, when the number of the acquired first target images is greater than a first preset number, obtaining a first target video clip according to the timestamps of a plurality of frames of the first target images; and storing the first target video clip into the scene library to obtain a target scene library with the first target video clip.
In this step, in the case where the NLP technology identifies an error, the number of the identified description texts is small, and naturally, the number of the target description texts and the first target images matched by the small number of description texts is also small. Therefore, in the case that the number of the acquired first target images is greater than the first preset number, it indicates that the number of the first target images recognized by the NLP technology is large, and the NLP technology is accurate in recognition. At this time, the first target video segment may be obtained according to the time stamp of the first target image of the plurality of frames.
The video to be processed is a video capable of being continuously played, the change among all images in the video is smooth, under the condition that the plurality of description texts are identified to be matched with the labels of the scene library through an NLP technology, the target images corresponding to the plurality of description texts are basically images which are arranged according to a time sequence, and the video clips formed by the plurality of target images are basically unchanged. Therefore, in the case that the number of the acquired first target images is larger than the first preset number, the plurality of first target images can be combined into the first target video clip to be stored in the target scene library as a material of a driving scene.
For example, suppose a video to be processed lasts 3 seconds (s) and plays 24 frames of images per second. If the picture displayed within 1 s is essentially one picture, the 24 frames of images within that second are basically unchanged, and the picture of a video clip composed of these images is basically unchanged when played. Therefore, when the number of the acquired first target images is greater than the first preset number, it indicates that the acquired multi-frame first target images are multiple frames of images which represent the same picture and are arranged in time order; at this time, the multi-frame first target images can be combined into a first target video clip and stored in the target scene library.
When multiple frames of first target images are combined into a first target video clip, the first target video clip can be intercepted from the video to be processed according to the timestamp with the minimum numerical value and the timestamp with the maximum numerical value in multiple timestamps of the multiple frames of first target images.
Illustratively, the timestamps of the first target images of multiple frames are 25ms, 26ms, 27ms, 28ms and 29ms respectively, the starting timestamp of the to-be-processed video playing is 0, and the ending timestamp is 330ms, so that the 25ms to 29ms video segments can be intercepted from the to-be-processed video segments from 0 to 330ms as the first target video segments.
The first target video clip stored in the target scene library is a video clip for playing the same picture, and the same driving scene in the multi-frame image in the first target video clip is basically kept unchanged.
The first preset number may be 5 or 6, and is set according to actual conditions.
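Step S31 can be sketched as below; the function and the preset value are illustrative only, but they follow the timestamp example above, where the clip spans the smallest to the largest timestamp of the matched first target images.

    from typing import List, Optional, Tuple

    def first_target_clip(timestamps_ms: List[int],
                          first_preset_number: int = 5) -> Optional[Tuple[int, int]]:
        """Step S31: derive the first target video clip from the first-target-image timestamps.

        When more than the first preset number of first target images were matched,
        the clip spans the smallest to the largest timestamp and is cut out of the
        video to be processed; otherwise step S32 is used instead.
        """
        if len(timestamps_ms) <= first_preset_number:
            return None
        return min(timestamps_ms), max(timestamps_ms)

    # e.g. timestamps 25, 26, 27, 28, 29 ms -> clip from 25 ms to 29 ms
    print(first_target_clip([25, 26, 27, 28, 29], first_preset_number=4))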
In step S32, when the number of the acquired first target images is smaller than a second preset number, a second target video clip is obtained according to the timestamps of multiple frames of second target images adjacent to the first target image, or according to the timestamps of the first target image and the timestamps of the multiple frames of second target images; and storing the second target video clip into the scene library to obtain a target scene library with the second target video clip.
In this step, when the number of the matched first target images is smaller than the second preset number, it indicates that the number of the first target images identified by the NLP technique is small, and the NLP technique may identify an error, which results in that the number of the first target images obtained by matching is small. In order to guarantee the accuracy of the video clips stored in the target scene library, the first target image and the multiple frames of associated images adjacent to the first target image can be subjected to feature comparison respectively, and one or more associated images are used as second target images; and then obtaining a second target video clip according to the timestamps of the plurality of second target images or the timestamp of the first target image and the timestamps of the plurality of second target images, and storing the second target video clip into a scene library to obtain a target scene library.
And when the characteristics of any one of the associated images are matched with the characteristics of the first target image, continuously performing characteristic comparison on the next frame of associated image in the multiple frames of associated images with the first target image until the associated image which is not matched with the characteristics of the first target image appears, and obtaining one or more associated images matched with the characteristics of the first target image.
Illustratively, the video to be processed plays images A, B, C, D, E, F, G and H, whose timestamps are 2 ms, 4 ms, 6 ms, 8 ms, 10 ms, 12 ms, 14 ms and 16 ms respectively. If the first target image is image D, image D may first be feature-compared with images C and E; if both match, image D is then compared with images B and F; if image B matches but image F does not, image D is further compared, in the earlier direction, with image A; and if image A does not match, the comparison stops, so the second target images are images B, C and E.
The second target image is a multi-frame associated image adjacent to the first target image, and the features of the second target image and the first target image are matched, so that when the first target image and the second target image are combined into the second target video clip, the second target video clip can also be used for playing one picture.
The matching between the first target image and the second target image means that the two images are similar, so that when the first target image and the second target image are combined into the second target video clip, the played picture is basically unchanged.
Wherein, in the case that the second target image is an associated image located before the timestamp of the first target image, what is contained in the second target video segment is the first target image and the second target image located before the timestamp of the first target image. At this time, the second target video segment may be derived from the timestamp of the first target image and the timestamps of the plurality of associated images that precede the timestamp of the first target image.
Specifically, the timestamp with the minimum value may be determined from a plurality of timestamps located before the timestamp of the first target image, and the timestamp with the minimum value may be used as the start time of the second target video segment; and taking the time stamp of the first target image as the end time of the second target video clip, and intercepting the second target video clip from the video to be processed.
Wherein, in the case that the second target image is an associated image located after the timestamp of the first target image, what is contained in the second target video segment is the first target image and the second target image located after the timestamp of the first target image. The second target video segment may then be derived from the timestamp of the first target image and the timestamps of the plurality of associated images that are located after the timestamp of the first target image.
Specifically, the timestamp with the largest value may be determined from a plurality of timestamps located after the timestamp of the first target image, and the timestamp with the largest value may be used as the end time of the second target video segment; and taking the time stamp of the first target image as the starting time of the second target video clip, and intercepting the second target video clip from the video to be processed.
When the second target image includes an associated image before the timestamp of the first target image and an associated image after the timestamp of the first target image, the second target video clip may be obtained according to the timestamp of the second target image.
Specifically, the timestamp with the smallest value can be determined from the timestamps of the plurality of second target images and used as the start time of the second target video clip; the timestamp with the largest value is determined from the timestamps of the plurality of second target images and used as the end time of the second target video clip; and the second target video clip is cut out of the video to be processed.
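One possible reading of step S32 is sketched below: the first target image is compared with its neighbours frame by frame in both directions, the comparison stops in a direction as soon as an associated image no longer matches, and the clip then spans the smallest to the largest timestamp of the first and second target images. The function names and the features_match callable are assumptions, not part of the disclosure.

    from typing import Callable, List, Tuple

    def find_second_target_images(frames: List[object], first_idx: int,
                                  features_match: Callable[[object, object], bool]) -> List[int]:
        """Collect second target images around the first target image at first_idx."""
        matched: List[int] = []
        i = first_idx - 1                       # expand towards earlier frames
        while i >= 0 and features_match(frames[first_idx], frames[i]):
            matched.append(i)
            i -= 1
        j = first_idx + 1                       # expand towards later frames
        while j < len(frames) and features_match(frames[first_idx], frames[j]):
            matched.append(j)
            j += 1
        return sorted(matched)

    def second_target_clip(timestamps_ms: List[int], first_idx: int,
                           second_idxs: List[int]) -> Tuple[int, int]:
        """Start and end timestamps of the second target video clip."""
        idxs = second_idxs + [first_idx]
        return min(timestamps_ms[i] for i in idxs), max(timestamps_ms[i] for i in idxs)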
When the first target image is feature-compared with the multiple adjacent frames of associated images, the following manners may be adopted to determine whether an associated image matches the first target image.
Mode 1: the pixels at the same position of the first target image and the second target image may be subtracted to obtain a sum of pixel differences between the two frames of images, and if the sum of pixel differences is smaller than a first preset difference value, it is determined that the first target image is similar to the second target image, and the features of the first target image and the second target image are matched.
Mode 2: the histogram of the first target image may be subtracted from the histogram of the second target image to obtain a difference between the histograms of the two frames of images, and if the difference between the histograms is smaller than a second preset difference value, the feature matching between the first target image and the second target image is determined.
Mode 3: the first target image and the second target image may be compared with each other, and if the objects at the same position are the same, the feature matching between the first target image and the second target image is determined.
The second preset number is less than or equal to the first preset number; for example, the second preset number may be 1 or 2. The first preset difference value is different from the second preset difference value.
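The three comparison manners can be sketched as follows; the threshold values are placeholders, since the first and second preset difference values are left to actual conditions in the disclosure, and the object comparison of mode 3 assumes some upstream detector that yields (position, class) pairs.

    import numpy as np

    def match_by_pixel_diff(img_a: np.ndarray, img_b: np.ndarray,
                            first_preset_diff: float = 1e6) -> bool:
        """Mode 1: sum of per-pixel differences below the first preset difference value."""
        diff = np.abs(img_a.astype(np.int32) - img_b.astype(np.int32)).sum()
        return diff < first_preset_diff

    def match_by_histogram(img_a: np.ndarray, img_b: np.ndarray,
                           second_preset_diff: float = 5e4, bins: int = 64) -> bool:
        """Mode 2: difference between grey-level histograms below the second preset value."""
        hist_a, _ = np.histogram(img_a, bins=bins, range=(0, 255))
        hist_b, _ = np.histogram(img_b, bins=bins, range=(0, 255))
        return np.abs(hist_a - hist_b).sum() < second_preset_diff

    def match_by_objects(objects_a, objects_b) -> bool:
        """Mode 3: the objects detected at the same positions are the same."""
        return objects_a == objects_b  # e.g. lists of (position, class) pairs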
According to the scene library establishing method, after the target description texts corresponding to the labels of the scene library are matched through the target model, the first target images can be determined, and under the condition that the number of the first target images is smaller than the second preset number, the first target images and the adjacent associated images are compared one by one to obtain the second target images matched with the first target images.
In the process, even if the number of the description texts identified by the NLP technology is small, and the target description texts matched by the target model are small, the second target image which represents the same scene and the same picture together with the first target image can be determined by comparing the first target image with the adjacent associated images one by one, so that the video clip obtained by combining the first target image and the second target image can be more accurate, and the material stored in the scene library is also accurate.
In one possible embodiment, in order to improve the accuracy of the recognized description text, the present disclosure further includes: and performing language description on the multi-frame image in the video to be processed and the multiple sections of voice in the video to be processed to obtain multiple description texts of the multi-frame image.
Specifically, when the road-collection worker shoots the video to be processed, the worker can explain each scene appearing in the video to be processed; thus, the video to be processed includes a plurality of sections of speech explaining the video to be processed.
Wherein, the multi-frame images used for describing the same picture correspond to the same voice.
For example, as shown in fig. 2, suppose 15 frames of images are all used to describe the picture shown in fig. 2, and the picture shown in fig. 2 includes driving scenes such as a "t-shaped intersection" driving scene, a "sunny" driving scene and a "single lane" driving scene. Then, when shooting the picture shown in fig. 2, the road-collection worker can explain that "it is sunny now, there is no vehicle on the road surface, the road is a single lane, and the intersection is a t-shaped intersection". This piece of speech covers the above 15 frames of images, and both the speech and the 15 frames of images are used to describe multiple scenes under the same picture.
When the NLP technology is used for identifying the video to be processed, not only the image but also the voice in the video to be processed are identified, and the voice and the image are identified together as the description text.
Specifically, the NLP technique may determine the description text from each of the multiple frames of images and the speech covering the multiple frames of images.
For example, taking one frame of image, if the description text obtained from the image is "there is a t-shaped intersection in the frame of image, the frame of image is shot in the daytime and the weather is clear", and the description text obtained from the speech is "it is sunny now, there is no vehicle on the road, the road is a single lane, and the intersection is a t-shaped intersection", then the speech description obviously contains more content than the image description. The extra content can be used to complement the description that was not obtained from image recognition, so as to obtain a richer description text representing the frame of image.
By the scene library establishing method, the description text of the image can be obtained according to the voice of the video to be processed and the image of the video to be processed, and therefore the description text of the image can be complemented through the description text of the voice so as to obtain more accurate description texts.
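A rough sketch of complementing the image description with the covering speech is given below; the clause-level merge is an assumption made for illustration, since the disclosure only requires that the content recognized from the speech supplements what image recognition missed.

    from typing import List

    def merge_descriptions(image_text: str, speech_text: str) -> str:
        """Complement a frame's image description with the speech covering that frame."""
        image_clauses: List[str] = [c.strip() for c in image_text.split(",") if c.strip()]
        speech_clauses: List[str] = [c.strip() for c in speech_text.split(",") if c.strip()]
        extra = [c for c in speech_clauses if c not in image_clauses]
        return ", ".join(image_clauses + extra)

    print(merge_descriptions(
        "there is a t-shaped intersection in the frame, the weather is clear",
        "it is sunny now, there is no vehicle on the road, the road is a single lane"))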
Fig. 3 is a block diagram illustrating a scene library creation apparatus according to an example embodiment. Referring to fig. 3, the apparatus 120 includes a description module 121, a matching module 122, and a storage module 123.
The description module 121 is configured to perform language description on multiple frames of images of a video to be processed to obtain multiple description texts of the multiple frames of images;
the matching module 122 is configured to match the labels of the scene library with the plurality of description texts of the multi-frame images respectively to obtain target description texts matched with the labels;
the storage module 123 is configured to store the first target image corresponding to the target description text in the scene library, so as to obtain a target scene library with the first target image.
Optionally, the matching module 122 comprises:
an object model determination module configured to determine an object model from tags of the scene library;
and the first matching module is configured to match the labels of the scene library with the plurality of description texts of the multi-frame images respectively through the target model to obtain the target description texts matched with the labels.
Optionally, the storage module 123 includes:
the first acquisition module is configured to obtain a first target video clip according to the timestamps of a plurality of frames of first target images under the condition that the number of acquired first target images is greater than a first preset number;
and the first storage module is configured to store the first target video clip into the scene library, resulting in a target scene library having the first target video clip.
Optionally, the storage module 123 includes:
the second acquisition module is configured to obtain a second target video clip according to the time stamp of multiple frames of second target images adjacent to the first target image or according to the time stamp of the first target image and the time stamps of the multiple frames of second target images under the condition that the number of the acquired first target images is smaller than a second preset number;
a second storage module configured to store the second target video segment into the scene library, resulting in a target scene library with the second target video segment.
Optionally, the apparatus 120 further comprises:
the characteristic comparison module is configured to perform characteristic comparison on a plurality of frames of associated images adjacent to the first target image and the first target image respectively; when any one of the associated images is matched with the features of the first target image, continuing to perform feature comparison on the next frame of associated image in the multiple frames of associated images with the first target image until an associated image which is not matched with the features of the first target image appears, and obtaining one or more associated images matched with the features of the first target image;
and the second target image determination module is configured to determine the one or more associated images as the second target image.
Optionally, the description module 121 includes:
the first description module is configured to perform language description on a multi-frame image in the video to be processed and a plurality of sections of voices in the video to be processed to obtain a plurality of description texts of the multi-frame image;
wherein, the multi-frame images used for describing the same picture correspond to the same voice.
Optionally, the apparatus 120 further comprises:
and the training module is configured to train the model by using the training description text and the label corresponding to the training description text to obtain the target model.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring to fig. 4, fig. 4 is a functional block diagram of a vehicle 600 according to an exemplary embodiment. The vehicle 600 may be configured in a fully or partially autonomous driving mode. For example, the vehicle 600 may acquire environmental information of its surroundings through the sensing system 620 and derive an automatic driving strategy based on an analysis of the surrounding environmental information to implement full automatic driving, or present the analysis result to the user to implement partial automatic driving.
The vehicle 600 may include various subsystems such as an infotainment system 610, a perception system 620, a decision control system 630, a drive system 640, and a computing platform 650. Alternatively, vehicle 600 may include more or fewer subsystems, and each subsystem may include multiple components. In addition, each of the sub-systems and components of the vehicle 600 may be interconnected by wire or wirelessly.
In some embodiments, the infotainment system 610 may include a communication system 611, an entertainment system 612, and a navigation system 613.
The communication system 611 may comprise a wireless communication system that may communicate wirelessly with one or more devices, either directly or via a communication network. For example, the wireless communication system may use 3G cellular communication such as CDMA, EVDO or GSM/GPRS, 4G cellular communication such as LTE, or 5G cellular communication. The wireless communication system may communicate with a wireless local area network (WLAN) using WiFi. In some embodiments, the wireless communication system may communicate directly with a device using an infrared link, Bluetooth or ZigBee. Other wireless protocols, such as various vehicular communication systems, may also be used; for example, the wireless communication system may include one or more dedicated short range communications (DSRC) devices, which may include public and/or private data communications between vehicles and/or roadside stations.
The entertainment system 612 may include a display device, a microphone and a speaker. Based on the entertainment system, a user may listen to the radio or play music in the car; alternatively, a mobile phone may communicate with the vehicle and project its screen onto the display device. The display device may support touch control, and the user may operate it by touching the screen.
In some cases, the voice signal of the user may be captured by the microphone, and certain control of the vehicle 600 by the user, such as adjusting the temperature in the vehicle, may be implemented according to the analysis of the voice signal of the user. In other cases, music may be played to the user through the speaker.
The navigation system 613 may include a map service provided by a map provider to provide navigation of a route for the vehicle 600, and the navigation system 613 may be used in conjunction with a global positioning system 621 and an inertial measurement unit 622 of the vehicle. The map service provided by the map supplier can be a two-dimensional map or a high-precision map.
The sensing system 620 may include several sensors that sense information about the environment surrounding the vehicle 600. For example, the sensing system 620 may include a global positioning system 621 (the global positioning system may be a GPS system, a beidou system or other positioning system), an Inertial Measurement Unit (IMU) 622, a laser radar 623, a millimeter wave radar 624, an ultrasonic radar 625, and a camera 626. The sensing system 620 may also include sensors of internal systems of the monitored vehicle 600 (e.g., an in-vehicle air quality monitor, a fuel gauge, an oil temperature gauge, etc.). Sensor data from one or more of these sensors may be used to detect the object and its corresponding characteristics (position, shape, orientation, velocity, etc.). Such detection and identification is a critical function of the safe operation of the vehicle 600.
Global positioning system 621 is used to estimate the geographic location of vehicle 600.
The inertial measurement unit 622 is used to sense a pose change of the vehicle 600 based on the inertial acceleration. In some embodiments, inertial measurement unit 622 may be a combination of accelerometers and gyroscopes.
Lidar 623 utilizes laser light to sense objects in the environment in which vehicle 600 is located. In some embodiments, lidar 623 may include one or more laser sources, laser scanners, and one or more detectors, among other system components.
The millimeter-wave radar 624 utilizes radio signals to sense objects within the surrounding environment of the vehicle 600. In some embodiments, in addition to sensing objects, the millimeter-wave radar 624 may also be used to sense the speed and/or heading of objects.
The ultrasonic radar 625 may sense objects around the vehicle 600 using ultrasonic signals.
The camera 626 is used to capture image information of the surroundings of the vehicle 600. The image capturing device 626 may include a monocular camera, a binocular camera, a structured light camera, a panoramic camera, and the like, and the image information acquired by the image capturing device 626 may include still images or video stream information.
Decision control system 630 includes a computing system 631 that makes analytical decisions based on information acquired by sensing system 620, decision control system 630 further includes a vehicle control unit 632 that controls the powertrain of vehicle 600, and a steering system 633, throttle 634, and brake system 635 for controlling vehicle 600.
The computing system 631 may be operable to process and analyze the various information acquired by the perception system 620 in order to identify objects, and/or features in the environment surrounding the vehicle 600. The targets may include pedestrians or animals, and the objects and/or features may include traffic signals, road boundaries, and obstacles. The computing system 631 may use object recognition algorithms, Structure From Motion (SFM) algorithms, video tracking, and the like. In some embodiments, the computing system 631 may be used to map an environment, track objects, estimate the speed of objects, and so on. The computing system 631 may analyze the various information obtained and derive a control strategy for the vehicle.
The vehicle controller 632 may be used to perform coordinated control on the power battery and the engine 641 of the vehicle to improve the power performance of the vehicle 600.
Steering system 633 is operable to adjust the heading of vehicle 600. For example, in one embodiment, it may be a steering wheel system.
The throttle 634 is used to control the operating speed of the engine 641 and thus the speed of the vehicle 600.
The braking system 635 is used to control the deceleration of the vehicle 600. The braking system 635 may use friction to slow the wheel 644. In some embodiments, the braking system 635 may convert the kinetic energy of the wheels 644 into electrical current. The braking system 635 may also take other forms to slow the rotational speed of the wheels 644 to control the speed of the vehicle 600.
The drive system 640 may include components that provide powered motion to the vehicle 600. In one embodiment, the drive system 640 may include an engine 641, an energy source 642, a transmission 643, and wheels 644. The engine 641 may be an internal combustion engine, an electric motor, an air compression engine, or other types of engine combinations, such as a hybrid engine consisting of a gasoline engine and an electric motor, a hybrid engine consisting of an internal combustion engine and an air compression engine. The engine 641 converts the energy source 642 into mechanical energy.
Examples of energy sources 642 include gasoline, diesel, other petroleum-based fuels, propane, other compressed gas-based fuels, ethanol, solar panels, batteries, and other sources of electrical power. The energy source 642 may also provide energy to other systems of the vehicle 600.
The transmission 643 may transmit mechanical power from the engine 641 to the wheels 644. The transmission 643 may include a gearbox, a differential, and a drive shaft. In one embodiment, the transmission 643 may also include other components, such as clutches. Wherein the drive shaft may include one or more axles that may be coupled to one or more wheels 644.
Some or all of the functionality of the vehicle 600 is controlled by the computing platform 650. The computing platform 650 can include at least one first processor 651, and the first processor 651 can execute instructions 653 stored in a non-transitory computer-readable medium, such as the first memory 652. In some embodiments, the computing platform 650 may also be a plurality of computing devices that control individual components or subsystems of the vehicle 600 in a distributed manner.
The first processor 651 may be any conventional processor, such as a commercially available CPU. Alternatively, the first processor 651 may also include a processor such as a Graphics Processing Unit (GPU), a Field Programmable Gate Array (FPGA), a System On Chip (SOC), an Application Specific Integrated Circuit (ASIC), or a combination thereof. Although fig. 4 functionally illustrates a processor, memory, and other elements of a computer in the same block, those skilled in the art will appreciate that the processor, computer, or memory may actually comprise multiple processors, computers, or memories that may or may not be stored within the same physical housing. For example, the memory may be a hard drive or other storage medium located in a housing different from that of the computer. Thus, a reference to a processor or computer will be understood to include a reference to a collection of processors, computers, or memories that may or may not operate in parallel. Rather than using a single processor to perform the steps described herein, some components, such as the steering component and the retarding component, may each have their own processor that performs only computations related to the component-specific functions.
In the embodiment of the present disclosure, the first processor 651 may execute the scene library establishing method described above.
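Purely as a hedged sketch of how the scene library establishing method executed by the first processor 651 might be organized in code, the following Python outline captions each frame, matches the scene-library labels against the resulting description texts, and stores the matched frames; the caption_model object, the substring matching rule, and the dictionary-based scene library are assumptions introduced here for illustration and are not taken from the disclosure.

    def build_scene_library(frames, caption_model, labels, scene_library):
        # Step 1: generate a description text for every frame of the video to be processed.
        described = [(frame, caption_model.describe(frame)) for frame in frames]

        # Step 2: match each label of the scene library against the description texts.
        for frame, text in described:
            matched_labels = [label for label in labels if label.lower() in text.lower()]
            # Step 3: store the frame corresponding to a matched description text
            # (the first target image) to obtain the target scene library.
            for label in matched_labels:
                scene_library.setdefault(label, []).append(frame)
        return scene_library

In this sketch the matching is a plain substring comparison; the embodiments described above instead contemplate a target model selected according to the label to perform the matching.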
In various aspects described herein, the first processor 651 may be located remotely from the vehicle and in wireless communication with the vehicle. In other aspects, some of the processes described herein are executed on a processor disposed within the vehicle and others are executed by a remote processor, including taking the steps necessary to perform a single maneuver.
In some embodiments, the first memory 652 can contain instructions 653 (e.g., program logic), which instructions 653 can be executed by the first processor 651 to perform various functions of the vehicle 600. The first memory 652 may also contain additional instructions, including instructions to send data to, receive data from, interact with, and/or control one or more of the infotainment system 610, the perception system 620, the decision control system 630, and the drive system 640.
In addition to the instructions 653, the first memory 652 may also store data such as road maps, route information, and the vehicle's location, direction, speed, and other such vehicle data, as well as other information. Such information may be used by the vehicle 600 and the computing platform 650 during operation of the vehicle 600 in autonomous, semi-autonomous, and/or manual modes.
The computing platform 650 may control functions of the vehicle 600 based on inputs received from various subsystems (e.g., the drive system 640, the perception system 620, and the decision control system 630). For example, the computing platform 650 may utilize input from the decision control system 630 in order to control the steering system 633 to avoid obstacles detected by the perception system 620. In some embodiments, the computing platform 650 is operable to provide control over many aspects of the vehicle 600 and its subsystems.
Optionally, one or more of these components described above may be mounted separately from or associated with the vehicle 600. For example, the first memory 652 may exist partially or completely separate from the vehicle 600. The aforementioned components may be communicatively coupled together in a wired and/or wireless manner.
Optionally, the above components are only an example; in an actual application, components in the above modules may be added or deleted according to actual needs, and fig. 4 should not be construed as limiting the embodiments of the present disclosure.
An autonomous automobile traveling on a roadway, such as the vehicle 600 above, may identify objects within its surrounding environment in order to determine an adjustment to its current speed. The objects may be other vehicles, traffic control devices, or objects of other types. In some examples, each identified object may be considered independently, and the respective characteristics of the object, such as its current speed, acceleration, and separation from the vehicle, may be used to determine the speed to which the autonomous vehicle is to be adjusted.
Optionally, the vehicle 600 or a sensing and computing device associated with the vehicle 600 (e.g., the computing system 631 or the computing platform 650) may predict the behavior of an identified object based on the characteristics of the identified object and the state of the surrounding environment (e.g., traffic, rain, ice on the road, etc.). Optionally, since the identified objects depend on one another's behavior, all of the identified objects may also be considered together to predict the behavior of a single identified object. The vehicle 600 is able to adjust its speed based on the predicted behavior of the identified objects. In other words, the autonomous vehicle is able to determine what stable state the vehicle will need to adjust to (e.g., accelerate, decelerate, or stop) based on the predicted behavior of the objects. Other factors may also be considered in this process to determine the speed of the vehicle 600, such as the lateral position of the vehicle 600 in the road being traveled, the curvature of the road, the proximity of static and dynamic objects, and so forth.
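As a hedged illustration only, with the thresholds and the notion of predicted gaps being assumptions introduced here rather than details of the disclosure, a speed-adjustment decision of the kind just described might be sketched in Python as follows.

    def decide_speed_adjustment(predicted_gaps, min_gap=10.0, comfort_gap=30.0):
        # predicted_gaps: predicted closest distances (in metres) to surrounding
        # objects over the planning horizon, derived from their predicted behavior.
        closest = min(predicted_gaps, default=float("inf"))
        if closest < min_gap:
            return "decelerate"   # a predicted conflict: slow down or stop
        if closest < comfort_gap:
            return "maintain"     # keep the current speed and keep monitoring
        return "accelerate"       # the road ahead is predicted to stay clear

A fuller decision would also weigh the factors mentioned above, such as the lateral position of the vehicle 600 in the lane, the curvature of the road, and the proximity of static and dynamic objects.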
In addition to providing instructions to adjust the speed of the autonomous vehicle, the computing device may provide instructions to modify the steering angle of the vehicle 600 to cause the autonomous vehicle to follow a given trajectory and/or to maintain a safe lateral and longitudinal distance from objects in the vicinity of the autonomous vehicle (e.g., vehicles in adjacent lanes on the road).
The vehicle 600 may be any type of vehicle, such as a car, a truck, a motorcycle, a bus, a boat, an airplane, a helicopter, a recreational vehicle, a train, etc., and the disclosed embodiment is not particularly limited.
Fig. 5 is a block diagram illustrating an apparatus 800 for scene library creation according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: a first processing component 802, a second memory 804, a first power component 806, a multimedia component 808, an audio component 810, a first input/output interface 812, a sensor component 814, and a communication component 816.
The first processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The first processing component 802 may include one or more second processors 820 to execute instructions to perform all or a portion of the steps of the scene library creation method described above. Further, the first processing component 802 can include one or more modules that facilitate interaction between the first processing component 802 and other components. For example, the first processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the first processing component 802.
The second memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The second memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A first power supply component 806 provides power to the various components of the device 800. The first power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, audio component 810 includes a Microphone (MIC) configured to receive external audio signals when apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the second memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The first input/output interface 812 provides an interface between the first processing component 802 and a peripheral interface module, which may be a keyboard, click wheel, button, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor assembly 814 may also detect a change in the position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the apparatus 800 and other devices in a wired or wireless manner. The apparatus 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described scene library creation methods.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the second memory 804 comprising instructions, executable by the second processor 820 of the apparatus 800 to perform the scene library creation method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The apparatus may be a part of a stand-alone electronic device. For example, in an embodiment, the apparatus may be an Integrated Circuit (IC) or a chip, where the IC may be a single IC or a collection of multiple ICs; the chip may include, but is not limited to, the following categories: a GPU (Graphics Processing Unit), a CPU (Central Processing Unit), an FPGA (Field Programmable Gate Array), a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an SOC (System on Chip), and the like. The integrated circuit or chip may be used to execute executable instructions (or code) to implement the scene library establishing method. The executable instructions may be stored in the integrated circuit or chip, or may be retrieved from another device or apparatus; for example, the integrated circuit or chip may include a processor, a memory, and an interface for communicating with other devices. The executable instructions may be stored in the memory, and when the executable instructions are executed by the processor, the scene library establishing method is implemented; alternatively, the integrated circuit or chip may receive the executable instructions through the interface and transmit them to the processor for execution, so as to implement the scene library establishing method.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned scene library creation method when executed by the programmable apparatus.
Fig. 6 is a block diagram illustrating an apparatus 1900 for scene library creation according to an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to Fig. 6, the apparatus 1900 includes a second processing component 1922, which further includes one or more processors, and memory resources, represented by a third memory 1932, for storing instructions, such as application programs, executable by the second processing component 1922. The application programs stored in the third memory 1932 may include one or more modules, each of which corresponds to a set of instructions. Further, the second processing component 1922 is configured to execute the instructions to perform the scene library creation method described above.
The device 1900 may also include a second power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and a second input/output interface 1958. The device 1900 may operate based on an operating system, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like, stored in the third memory 1932.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (11)

1. A method for establishing a scene library is characterized by comprising the following steps:
performing language description on multi-frame images of a video to be processed to obtain a plurality of description texts of the multi-frame images;
matching the labels of the scene library with the plurality of description texts of the multi-frame images respectively to obtain target description texts matched with the labels;
and storing the first target image corresponding to the target description text into the scene library to obtain a target scene library with the first target image.
2. The method for establishing the scene library according to claim 1, wherein the step of matching the labels of the scene library with the plurality of description texts of the multi-frame images to obtain the target description texts matched with the labels comprises the following steps:
determining a target model according to the label of the scene library;
and matching the labels of the scene library with the plurality of description texts of the multi-frame images respectively through the target model to obtain the target description texts matched with the labels.
3. The method for establishing the scene library according to claim 1, wherein the step of storing a first target image corresponding to the target description text in the scene library to obtain a target scene library with the first target image comprises:
under the condition that the number of the acquired first target images is larger than a first preset number, obtaining a first target video clip according to the timestamps of a plurality of frames of the first target images;
and storing the first target video clip into the scene library to obtain a target scene library with the first target video clip.
4. The method for establishing the scene library according to claim 1, wherein the step of storing a first target image corresponding to the target description text in the scene library to obtain a target scene library with the first target image comprises:
under the condition that the number of the acquired first target images is smaller than a second preset number, obtaining a second target video clip according to the timestamps of a plurality of frames of second target images adjacent to the first target images, or according to the timestamps of the first target images and the timestamps of the plurality of frames of second target images;
and storing the second target video clip into the scene library to obtain a target scene library with the second target video clip.
5. The method of creating a scene library according to claim 4, wherein the second target image is obtained by:
respectively carrying out feature comparison on the multi-frame associated images adjacent to the first target image and the first target image; when any one of the associated images is matched with the features of the first target image, continuing to perform feature comparison on the next frame of associated image in the multiple frames of associated images with the first target image until an associated image which is not matched with the features of the first target image appears, and obtaining one or more associated images matched with the features of the first target image;
and taking the one or more associated images as the second target image.
6. The method for establishing the scene library according to claim 1, wherein performing language description on a plurality of frames of images of a video to be processed to obtain a plurality of description texts of the plurality of frames of images comprises:
performing language description on a multi-frame image in the video to be processed and a plurality of sections of voices in the video to be processed to obtain a plurality of description texts of the multi-frame image;
wherein, the multi-frame images used for describing the same picture correspond to the same voice.
7. The method for establishing the scene library according to claim 2, wherein the target model is obtained by training through the following steps:
and training the model by using the training description text and the label corresponding to the training description text to obtain the target model.
8. A scene library creation apparatus, comprising:
the description module is configured to perform language description on multi-frame images of a video to be processed to obtain a plurality of description texts of the multi-frame images;
the matching module is configured to match the labels of the scene library with the plurality of description texts of the multi-frame images respectively to obtain target description texts matched with the labels;
the storage module is configured to store a first target image corresponding to the target description text into the scene library to obtain a target scene library with the first target image.
9. A vehicle, characterized by comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
implement the steps of the scene library establishing method of any one of claims 1 to 7.
10. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the scene library creation method of any one of claims 1 to 7.
11. A chip comprising a processor and an interface; the processor is configured to read instructions to perform the steps of the scene library creation method of any one of claims 1 to 7.
CN202210686373.3A 2022-06-17 2022-06-17 Scene library establishing method and device, vehicle, storage medium and chip Active CN114756700B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210686373.3A CN114756700B (en) 2022-06-17 2022-06-17 Scene library establishing method and device, vehicle, storage medium and chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210686373.3A CN114756700B (en) 2022-06-17 2022-06-17 Scene library establishing method and device, vehicle, storage medium and chip

Publications (2)

Publication Number Publication Date
CN114756700A true CN114756700A (en) 2022-07-15
CN114756700B CN114756700B (en) 2022-09-09

Family

ID=82336826

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210686373.3A Active CN114756700B (en) 2022-06-17 2022-06-17 Scene library establishing method and device, vehicle, storage medium and chip

Country Status (1)

Country Link
CN (1) CN114756700B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180356967A1 (en) * 2017-06-12 2018-12-13 Adobe Systems Incorporated Facilitating automatic generation of customizable storyboards
US20190026367A1 (en) * 2017-07-24 2019-01-24 International Business Machines Corporation Navigating video scenes using cognitive insights
CN110555136A (en) * 2018-03-29 2019-12-10 优酷网络技术(北京)有限公司 Video tag generation method and device and computer storage medium
CN110147846A (en) * 2019-05-23 2019-08-20 软通智慧科技有限公司 Video segmentation method, device, equipment and storage medium
CN110139158A (en) * 2019-06-21 2019-08-16 上海摩象网络科技有限公司 The generation method of video and sub-video, device, electronic equipment
CN113468351A (en) * 2020-04-27 2021-10-01 海信集团有限公司 Intelligent device and image processing method
CN113496208A (en) * 2021-05-20 2021-10-12 华院计算技术(上海)股份有限公司 Video scene classification method and device, storage medium and terminal
CN114610628A (en) * 2022-03-16 2022-06-10 阿波罗智联(北京)科技有限公司 Scene library establishing and testing method, device, equipment, medium and program product

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115631482A (en) * 2022-11-30 2023-01-20 广汽埃安新能源汽车股份有限公司 Driving perception information acquisition method and device, electronic equipment and readable medium

Also Published As

Publication number Publication date
CN114756700B (en) 2022-09-09

Similar Documents

Publication Publication Date Title
CN114882464B (en) Multi-task model training method, multi-task processing method, device and vehicle
CN114935334B (en) Construction method and device of lane topological relation, vehicle, medium and chip
CN114756700B (en) Scene library establishing method and device, vehicle, storage medium and chip
CN115164910B (en) Travel route generation method, travel route generation device, vehicle, storage medium, and chip
CN114842455B (en) Obstacle detection method, device, equipment, medium, chip and vehicle
CN115170630B (en) Map generation method, map generation device, electronic equipment, vehicle and storage medium
CN114771539B (en) Vehicle lane change decision method and device, storage medium and vehicle
CN114880408A (en) Scene construction method, device, medium and chip
CN114537450A (en) Vehicle control method, device, medium, chip, electronic device and vehicle
CN114973178A (en) Model training method, object recognition method, device, vehicle and storage medium
CN114863717A (en) Parking space recommendation method and device, storage medium and vehicle
CN113804211A (en) Overhead identification method and device
CN115221260B (en) Data processing method, device, vehicle and storage medium
CN115115822B (en) Vehicle-end image processing method and device, vehicle, storage medium and chip
CN114842454B (en) Obstacle detection method, device, equipment, storage medium, chip and vehicle
CN115219151B (en) Vehicle testing method, system, electronic equipment and medium
CN114821511B (en) Rod body detection method and device, vehicle, storage medium and chip
CN115042813B (en) Vehicle control method and device, storage medium and vehicle
CN114911630B (en) Data processing method and device, vehicle, storage medium and chip
CN114789723B (en) Vehicle running control method and device, vehicle, storage medium and chip
CN115221261A (en) Map data fusion method and device, vehicle and storage medium
CN115205804A (en) Image processing method, image processing apparatus, vehicle, medium, and chip
CN115620258A (en) Lane line detection method, device, storage medium and vehicle
CN114954528A (en) Vehicle control method, device, vehicle, storage medium and chip
CN114802258A (en) Vehicle control method, device, storage medium and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant