CN117893978A - Environment sensing method, device, storage medium and vehicle

Environment sensing method, device, storage medium and vehicle

Info

Publication number
CN117893978A
Authority
CN
China
Prior art keywords
scene
result
recognition
confidence coefficient
model
Prior art date
Legal status
Pending
Application number
CN202211230109.5A
Other languages
Chinese (zh)
Inventor
李贵增
杨冬生
王欢
徐驰
Current Assignee
BYD Co Ltd
Original Assignee
BYD Co Ltd
Priority date
Filing date
Publication date
Application filed by BYD Co Ltd filed Critical BYD Co Ltd
Priority to CN202211230109.5A
Publication of CN117893978A

Landscapes

  • Image Analysis (AREA)

Abstract

The present disclosure relates to an environment awareness method, an apparatus, a storage medium, and a vehicle, the method comprising: acquiring image information of a current environment; inputting the image information into a pre-trained scene classification model, and acquiring a classification result output by the scene classification model, wherein the classification result comprises each scene in a plurality of scenes and a first confidence coefficient corresponding to each scene; determining a perception result of the current environment according to the classification result and a recognition result output by at least one recognition model based on the image information, wherein the recognition result comprises a recognized recognition object and a second confidence coefficient corresponding to the recognition object, and each scene corresponds to a pre-trained recognition model; and if the sensing result meets the preset condition, outputting the sensing result. The method and apparatus can improve the efficiency and accuracy with which the vehicle perceives its surrounding environment.

Description

Environment sensing method, device, storage medium and vehicle
Technical Field
The disclosure relates to the technical field of automatic driving, in particular to an environment sensing method, an environment sensing device, a storage medium and a vehicle.
Background
The visual automobile environment sensing method in the related art is to train a single model to perform reasoning and sensing by using a pre-acquired image data set. However, this approach is only applicable to some general-purpose scenes, and for some long-tail scenes with a small sample size, the environment around the vehicle will not be accurately perceived.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides an environment sensing method, apparatus, storage medium, and vehicle.
According to a first aspect of embodiments of the present disclosure, there is provided an environment-aware method, the method comprising:
acquiring image information of a current environment;
inputting the image information into a pre-trained scene classification model, and acquiring a classification result output by the scene classification model, wherein the classification result comprises each scene in a plurality of scenes and a first confidence coefficient corresponding to each scene;
determining a perception result of the current environment according to the classification result and a recognition result output by at least one recognition model based on the image information, wherein the recognition result comprises a recognized recognition object and a second confidence coefficient corresponding to the recognition object, and each scene corresponds to a pre-trained recognition model;
And if the sensing result meets the preset condition, outputting the sensing result.
Optionally, the determining the perception result of the current environment according to the classification result and the recognition result output by the at least one recognition model based on the image information includes:
according to the classification result, acquiring a scene with the maximum first confidence from the plurality of scenes as a target scene;
and inputting the image information into the recognition model corresponding to the target scene, acquiring a recognition result output by the recognition model corresponding to the target scene, and taking the recognition result as a perception result of the current environment.
Optionally, if the sensing result meets a preset condition, outputting the sensing result includes:
and if the second confidence coefficient in the identification result is larger than or equal to the first threshold value, outputting the identification result as the perception result.
Optionally, the determining the perception result of the current environment according to the classification result and the recognition result output by the at least one recognition model based on the image information includes:
respectively inputting the image information into an identification model corresponding to each scene to obtain an identification result corresponding to each scene, wherein the identification result comprises an identification object and a second confidence coefficient corresponding to the identification object;
For each recognition object, carrying out weighting processing on the second confidence coefficient of the recognition object under each scene and the first confidence coefficient corresponding to each scene to obtain a weighted second confidence coefficient of the recognition object;
and determining the second confidence coefficient of each recognition object and the weighted second confidence coefficient of each recognition object as the perception result.
Optionally, if the sensing result meets a preset condition, outputting the sensing result includes:
and if the weighted second confidence coefficient is greater than or equal to a second threshold value, outputting a recognition result corresponding to the weighted second confidence coefficient as the perception result.
Optionally, for each recognition object, weighting the second confidence coefficient of the recognition object under each scene and the first confidence coefficient corresponding to each scene to obtain a weighted second confidence coefficient of the recognition object, including:
normalizing the first confidence coefficient corresponding to each scene in the classification result to obtain a processed first confidence coefficient;
and for each recognition object, carrying out weighting processing on the second confidence coefficient of the recognition object under each scene and the processed first confidence coefficient corresponding to each scene to obtain a weighted second confidence coefficient of the recognition object.
Optionally, the inputting the image information into a pre-trained scene classification model, and obtaining a classification result output by the scene classification model includes:
extracting a region of interest from the image information through a preset rule;
inputting the region of interest into a pre-trained scene classification model, and obtaining a classification result output by the scene classification model.
Optionally, the method further comprises:
acquiring a plurality of image samples of a pre-calibrated identification object;
scene labeling is carried out on the plurality of image samples, and labeled image samples are obtained;
and training to obtain the scene classification model based on the noted image sample.
Optionally, before the training to obtain the scene classification model based on the noted image sample, the method further includes:
and determining that the image samples corresponding to each scene in the marked image samples meet the preset quantity requirement.
Optionally, the method further comprises: and training to obtain an identification model corresponding to each scene based on the image sample corresponding to the scene, wherein the scenes comprise a general scene and a long tail scene, and the image sample corresponding to the general scene comprises the image sample with the label of the long tail scene.
According to a second aspect of embodiments of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
According to a third aspect of embodiments of the present disclosure, there is provided an environment-aware apparatus comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a vehicle comprising: the context awareness apparatus of the third aspect.
The technical scheme provided by the embodiment of the disclosure can comprise the following beneficial effects: acquiring image information of the current environment; inputting the image information into a pre-trained scene classification model, and obtaining a classification result output by the scene classification model, wherein the classification result comprises each scene in a plurality of scenes and a first confidence coefficient corresponding to each scene; then determining a perception result of the current environment according to the classification result and a recognition result output by at least one recognition model based on image information, wherein the recognition result comprises a recognized recognition object and a second confidence coefficient corresponding to the recognition object, and each scene corresponds to a pre-trained recognition model; and if the sensing result meets the preset condition, outputting the sensing result. The scene classification result can reflect the scene related to the current environment, so that the recognition result of the recognition model suitable for the current environment can be selected through the classification result of the scene to determine the final environment perception result, the defect that a single recognition model cannot accurately recognize recognition objects in a few-sample scene is avoided, and the accuracy and efficiency of environment perception are improved.
Additional features and advantages of the present disclosure will be set forth in the detailed description which follows.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate the disclosure and together with the description serve to explain, but do not limit the disclosure. In the drawings:
FIG. 1 is a flow chart illustrating a method of context awareness according to an exemplary embodiment.
Fig. 2 is a flow chart illustrating a method of context awareness according to another exemplary embodiment.
FIG. 3 is a flowchart of a training method for the scene classification model shown in the embodiment of FIG. 2.
FIG. 4 is a flowchart of a training method for the recognition model shown in the embodiment of FIG. 2.
Fig. 5 is a flowchart showing the implementation of steps S240 to S270 in the embodiment shown in fig. 2.
FIG. 6 is a diagram of the byte-form representation of the recognition model output in the embodiment shown in FIG. 2.
Fig. 7 is a schematic diagram of the procedure of the non-maximum value suppression processing shown in the embodiment of fig. 2.
FIG. 8 is a flow chart of a method of context awareness in a single scenario, as illustrated in the embodiment of FIG. 2.
FIG. 9 is a block diagram illustrating an environment awareness system according to an example embodiment.
Fig. 10 is a schematic diagram illustrating a structure of an environment-aware device according to an exemplary embodiment.
Fig. 11 is a schematic diagram showing a structure of a server according to an exemplary embodiment.
Detailed Description
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the disclosure, are not intended to limit the disclosure.
It should be noted that, all actions for acquiring signals, information or data in the present disclosure are performed under the condition of conforming to the corresponding data protection rule policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
The environment sensing method in the related art is to use a single model for reasoning sensing.
Long-tail scenes refer to scenes that are varied, of low occurrence probability, or sudden, and for which the number of samples is small. Most automobile use scenes are on standard roads in normal weather, but long-tail scenes such as severe weather or vehicles illegally parked at the roadside can also be encountered, so the data are unbalanced. When the data sets are collected, the sample sizes of long-tail scenes are small; if a single model is trained on them, some scenes may effectively not be learned because their samples contribute too little when the loss weights are calculated. If the loss is adjusted through resampling, the increased weight of the long-tail scene data can affect the normal inference of the model in general scenes.
To address the above problems, this embodiment provides an environment sensing method, an environment sensing device, a storage medium and a vehicle, which classify scenes, train recognition models for different scenes, and use a different recognition model for each scene during recognition, thereby improving the efficiency and accuracy of the vehicle's perception of the environment.
The following describes terms of art related to this embodiment:
long tail scene: refers to a scene with a large variety, low occurrence probability or burst, and the number of the scene samples is small.
Region of interest (Region of Interest, ROI): in machine vision and image processing, a region to be processed that is outlined from the image being processed with a box, circle, ellipse, irregular polygon or the like is called a region of interest.
FIG. 1 is a flow chart illustrating an environment sensing method according to an exemplary embodiment. As shown in FIG. 1, the method may be used in a vehicle and may include the following steps:
s110, acquiring image information of the current environment.
In some embodiments, the vehicle is provided with a camera, and the vehicle can acquire an image of the current environment through the camera in real time, so as to obtain image information.
S120, inputting the image information into a pre-trained scene classification model, and obtaining a classification result output by the scene classification model, wherein the classification result comprises each scene in a plurality of scenes and a first confidence coefficient corresponding to each scene.
In some implementations, the scene classification model may be trained based on a plurality of image samples, wherein each of the plurality of image samples is labeled with a scene tag. The vehicle may input image information to a scene classification model, which may identify a scene in which the current environment is located according to the image information, and output a classification result. It will be appreciated that since there may be instances where multiple scenes coexist, the classification result may include a first confidence level for each of the multiple scenes. Illustratively, the classification results may include, for example: a first confidence coefficient a1 corresponding to a raining scene, a first confidence coefficient a2 corresponding to a night scene, and a first confidence coefficient a3 corresponding to a standard highway driving scene.
And S130, determining a perception result of the current environment according to the classification result and a recognition result output by at least one recognition model based on the image information, wherein the recognition result comprises a recognized recognition object and a second confidence coefficient corresponding to the recognition object, and each scene corresponds to one pre-trained recognition model.
The recognition model is trained based on a plurality of image samples with recognition object labels, and can output each recognition object (such as an obstacle, a pedestrian, a lane line, a traffic light and the like) recognized from the image information, and a second confidence corresponding to the recognition object, such as a second confidence b1 corresponding to the obstacle, a second confidence b2 corresponding to the pedestrian and a second confidence b3 corresponding to the lane line, according to the input image information. It can be understood that each recognition model corresponds to a scene in advance, when the recognition model is trained, the image sample can also be added with a corresponding scene label in advance, for example, the image information with a traffic light label can be added with a label of a standard road driving scene, and then the image sample with the scene label can be used for training to obtain the recognition model so that the obtained recognition model corresponds to the scene.
In some embodiments, the higher the first confidence coefficient corresponding to a scene is, the more likely it is that the current environment belongs to that scene, so the scene with the highest first confidence coefficient can be selected according to the classification result, the image information can then be identified by using the recognition model corresponding to that scene, and the recognition result can be used as the perception result of the current environment. For example, if the scene with the highest first confidence is a rainy scene, the image information may be input into the recognition model corresponding to the rainy scene to obtain a recognition result, for example: the recognition object recognized by the recognition model corresponding to the rainy scene is a pedestrian, and the second confidence corresponding to the pedestrian is 0.85.
And S140, outputting the sensing result if the sensing result meets the preset condition.
Following the above example, if the preset condition is that the second confidence in the perception result is greater than or equal to a confidence threshold, and assuming that the confidence threshold is 0.7, it may be determined that the vehicle perceives that a "pedestrian" exists in the current environment, and the perception result "pedestrian" may be output.
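As a non-limiting illustration, the single-scene flow of steps S110 to S140 can be sketched in code. The sketch below assumes hypothetical callables (a scene classifier returning a vector of first confidences, and a dictionary of per-scene recognition models returning detections with a "confidence" field) and an illustrative threshold of 0.7; it is not the concrete implementation of the disclosure.

```python
import numpy as np

def perceive(image, scene_classifier, recognition_models, threshold=0.7):
    """scene_classifier(image) -> vector of first confidences, in the same order
    as the keys of recognition_models (an assumption made for this sketch)."""
    scene_names = list(recognition_models.keys())
    first_conf = scene_classifier(image)                     # S120: scene classification result
    target_scene = scene_names[int(np.argmax(first_conf))]   # pick scene with maximum first confidence
    detections = recognition_models[target_scene](image)     # S130: per-scene recognition result
    # S140: keep only recognition objects whose second confidence meets the preset condition
    return [d for d in detections if d["confidence"] >= threshold]
```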
In some embodiments, the method may also be applied to a server, where the server may receive data uploaded by the vehicle, process the data, and output a final sensing result to the vehicle.
It can be seen that, in this embodiment, by acquiring image information of the current environment; inputting the image information into a pre-trained scene classification model, and obtaining a classification result output by the scene classification model, wherein the classification result comprises each scene in a plurality of scenes and a first confidence coefficient corresponding to each scene; then determining a perception result of the current environment according to the classification result and a recognition result output by at least one recognition model based on image information, wherein the recognition result comprises a recognized recognition object and a second confidence coefficient corresponding to the recognition object, and each scene corresponds to a pre-trained recognition model; and if the sensing result meets the preset condition, outputting the sensing result. The scene classification result can reflect the scene related to the current environment, so that the recognition result of the recognition model suitable for the current environment can be selected through the classification result of the scene to determine the final environment perception result, the defect that a single recognition model cannot accurately recognize recognition objects in a few-sample scene is avoided, and the accuracy and efficiency of environment perception are improved.
FIG. 2 is a flow chart of an environment sensing method according to an exemplary embodiment. As shown in FIG. 2, the method may be used in a vehicle and may include the following steps:
s210, acquiring a plurality of image samples of the pre-calibrated identification object.
Alternatively, the recognition object may include, but is not limited to, pedestrians, obstacles, vehicles, lane lines, traffic lights, etc., and an image that has been calibrated with the recognition object may be obtained as an image sample from an image database, for example; or the images can be acquired on site, marked with the identification objects in the images and then stored as image samples. The plurality of image samples may then be used as an image training dataset.
S220, scene labeling is carried out on the plurality of image samples, and labeled image samples are obtained.
Following the above example, for the image training data set obtained above, scene classification labeling may be performed on each image sample in the image training data set, labeling the scene to which each image sample belongs. For example, image samples with good lighting conditions, a clear field of view, good weather and high visibility may be labeled as a generic scene. Image samples that are difficult to identify may, according to their actual conditions, be labeled as special scenes such as rain, night, dusk, snow or heavy fog. Where an image sample belongs to multiple scenes, the coexisting scenes can all be marked as scene categories of that image sample. In this way, the recognition model corresponding to a scene can learn features according to the characteristics of that scene.
And S230, training to obtain the scene classification model based on the noted image sample.
In some embodiments, before step S230, it may further include:
and determining that the image samples corresponding to each scene in the marked image samples meet the preset quantity requirement.
For example, since the number of image samples that can be acquired for some long-tail or special scenes is relatively small, the number of image samples corresponding to each scene used for the scene classification model can be balanced by adjusting the image samples corresponding to each scene so that they meet a preset quantity requirement. For example, when the number of image samples corresponding to a certain scene is within a specified range, it may be determined that the image samples corresponding to that scene satisfy the preset quantity requirement. For the image samples corresponding to a certain scene, some may be removed if there are too many and more may be added if there are too few.
As an example, in practical application, the specific implementation procedure of step S210 to step S230, that is, the training method for the scene classification model, may be implemented, as shown in fig. 3, specifically through steps 101 to 104.
Step 101: this step requires an image training data set annotated with perception tasks such as pedestrians, obstacles, vehicles, lane lines and traffic lights. The content annotated in the image training data set determines which task types the recognition models can be trained for; only tasks annotated in the image training data set can be trained normally. The perception tasks correspond to the recognition objects in the above embodiment.
Step 102: in the preprocessing stage, cropping and image enhancement are performed on each image of the annotated data set. Data enhancement operations such as small-angle rotation, translation and scaling can also be applied, and a region of interest (ROI) whose length and width match the input dimensions of the scene classification model is taken out and added to the training data. This step may output preprocessed data with scene classification labels.
Step 103: in the preprocessed data output by step 102, consistent with the actual situation, there is less special-scene (long-tail scene) data than general-scene data. The data set output by step 102 therefore needs to be resampled; resampling may include oversampling and undersampling (where undersampling removes samples from the majority class, and oversampling adds more samples to the minority class). Specifically, the non-general-scene data in the preprocessed data output by step 102 is sampled multiple times (oversampled), or only a part of the general-scene data is used as training data (undersampled), so that the amount of data for each scene is approximately balanced. The scene classification model is trained using the resampled preprocessed data, and the scene classification training model is output. This completes training of the scene classification model (or attention model) on unbalanced data.
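To make the resampling in step 103 concrete, the following is a minimal sketch; the data layout (a list of (image_path, scene) pairs) and the per-scene target count are illustrative assumptions rather than details from the disclosure.

```python
import random
from collections import defaultdict

def resample_by_scene(samples, target_per_scene):
    """samples: iterable of (image_path, scene) pairs (assumed layout)."""
    by_scene = defaultdict(list)
    for image_path, scene in samples:
        by_scene[scene].append((image_path, scene))
    balanced = []
    for scene, items in by_scene.items():
        if len(items) >= target_per_scene:      # undersample majority (general) scenes
            balanced += random.sample(items, target_per_scene)
        else:                                   # oversample minority (long-tail) scenes
            balanced += [random.choice(items) for _ in range(target_per_scene)]
    random.shuffle(balanced)
    return balanced
```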
Alternatively, for the problem of unbalanced training data in such multi-classification tasks, besides resampling before training, a similar effect can be achieved by dynamically adjusting the weight of the loss during training. Taking Focal Loss as an example: the preprocessed data from step 102 is taken as input to the classification model, and the classification model outputs a classification result [a_1, a_2, …, a_n], where a_i is the confidence corresponding to each scene.
Then, when calculating the loss value, Loss = −Σ_j α_(t,j) · (1 − p_j)^{γ_j} · log(p_j) is used as the training loss function, where p_j is the prediction accuracy for the j-th scene; γ_j controls how strongly the training weight of high-accuracy samples is weakened for the j-th scene, is preset before training, and the larger it is, the stronger the weakening; and α_(t,j) controls the weight of positive and negative samples of the j-th scene. If a sample actually belongs only to the i-th scene, then p_i = a_i and α_(t,i) = α_i, and for k ≠ i, p_k = 1 − a_k and α_(t,k) = 1 − α_k, where a larger preset α_k biases the k-th scene more heavily toward positive samples.
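A minimal sketch of the per-scene focal loss described above is given below, assuming the confidences a, the preset per-scene parameters alpha and gamma, and the index of the true scene are available as arrays; it is an illustration of the formula, not the disclosed training code.

```python
import numpy as np

def scene_focal_loss(a, true_scene, alpha, gamma, eps=1e-7):
    """a: predicted confidences [a_1, ..., a_n]; true_scene: index i of the actual scene;
    alpha, gamma: preset per-scene weighting parameters (assumed as arrays)."""
    loss = 0.0
    for j in range(len(a)):
        if j == true_scene:
            p_j, alpha_tj = a[j], alpha[j]                # positive scene: p_j = a_j
        else:
            p_j, alpha_tj = 1.0 - a[j], 1.0 - alpha[j]    # negative scene: p_j = 1 - a_j
        loss += -alpha_tj * (1.0 - p_j) ** gamma[j] * np.log(p_j + eps)
    return loss
```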
Step 104: retain the required network structure and parameters of the model output by step 103, freeze and convert the model, and generate a model file for inference.
In some embodiments, the method further comprises:
And training to obtain an identification model corresponding to each scene based on the image sample corresponding to the scene, wherein the plurality of scenes comprise a general scene and a long tail scene, and the image sample corresponding to the general scene comprises the image sample with the label of the long tail scene.
As an example, in practical application, the training method for the recognition model may be as shown in fig. 4, and specifically may include the following steps:
step 201, training a recognition model of a general scene by using all image samples (whole set) of the image training dataset (hereinafter referred to as dataset) so as to enable the reasoning process of the general scene to have better generalization capability. In this step, a generic scene perception training model may be output.
Step 202, generating image data subsets corresponding to each scene from all images (image data total sets) of the data set according to the labeling result of step 101 in the above embodiment. If there are multiple scenes in the image sample, it is divided into multiple scene data subsets.
Step 203: retrain the general-scene perception training model obtained in step 201 using the data subsets corresponding to the different scenes (step 202). After iterative training, when the error of the model has fallen into a specified range, the training model corresponding to each scene is output, such as a pedestrian detection model for a snowy scene, a lane recognition model for night, or a vehicle detection model for a foggy scene.
Taking a YOLOP recognition model in a rainy scene as an example, all pictures marked as rainy scenes are taken from the output of step 101 to form a training data subset. Assume the training data set input in step 101 is annotated with targets such as pedestrians, vehicles, drivable areas and lane lines. The YOLOP single model can simultaneously predict object detection, the drivable area and lane lines. After training in step 203, a recognition model is output that can predict pedestrians, vehicles, drivable areas and lane lines in a rainy scene.
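A rough sketch of the retraining in step 203 follows; the use of PyTorch, the SGD optimizer, the loss function and the epoch count are assumptions made only for illustration, not the disclosed procedure.

```python
import torch

def finetune_scene_model(model, general_weights_path, scene_loader, loss_fn, epochs=10, lr=1e-3):
    # Initialise from the general-scene model weights (assumed saved as a state dict)
    model.load_state_dict(torch.load(general_weights_path))
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):                      # iterate until the error falls into the desired range
        for images, targets in scene_loader:     # scene-specific data subset from step 202
            optimizer.zero_grad()
            loss = loss_fn(model(images), targets)
            loss.backward()
            optimizer.step()
    return model
```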
Step 204: retain the required network structure and parameters of the model output by step 203, freeze and convert the model, and generate a model file for inference.
S240, acquiring image information of the current environment.
The specific embodiment of step S240 can refer to step S110, and is therefore not repeated here.
S250, inputting the image information into a pre-trained scene classification model, and obtaining a classification result output by the scene classification model. The classification result comprises each scene in the plurality of scenes and a first confidence coefficient corresponding to each scene.
In some embodiments, the specific implementation of step S250 may include:
extracting a region of interest from the image information through a preset rule; inputting the region of interest into a pre-trained scene classification model, and obtaining a classification result output by the scene classification model.
Illustratively, a region of interest (ROI) may be extracted from the image information, wherein the length and width of the ROI are consistent with the input of the scene classification model, or the ROI is scaled so that its length and width are consistent with the input of the scene classification model. The ROI position should be selected so that it covers as much of the image as possible, including the road, sky, distant objects, street lamps and the like, to preserve image information such as road type, weather and illumination conditions for the scene classification model. The ROI should also be selected to avoid image data that is irrelevant to the scene or repetitive, such as the hood of the vehicle or too much sky; selecting the middle part of the image is generally sufficient. Finally, image data with the same length, width and format as the input of the scene classification model is output.
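As an illustration of the ROI handling described above, a minimal sketch follows; the crop ratios (keeping roughly the middle of the image, dropping the hood region and excess sky) are assumptions, not values specified by the disclosure.

```python
import cv2

def extract_roi(image, input_w, input_h, top=0.2, bottom=0.85):
    """Crop a central band of the image and scale it to the scene classifier's input size."""
    h, w = image.shape[:2]
    roi = image[int(h * top):int(h * bottom), :]   # drop excess sky at the top and the hood at the bottom
    return cv2.resize(roi, (input_w, input_h))     # match the scene classification model input size
```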
And S260, determining the perception result of the current environment according to the classification result and the recognition result output by at least one recognition model based on the image information. The recognition result comprises recognized recognition objects and second confidence degrees corresponding to the recognition objects, and each scene corresponds to a pre-trained recognition model.
In some embodiments, the specific implementation of step S260 may include:
step S261A, according to the classification result, acquiring a scene with the highest first confidence degree from the plurality of scenes as a target scene.
For example, it is determined from the classification result that the first confidence corresponding to the rainy scene is 0.4, the first confidence corresponding to the night scene is 0.3, and the first confidence corresponding to the standard highway driving scene is 0.3. Then the raining scene may be determined as the target scene.
Step S262A, inputting the image information into the recognition model corresponding to the target scene, obtaining the recognition result output by the recognition model corresponding to the target scene, and taking the recognition result as the perception result of the current environment.
Following the above example, the recognition model corresponding to the rainy scene may be used as the recognition model for the current environment, the above image information may be input into that recognition model, and the output of the recognition model may be used as the perception result of the current environment, for example: the second confidence that a "pedestrian" is present is 0.7.
In other embodiments, specific embodiments of step S260 may include:
step S261B, inputting the image information to the recognition model corresponding to each scene, to obtain a recognition result corresponding to each scene, where the recognition result includes a recognition object and a second confidence coefficient corresponding to the recognition object.
Illustratively, for example, the current recognition models include a recognition model A corresponding to a rainy scene, a recognition model B corresponding to a night scene, and a recognition model C corresponding to a standard highway driving scene; the image information x may be input into recognition model A, recognition model B, and recognition model C, respectively. Assume the recognition result output by recognition model A is: the second confidence of "pedestrian" is 0.9 and the second confidence of "obstacle" is 0.6; the recognition result output by recognition model B is that the second confidence of "pedestrian" is 0.3; and the recognition result output by recognition model C is that the second confidence of "obstacle" is 0.1.
Step S262B, for each recognition object, weighting the second confidence coefficient of the recognition object under each scene and the first confidence coefficient corresponding to each scene to obtain the weighted second confidence coefficient of the recognition object.
Continuing the above example, suppose the first confidence of the rainy scene is 0.5, the first confidence of the night scene is 0.3, and the first confidence of the standard highway driving scene is 0.2. Then the second confidence of "pedestrian" weighted by the first confidences is 0.5 × 0.9 + 0.3 × 0.3 = 0.54, and the second confidence of "obstacle" weighted by the first confidences is 0.5 × 0.6 + 0.2 × 0.1 = 0.32. In the same way, a weighted second confidence is obtained for each recognized object.
As an embodiment, a specific embodiment of step S262B may include:
and carrying out normalization processing on the first confidence coefficient corresponding to each scene in the classification result to obtain the processed first confidence coefficient. And for each recognition object, weighting the second confidence coefficient of the recognition object in each scene and the processed first confidence coefficient corresponding to each scene to obtain the weighted second confidence coefficient of the recognition object.
In this embodiment, the first confidence coefficient corresponding to different scenes in the classification result is normalized, so that the weighting process of the second confidence coefficient by using the first confidence coefficient can be facilitated.
Step S263B, determining the weighted second confidence as the sensing result.
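The normalization and weighting of steps S261B to S263B can be sketched as follows; the dictionary-based data structures are illustrative assumptions, not the disclosed implementation.

```python
def weighted_second_confidence(first_conf, second_conf_per_scene):
    """first_conf: {scene: first confidence}; second_conf_per_scene:
    {object: {scene: second confidence}} from the per-scene recognition models."""
    total = sum(first_conf.values())
    norm = {s: c / total for s, c in first_conf.items()}        # normalised first confidences
    return {
        obj: sum(norm.get(s, 0.0) * c2 for s, c2 in per_scene.items())
        for obj, per_scene in second_conf_per_scene.items()
    }

# Example from the text: "pedestrian" gives 0.5 * 0.9 + 0.3 * 0.3 = 0.54
```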
S270, outputting the sensing result if the sensing result meets the preset condition.
In some embodiments, the specific implementation of step S270 may include:
S273A, if the second confidence coefficient in the identification result is greater than or equal to the first threshold value, the identification result is output as the perception result.
Continuing the above example, suppose it is determined that the perception result is that a "pedestrian" is present with a second confidence of 0.7. If the first threshold is 0.5, it can be determined that the recognition object "pedestrian" is recognized in the current environment, so the perception result "pedestrian" can be output. Optionally, the output may be through a display, a voice broadcast, uploading to a server, or the like, which is not limited herein.
In some embodiments, the specific implementation of step S270 may include:
S273B, if the weighted second confidence coefficient is greater than or equal to a second threshold value, outputting the recognition result corresponding to the weighted second confidence coefficient as the perception result.
Continuing the above example, suppose the weighted second confidence of "pedestrian" in the perception result is 0.54 and the weighted second confidence of "obstacle" is 0.32. If the second threshold is 0.5, it can be determined that the recognition object "pedestrian" is recognized in the current environment, so the perception result "pedestrian" can be output. The first threshold and the second threshold may be the same or different, which is not limited herein.
In practical application, the scene classification model and the recognition model may be an integral perception model, i.e. the perception model comprises two sub-models of the scene classification model and the recognition model.
As an example, in practical application, the specific implementation flow of step S240 to step S270 may be as shown in fig. 5:
s301, acquiring image information matched with the length and width of a perception model through a vehicle-mounted camera, and converting the image information (hereinafter also referred to as image data) into a format required by inputting the perception model.
S302, performing cutting operation on the image information.
A region of interest (ROI) is taken out, wherein the length and width of the ROI are consistent with the input requirements of the scene classification model, or the ROI is scaled so that its length and width meet the input requirements of the scene classification model. The ROI position should be selected so that it covers as much of the image as possible, including the road, sky, distant objects, street lamps and the like, to preserve image information such as road type, weather and illumination conditions for the scene classification model. The ROI should also be selected to avoid image data that is irrelevant to the scene or repetitive, such as the hood of the vehicle or too much sky; selecting the middle part of the image is generally sufficient. Finally, this step outputs image data with the same length and width as the input of the scene classification model.
S303, after obtaining the image data from S302, the pre-trained scene classification model outputs, by inference, the classification result [a_1, a_2, …, a_n], where a_i is the confidence corresponding to each scene, 0 ≤ a_i ≤ 1 and Σ a_i = 1. In the overall architecture, the a_i serve as the weights of the respective scenes. Because its task is simple, the scene classification model can perform inference on the small-size image obtained after the ROI is cropped out in S302. The final step of the scene classification model is to normalize the output, which can be processed using the softmax function a_i = e^{e_i} / Σ_{k=1}^{K} e^{e_k}, where K is the length of the input vector and e_i is the i-th input of the softmax.
In step S304, the image information acquired in step S301 is input into the recognition model of each scene, and the recognition model of each scene performs inference and outputs the recognition objects and their confidences in that scene. For example, in one embodiment, the inference for each scene is performed independently, without interaction between scenes, and each scene may use a different model structure suited to its own scene features. For example, for the dusk scene, a color space conversion can be added in front of the neural network; for rainy scenes, image denoising or super-resolution processing can be added in front of the neural network. Taking the object detection model YOLO as an example, the byte form of the model output can be as shown in FIG. 6; after non-maximum suppression (NMS) processing, the starting coordinates, length and width of the detection frame and the category can be obtained, and the confidence can be the highest item in the category scores. For a perception model that has no confidence output, the confidence may be taken to be 1. As an example, the procedure of the non-maximum suppression processing may be as shown in FIG. 7.
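For reference, a compact sketch of the non-maximum suppression step mentioned above is given below; the IoU threshold of 0.5 is an illustrative assumption.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    order = list(np.argsort(scores)[::-1])   # highest confidence first
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        # drop remaining boxes that overlap the kept box too much
        order = [j for j in order if iou(boxes[i], boxes[j]) < iou_thresh]
    return keep
```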
S305, using the scene classification result [a_1, a_2, …, a_n] output in S303, a weighting operation is performed on the results output by step S304, and the perception result and its confidence weighted by the scene classification result are obtained.
S306, judging the weighted confidence coefficient by using a threshold, wherein the threshold is mainly determined manually by referring to parameters such as accuracy, recall rate and the like which are verified and tested by the whole model, and generally, 0.7-0.95 is directly taken. And outputting the perception result which is larger than the threshold value as a final result.
In the above embodiment, in order to reduce the computation consumed by the inference process of the scene classification model, ROI selection and cropping are performed on the input image beforehand, reducing the image size and therefore the computation. However, since inference is performed for every scene, the more scene models there are, the larger the amount of computation becomes.
In order to reduce the amount of computation, in another embodiment, only the one recognition model corresponding to the scene with the highest confidence, i.e. the most probable scene, may be used. This embodiment is applicable to the case of a single scene, where the coexistence of multiple scenes, such as dusk together with rain, can be excluded. As an example, as shown in FIG. 8, a specific implementation of this embodiment may include steps S401 to S405:
S401 to S402: these steps are the same as S301 to S302.
S403, after the extracted ROI image data is input into the scene classification model, a scene classification result is output, which can take the same form as in S303, i.e. [a_1, a_2, …, a_n]. The scene classification model in this step need not normalize its output during inference, because S404 takes the maximum value in the output vector of S403 and uses its corresponding scene model for inference.
S404, from the vector output in S403, the recognition model corresponding to the scene with the maximum confidence is selected as the inference model. Recognition inference is performed using the image data obtained from the vehicle in S401, and the recognition result and its confidence are output.
S405, the confidence obtained in step S404 is judged using a confidence threshold, which can be determined mainly by referring to parameters such as accuracy and recall rate obtained from verification and testing of the whole model; optionally, the confidence threshold is generally taken directly as 0.7-0.95. As one example, a recognized object obtained in step S404 with a confidence greater than the confidence threshold may be output as the final result.
In summary, in the related art the sample size for long-tail scenes is small, so environment perception performance with a single model is low in long-tail scenes, and if strategies such as resampling are used, the general representation capability of the model is affected. In this embodiment, the automobile environment perception scenes are classified, the corresponding data are used for retraining each category, and during inference the corresponding scene models are used with the scene classification results as weights, so that general and long-tail scenes are isolated and data with few common features are not trained and inferred in a single model; long-tail scenes receive more attention and special scenes are handled in a targeted manner, yielding a high recognition rate; and the method has higher controllability and can be purposefully optimized for scenes that are difficult to recognize. In addition, to address the difficulty of learning features when the sample feature distribution is uneven and samples are few, samples with similar feature distributions are grouped into one scene as a training set, and a single perception model is trained on it.
FIG. 9 is a block diagram of an environment awareness system according to an exemplary embodiment. As shown in FIG. 9, the system 50 may include: an image information acquisition module 51, a classification result acquisition module 52, a perception result determination module 53, and an output module 54. Wherein:
the image information obtaining module 51 is configured to obtain image information of a current environment.
The classification result obtaining module 52 is configured to input the image information to a pre-trained scene classification model, and obtain a classification result output by the scene classification model, where the classification result includes each of the plurality of scenes and a first confidence coefficient corresponding to each scene.
The sensing result determining module 53 is configured to determine a sensing result of the current environment according to the classification result and a recognition result output by at least one recognition model based on the image information, where the recognition result includes a recognized recognition object and a second confidence coefficient corresponding to the recognition object, and each scene corresponds to a pre-trained recognition model.
And the output module 54 is configured to output the sensing result if the sensing result meets a preset condition.
In some embodiments, the sensing result determining module 53 includes:
and the target scene determining sub-module is used for acquiring a scene with the maximum first confidence degree from the plurality of scenes as a target scene according to the classification result.
And the perception result acquisition sub-module is used for inputting the image information into the recognition model corresponding to the target scene, acquiring the recognition result output by the recognition model corresponding to the target scene, and taking the recognition result as the perception result of the current environment.
In some embodiments, the output module 54 is specifically configured to output the recognition result as the sensing result if the second confidence level in the recognition result is greater than or equal to the first threshold.
In some embodiments, the sensing result determining module 53 includes:
the recognition result acquisition sub-module is used for respectively inputting the image information into a recognition model corresponding to each scene to obtain a recognition result corresponding to each scene, wherein the recognition result comprises a recognition object and a second confidence coefficient corresponding to the recognition object.
The second confidence coefficient acquisition sub-module is used for carrying out weighting processing on the second confidence coefficient of each recognition object in each scene and the first confidence coefficient corresponding to each scene aiming at each recognition object to obtain the weighted second confidence coefficient of the recognition object.
The perception result determining submodule is used for determining each recognition object and the weighted second confidence coefficient of each recognition object as the perception result.
In some embodiments, the output module 54 is specifically configured to output, as the sensing result, the recognition result corresponding to the weighted second confidence coefficient if the weighted second confidence coefficient is greater than or equal to a second threshold.
In some embodiments, the second confidence coefficient obtaining sub-module is specifically configured to normalize the first confidence coefficient corresponding to each scene in the classification result to obtain a processed first confidence coefficient; and for each recognition object, weighting the second confidence coefficient of the recognition object in each scene and the processed first confidence coefficient corresponding to each scene to obtain the weighted second confidence coefficient of the recognition object.
In some embodiments, the classification result acquisition module 52 includes:
and the interest region extraction submodule is used for extracting the interest region from the image information through a preset rule.
The classification result obtaining sub-module is used for inputting the region of interest into a pre-trained scene classification model and obtaining a classification result output by the scene classification model.
In some embodiments, the system 50 further comprises:
the sample acquisition module is used for acquiring a plurality of image samples of the identified objects calibrated in advance;
the labeling module is used for labeling the scenes of the plurality of image samples to obtain labeled image samples;
and the training module is used for training to obtain the scene classification model based on the noted image sample.
In some embodiments, the system 50 further comprises:
the quantity requirement determining module is used for determining that the image samples corresponding to each scene in the annotated image samples meet the preset quantity requirement.
In some embodiments, the system 50 further comprises:
the recognition model training module is used for training to obtain a recognition model corresponding to each scene based on the image sample corresponding to the scene for each scene in the plurality of scenes, wherein the plurality of scenes comprise a general scene and a long tail scene, and the image sample corresponding to the general scene comprises the image sample with the label of the long tail scene.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
Fig. 10 is a block diagram illustrating an environment-aware device 700 according to an example embodiment. As shown in fig. 10, the context awareness apparatus 700 may include a processor 701 and a memory 702, which may be provided in an in-vehicle terminal of the environment sensing device. The context aware device 700 may also include one or more of a multimedia component 703, an input/output (I/O) interface 704, and a communication component 705.
Wherein the processor 701 is configured to control the overall operation of the environment-aware device 700 to perform all or part of the above-described environment-aware method. The memory 702 is used to store various types of data to support operation at the context awareness apparatus 700, which may include, for example, instructions for any application or method operating on the context awareness apparatus 700, as well as application related data, such as contact data, messages sent and received, pictures, audio, video, and so forth. The Memory 702 may be implemented by any type or combination of volatile or non-volatile Memory devices, such as static random access Memory (Static Random Access Memory, SRAM for short), electrically erasable programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM for short), erasable programmable Read-Only Memory (Erasable Programmable Read-Only Memory, EPROM for short), programmable Read-Only Memory (Programmable Read-Only Memory, PROM for short), read-Only Memory (ROM for short), magnetic Memory, flash Memory, magnetic disk, or optical disk. The multimedia component 703 can include a screen and an audio component. Wherein the screen may be, for example, a touch screen, the audio component being for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 702 or transmitted through the communication component 705. The audio assembly further comprises at least one speaker for outputting audio signals. The I/O interface 704 provides an interface between the processor 701 and other interface modules, which may be a keyboard, mouse, buttons, etc. These buttons may be virtual buttons or physical buttons. The communication component 705 is configured to provide wired or wireless communication between the context awareness apparatus 700 and other devices. Wireless communication, such as Wi-Fi, bluetooth, near field communication (Near Field Communication, NFC for short), 2G, 3G, 4G, NB-IOT, eMTC, or other 5G, etc., or one or a combination of more of them, is not limited herein. The corresponding communication component 705 may thus comprise: wi-Fi module, bluetooth module, NFC module, etc.
In an exemplary embodiment, the context awareness apparatus 700 may be implemented by one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASIC), digital signal processor (Digital Signal Processor, abbreviated as DSP), digital signal processing device (Digital Signal Processing Device, abbreviated as DSPD), programmable logic device (Programmable Logic Device, abbreviated as PLD), field programmable gate array (Field Programmable Gate Array, abbreviated as FPGA), controller, microcontroller, microprocessor, or other electronic component for performing the context awareness method described above.
In another exemplary embodiment, a computer readable storage medium is also provided comprising program instructions which, when executed by a processor, implement the steps of the context-aware method described above. For example, the computer readable storage medium may be the memory 702 including program instructions described above, which are executable by the processor 701 of the context-aware apparatus 700 to perform the context-aware method described above.
In another exemplary embodiment, there is also provided a vehicle including the environment sensing device of the above embodiment.
Fig. 11 is a block diagram illustrating a server 1900 according to an example embodiment. For example, the server 1900 may be provided as a server. Referring to fig. 11, the server 1900 includes a processor 1922, which may be one or more in number, and a memory 1932 for storing computer programs executable by the processor 1922. The computer program stored in memory 1932 may include one or more modules each corresponding to a set of instructions. Further, the processor 1922 may be configured to execute the computer program to perform the context awareness method described above.
In addition, the server 1900 may further include a power component 1926 and a communication component 1950, the power component 1926 may be configured to perform power management of the server 1900, and the communication component 1950 may be configured to enable communication of the server 1900, e.g., wired or wireless communication. In addition, the server 1900 may also include an input/output (I/O) interface 1958. The server 1900 may operate an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, and the like.
In another exemplary embodiment, a computer readable storage medium is also provided comprising program instructions which, when executed by a processor, implement the steps of the context-aware method described above. For example, the non-transitory computer readable storage medium may be the memory 1932 described above including program instructions that are executable by the processor 1922 of the server 1900 to perform the context awareness method described above.
In another exemplary embodiment, a computer program product is also provided, comprising a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-described environment sensing method when executed by the programmable apparatus.
The preferred embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, but the present disclosure is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present disclosure within the scope of the technical concept of the present disclosure, and all the simple modifications belong to the protection scope of the present disclosure.
In addition, the specific features described in the foregoing embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, the present disclosure does not further describe various possible combinations.
Moreover, the various embodiments of the present disclosure may be combined in any manner as long as the combination does not depart from the spirit of the present disclosure, and any such combination should likewise be regarded as content disclosed by the present disclosure.

Claims (13)

1. A method of environmental awareness, comprising:
Acquiring image information of a current environment;
inputting the image information into a pre-trained scene classification model, and acquiring a classification result output by the scene classification model, wherein the classification result comprises each scene in a plurality of scenes and a first confidence coefficient corresponding to each scene;
determining a perception result of the current environment according to the classification result and a recognition result output by at least one recognition model based on the image information, wherein the recognition result comprises a recognized recognition object and a second confidence coefficient corresponding to the recognition object, and each scene corresponds to a pre-trained recognition model;
and if the perception result meets a preset condition, outputting the perception result.
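For illustration only, a minimal Python sketch of the flow recited in claim 1 follows. The model interfaces, the callable arguments, and the idea of returning None when the preset condition fails are assumptions of this sketch, not definitions from the claim.

```python
# Minimal sketch of the flow in claim 1 (illustrative; all interfaces assumed).
def environment_perception(image, scene_classifier, recognition_models,
                           determine_result, meets_preset_condition):
    # Classification result: every scene paired with its first confidence
    # coefficient, e.g. {"general": 0.8, "tunnel": 0.1, "heavy_rain": 0.1}.
    classification = scene_classifier(image)

    # Each scene has its own pre-trained recognition model; the perception
    # result is determined from the classification result and the recognition
    # result(s) (claims 2 and 4 give two concrete variants of this step).
    perception = determine_result(image, classification, recognition_models)

    # Output the perception result only if it meets the preset condition
    # (confidence thresholds, see claims 3 and 5).
    return perception if meets_preset_condition(perception) else None
```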
2. The method according to claim 1, wherein the determining the perception result of the current environment according to the classification result and the recognition result output by the at least one recognition model based on the image information comprises:
according to the classification result, acquiring a scene with the maximum first confidence from the plurality of scenes as a target scene;
and inputting the image information into the recognition model corresponding to the target scene, acquiring a recognition result output by the recognition model corresponding to the target scene, and taking the recognition result as a perception result of the current environment.
3. The method according to claim 2, wherein the outputting the perception result if the perception result meets the preset condition comprises:
and if the second confidence coefficient in the recognition result is greater than or equal to a first threshold, outputting the recognition result as the perception result.
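A sketch of the target-scene variant of claims 2 and 3 is given below; the dictionary layout of the classification result, the list-of-tuples recognition output, and the default value of the first threshold are assumptions made only for illustration.

```python
# Sketch of claims 2-3: run only the recognition model of the target scene,
# then keep results whose second confidence clears a first threshold.
def determine_by_target_scene(image, classification, recognition_models,
                              first_threshold=0.5):
    # Target scene = the scene with the maximum first confidence coefficient.
    target_scene = max(classification, key=classification.get)

    # Recognition result of the model pre-trained for that scene, assumed
    # here to be a list of (recognition_object, second_confidence) tuples.
    recognition_result = recognition_models[target_scene](image)

    # Preset condition of claim 3: second confidence >= first threshold.
    return [(obj, conf) for obj, conf in recognition_result
            if conf >= first_threshold]
```

Running a single scene-specific model in this way keeps the inference cost close to that of the single-model approach while still specializing to the detected scene.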
4. The method according to claim 1, wherein the determining the perception result of the current environment according to the classification result and the recognition result output by the at least one recognition model based on the image information comprises:
respectively inputting the image information into a recognition model corresponding to each scene to obtain a recognition result corresponding to each scene, wherein the recognition result comprises a recognition object and a second confidence coefficient corresponding to the recognition object;
for each recognition object, carrying out weighting processing on the second confidence coefficient of the recognition object under each scene and the first confidence coefficient corresponding to each scene to obtain a weighted second confidence coefficient of the recognition object;
and determining the second confidence coefficient of each recognition object and the weighted second confidence coefficient of each recognition object as the perception result.
5. The method of claim 4, wherein the outputting the perception result if the perception result meets the preset condition comprises:
and if the weighted second confidence coefficient is greater than or equal to a second threshold value, outputting a recognition result corresponding to the weighted second confidence coefficient as the perception result.
6. The method according to claim 4, wherein for each recognition object, weighting the second confidence coefficient of the recognition object under each scene and the first confidence coefficient corresponding to each scene to obtain the weighted second confidence coefficient of the recognition object, includes:
normalizing the first confidence coefficient corresponding to each scene in the classification result to obtain a processed first confidence coefficient;
and for each recognition object, carrying out weighting processing on the second confidence coefficient of the recognition object under each scene and the processed first confidence coefficient corresponding to each scene to obtain a weighted second confidence coefficient of the recognition object.
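One possible reading of claims 4 to 6 is sketched below: every scene's recognition model is run, the first confidence coefficients are normalized, and each object's second confidences are combined as a weighted sum. The data layout and the second-threshold default are assumptions of this sketch.

```python
from collections import defaultdict

# Sketch of claims 4-6: weighted fusion of per-scene recognition results.
def determine_by_weighting(image, classification, recognition_models,
                           second_threshold=0.5):
    # Claim 6: normalize the first confidence coefficients of all scenes.
    total = sum(classification.values()) or 1.0
    scene_weights = {scene: conf / total for scene, conf in classification.items()}

    # Claim 4: recognition result of every scene's model on the same image,
    # assumed to be a list of (recognition_object, second_confidence) tuples.
    weighted = defaultdict(float)
    for scene, model in recognition_models.items():
        for obj, second_conf in model(image):
            # Weight the second confidence by the processed first confidence.
            weighted[obj] += scene_weights.get(scene, 0.0) * second_conf

    # Claim 5: output only objects whose weighted second confidence
    # is greater than or equal to the second threshold.
    return {obj: w for obj, w in weighted.items() if w >= second_threshold}
```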
7. The method according to any one of claims 1 to 6, wherein inputting the image information into a pre-trained scene classification model and obtaining a classification result output by the scene classification model comprises:
Extracting a region of interest from the image information through a preset rule;
inputting the region of interest into a pre-trained scene classification model, and obtaining a classification result output by the scene classification model.
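As an illustration of claim 7, the crop below stands in for the "preset rule"; the fixed crop fractions are arbitrary assumptions, and the image is assumed to be an H×W(×C) array such as a NumPy array.

```python
# Sketch of claim 7: extract a region of interest by a preset rule, then
# classify the scene on that region instead of the full frame.
def extract_roi(image_array, top=0.2, bottom=0.9, left=0.05, right=0.95):
    h, w = image_array.shape[:2]
    return image_array[int(h * top):int(h * bottom),
                       int(w * left):int(w * right)]

def classify_scene_on_roi(image_array, scene_classifier):
    roi = extract_roi(image_array)      # preset rule applied first
    return scene_classifier(roi)        # classification result on the ROI
```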
8. The method according to any one of claims 1 to 6, further comprising:
acquiring a plurality of image samples with pre-labeled recognition objects;
performing scene labeling on the plurality of image samples to obtain labeled image samples; and
training to obtain the scene classification model based on the labeled image samples.
9. The method of claim 8, further comprising, prior to the training to obtain the scene classification model based on the labeled image samples:
determining that the image samples corresponding to each scene in the labeled image samples meet a preset quantity requirement.
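A sketch of the training flow of claims 8 and 9 follows; the sample structure, the label_scene and train callables, and the minimum count per scene are assumptions introduced only for illustration.

```python
from collections import Counter

# Sketch of claims 8-9: scene-label the pre-labeled image samples, check the
# per-scene sample counts, then train the scene classification model.
def build_scene_classifier(samples, label_scene, train, min_per_scene=1000):
    # Claim 8: perform scene labeling on the plurality of image samples.
    labeled = [(image, label_scene(image)) for image, _objects in samples]

    # Claim 9: each scene must meet the preset quantity requirement.
    counts = Counter(scene for _image, scene in labeled)
    lacking = [scene for scene, n in counts.items() if n < min_per_scene]
    if lacking:
        raise ValueError(f"insufficient samples for scenes: {lacking}")

    # Train the scene classification model on the labeled image samples.
    return train(labeled)
```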
10. The method according to any one of claims 1 to 6, further comprising:
training to obtain a recognition model corresponding to each scene based on image samples corresponding to the scene, wherein the plurality of scenes comprise a general scene and a long-tail scene, and the image samples corresponding to the general scene comprise image samples labeled with the long-tail scene.
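Finally, a sketch of claim 10: one recognition model is trained per scene, and the general scene's training set additionally receives the long-tail scene samples. The "general" scene name and the train_detector callable are assumptions of this sketch.

```python
# Sketch of claim 10: per-scene recognition models; long-tail samples are
# additionally folded into the general scene's training set.
def build_recognition_models(samples_by_scene, train_detector,
                             general_scene="general"):
    models = {}
    for scene, samples in samples_by_scene.items():
        training_set = list(samples)
        if scene == general_scene:
            # Image samples labeled with long-tail scenes are also included
            # in the general-scene training set.
            for other_scene, other_samples in samples_by_scene.items():
                if other_scene != general_scene:
                    training_set.extend(other_samples)
        models[scene] = train_detector(training_set)
    return models
```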
11. A non-transitory computer readable storage medium having stored thereon a computer program, characterized in that the program when executed by a processor realizes the steps of the method according to any of claims 1 to 10.
12. An environmental awareness apparatus, comprising:
a memory having a computer program stored thereon;
a processor for executing the computer program in the memory to implement the steps of the method of any one of claims 1 to 10.
13. A vehicle, characterized by comprising: the environmental awareness apparatus of claim 12.
CN202211230109.5A 2022-10-09 2022-10-09 Environment sensing method, device, storage medium and vehicle Pending CN117893978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211230109.5A CN117893978A (en) 2022-10-09 2022-10-09 Environment sensing method, device, storage medium and vehicle

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211230109.5A CN117893978A (en) 2022-10-09 2022-10-09 Environment sensing method, device, storage medium and vehicle

Publications (1)

Publication Number Publication Date
CN117893978A true CN117893978A (en) 2024-04-16

Family

ID=90647405

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211230109.5A Pending CN117893978A (en) 2022-10-09 2022-10-09 Environment sensing method, device, storage medium and vehicle

Country Status (1)

Country Link
CN (1) CN117893978A (en)

Similar Documents

Publication Publication Date Title
US11694430B2 (en) Brake light detection
CN110399856B (en) Feature extraction network training method, image processing method, device and equipment
US8773535B2 (en) Adaptation for clear path detection using reliable local model updating
CN110163074B (en) Method for providing enhanced road surface condition detection based on image scene and ambient light analysis
JP2018514011A (en) Environmental scene state detection
US8681222B2 (en) Adaptation for clear path detection with additional classifiers
US20210018590A1 (en) Perception system error detection and re-verification
CN110135302B (en) Method, device, equipment and storage medium for training lane line recognition model
CN114418895A (en) Driving assistance method and device, vehicle-mounted device and storage medium
CN112750323A (en) Management method, apparatus and computer storage medium for vehicle safety
CN112307978A (en) Target detection method and device, electronic equipment and readable storage medium
US20230331250A1 (en) Method and apparatus for configuring deep learning algorithm for autonomous driving
US20230071836A1 (en) Scene security level determination method, device and storage medium
CN112750170A (en) Fog feature identification method and device and related equipment
CN117152513A (en) Vehicle boundary positioning method for night scene
CN117218622A (en) Road condition detection method, electronic equipment and storage medium
CN114419603A (en) Automatic driving vehicle control method and system and automatic driving vehicle
CN116630866B (en) Abnormal event monitoring method, device, equipment and medium for audio-video radar fusion
CN112509321A (en) Unmanned aerial vehicle-based driving control method and system for urban complex traffic situation and readable storage medium
US11748664B1 (en) Systems for creating training data for determining vehicle following distance
CN117893978A (en) Environment sensing method, device, storage medium and vehicle
CN116363100A (en) Image quality evaluation method, device, equipment and storage medium
CN111160282A (en) Traffic light detection method based on binary Yolov3 network
CN112435475B (en) Traffic state detection method, device, equipment and storage medium
US20220058416A1 (en) Printed character recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination