CN113239901A - Scene recognition method, device, equipment and storage medium - Google Patents

Scene recognition method, device, equipment and storage medium Download PDF

Info

Publication number
CN113239901A
CN113239901A · application CN202110674250.3A
Authority
CN
China
Prior art keywords
scene
sequence
driving
network
prediction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110674250.3A
Other languages
Chinese (zh)
Other versions
CN113239901B (en)
Inventor
李潇
丁曙光
杜挺
袁克彬
任冬淳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sankuai Online Technology Co Ltd
Original Assignee
Beijing Sankuai Online Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sankuai Online Technology Co Ltd filed Critical Beijing Sankuai Online Technology Co Ltd
Priority to CN202110674250.3A priority Critical patent/CN113239901B/en
Publication of CN113239901A publication Critical patent/CN113239901A/en
Application granted granted Critical
Publication of CN113239901B publication Critical patent/CN113239901B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V20/54 — Scenes; context or environment of the image; surveillance or monitoring of activities of traffic, e.g. cars on the road, trains or boats
    • G06F18/24 — Pattern recognition; analysing; classification techniques
    • G06N3/044 — Neural networks; architecture; recurrent networks, e.g. Hopfield networks
    • G06N3/045 — Neural networks; architecture; combinations of networks
    • G06N3/08 — Neural networks; learning methods
    • G06V10/40 — Image or video recognition or understanding; extraction of image or video features
    • G06V2201/08 — Indexing scheme relating to image or video recognition or understanding; detecting or categorising vehicles

Abstract

The application provides a scene recognition method, apparatus, device and storage medium, and belongs to the field of computer technologies. The method comprises the following steps: calling a scene feature extraction network and a scene prediction network to perform scene prediction based on a first scene sequence of a first driving scene, obtaining a second scene sequence; training the scene feature extraction network and the scene prediction network based on the second scene sequence and a third scene sequence of the first driving scene; calling the trained scene feature extraction network and a scene classification network to perform scene classification based on a scene sequence of a second driving scene, obtaining a prediction category label; training the scene classification network based on the scene category label and the prediction category label of the second driving scene; and obtaining a scene recognition model comprising the trained scene feature extraction network and the trained scene classification network. With this method a scene recognition model can be obtained, and performing scene recognition through the scene recognition model improves the accuracy of scene recognition.

Description

Scene recognition method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for scene recognition.
Background
Scene understanding plays an important role in the field of autonomous driving. Scene understanding refers to recognizing different driving scenes, such as overtaking, meeting and car-following scenes. Only with accurate scene understanding can the driving strategy be determined in a targeted manner and the safety of autonomous driving be guaranteed.
In the related art, the scene category of a driving scene is generally determined by rules: it is judged whether the scene data of the driving scene meets the condition corresponding to a certain scene category, and if so, that scene category is taken as the scene category of the driving scene. However, because the conditions used for scene recognition are designed manually, scene recognition is limited by human experience and its accuracy is low.
Disclosure of Invention
The embodiments of the application provide a scene recognition method, apparatus, device and storage medium, by which a scene recognition model can be obtained and scene recognition can be performed through the scene recognition model, thereby improving the accuracy of scene recognition. The technical scheme is as follows:
in one aspect, a scene recognition method is provided, and the method includes:
calling a scene feature extraction network and a scene prediction network, and performing scene prediction based on a first scene sequence of a first driving scene to obtain a second scene sequence of the first driving scene, wherein the first scene sequence comprises sample scene data corresponding to at least one moment in a first time period, the second scene sequence comprises predicted scene data corresponding to at least one moment in a second time period, and the first time period is earlier than the second time period;
training the scene feature extraction network and the scene prediction network based on the second scene sequence and a third scene sequence of the first driving scene, the third scene sequence including sample scene data corresponding to at least one moment in the second time period;
calling the trained scene feature extraction network and scene classification network, and carrying out scene classification based on a scene sequence of a second driving scene to obtain a prediction class label;
training the scene classification network based on the scene class label and the prediction class label of the second driving scene;
and acquiring a scene recognition model, wherein the scene recognition model comprises the trained scene feature extraction network and the trained scene classification network.
In a possible implementation manner, the invoking a scene feature extraction network and a scene prediction network, performing scene prediction based on a first scene sequence of a first driving scene, and obtaining a second scene sequence of the first driving scene includes:
calling the scene feature extraction network, and performing feature extraction on the first scene sequence to obtain a first scene feature;
and calling the scene prediction network, and performing scene prediction based on the first scene characteristics to obtain the second scene sequence.
In one possible implementation, the training the scene feature extraction network and the scene prediction network based on the second scene sequence and the third scene sequence of the first driving scene includes:
determining a first loss value based on the second scene sequence and the third scene sequence, wherein the first loss value is used for representing the similarity between the second scene sequence and the third scene sequence;
training the scene feature extraction network and the scene prediction network based on the first loss value.
In a possible implementation manner, the invoking the trained scene feature extraction network and the scene classification network, and performing scene classification based on a scene sequence of a second driving scene to obtain a prediction class label includes:
calling the trained scene feature extraction network, and performing feature extraction on the scene sequence of the second driving scene to obtain second scene features;
and calling the scene classification network, and carrying out scene classification based on the second scene characteristics to obtain the prediction class label.
In one possible implementation, the training the scene classification network based on the scene class label and the prediction class label of the second driving scene includes:
determining a second loss value based on the scene category label and the prediction category label, wherein the second loss value is used for representing the similarity between the scene category label and the prediction category label;
training the scene classification network based on the second loss value.
In a possible implementation manner, in the first scene sequence, the sample scene data corresponding to any time includes: first state data corresponding to the time of the autonomous vehicle in the first driving scene, second state data corresponding to target vehicles around the autonomous vehicle at the time, and relative position data of the target vehicles with the autonomous vehicle as a reference.
In a possible implementation manner, the sample scene data corresponding to any time further includes a target image, and the method further includes:
drawing a target image corresponding to the time based on the first state data, the second state data and the relative position data corresponding to the time, wherein the target image is used for representing the postures of the automatic driving vehicle and the target vehicle and the relative position relation of the automatic driving vehicle and the target vehicle at the time.
In one possible implementation, the method further includes:
for any moment, dividing a target area where the automatic driving vehicle is located into 9 grids so that the automatic driving vehicle is located in the middle grid of the target area;
determining grid identifications of the 9 grids;
determining a first grid identification of a grid where the autonomous vehicle is located and a second grid identification of a grid where the target vehicle is located;
determining the first grid identification and the second grid identification as the relative position data.
In one possible implementation, the target vehicle is a target number of vehicles around the autonomous vehicle, the target vehicle being a distance from the autonomous vehicle that is less than distances of other vehicles in the first driving scenario from the autonomous vehicle.
In one possible implementation, the method further includes:
and calling the scene recognition model to recognize the scene category of any driving scene.
In a possible implementation manner, the invoking the scene recognition model to recognize a scene category of any driving scene includes:
acquiring a scene sequence of any driving scene;
and calling the scene recognition model, and carrying out scene classification based on the scene sequence of the driving scene to obtain a scene category label of the driving scene.
In another aspect, a scene recognition apparatus is provided, the apparatus including:
a scene prediction module, configured to invoke a scene feature extraction network and a scene prediction network, perform scene prediction based on a first scene sequence of a first driving scene to obtain a second scene sequence of the first driving scene, where the first scene sequence includes sample scene data corresponding to at least one time within a first time period, the second scene sequence includes predicted scene data corresponding to at least one time within a second time period, and the first time period is earlier than the second time period;
a first training module configured to train the scene feature extraction network and the scene prediction network based on the second scene sequence and a third scene sequence of the first driving scene, the third scene sequence including sample scene data corresponding to at least one time within the second time period;
the scene classification module is configured to call the trained scene feature extraction network and scene classification network, and perform scene classification based on a scene sequence of a second driving scene to obtain a prediction category label;
a second training module configured to train the scene classification network based on the scene class label and the prediction class label of the second driving scene;
a model obtaining module configured to obtain a scene recognition model, where the scene recognition model includes the trained scene feature extraction network and the trained scene classification network.
In a possible implementation manner, the scene prediction module is configured to invoke the scene feature extraction network, and perform feature extraction on the first scene sequence to obtain a first scene feature; and calling the scene prediction network, and performing scene prediction based on the first scene characteristics to obtain the second scene sequence.
In a possible implementation manner, the first training module is configured to determine a first loss value based on the second scene sequence and the third scene sequence, where the first loss value is used to represent a similarity between the second scene sequence and the third scene sequence; training the scene feature extraction network and the scene prediction network based on the first loss value.
In a possible implementation manner, the scene classification module is configured to invoke the trained scene feature extraction network, and perform feature extraction on a scene sequence of the second driving scene to obtain a second scene feature; and calling the scene classification network, and carrying out scene classification based on the second scene characteristics to obtain the prediction class label.
In a possible implementation, the second training module is configured to determine a second loss value based on the scene class label and the prediction class label, where the second loss value is used to represent a similarity between the scene class label and the prediction class label; training the scene classification network based on the second loss value.
In a possible implementation manner, in the first scene sequence, the sample scene data corresponding to any time includes: first state data corresponding to the time of the autonomous vehicle in the first driving scene, second state data corresponding to target vehicles around the autonomous vehicle at the time, and relative position data of the target vehicles with the autonomous vehicle as a reference.
In a possible implementation manner, the sample scene data corresponding to any time further includes a target image, and the apparatus further includes: an image acquisition module configured to draw a target image corresponding to the time based on the first state data, the second state data and the relative position data corresponding to the time, the target image being used to represent the posture of the autonomous vehicle and the target vehicle and the relative position relationship of the autonomous vehicle and the target vehicle at the time.
In one possible implementation, the apparatus further includes:
a position data acquisition module configured to divide a target area where the autonomous vehicle is located into 9 grids for the any one time so that the autonomous vehicle is located in a grid in the middle of the target area; determining grid identifications of the 9 grids; determining a first grid identification of a grid where the autonomous vehicle is located and a second grid identification of a grid where the target vehicle is located; determining the first grid identification and the second grid identification as the relative position data.
In one possible implementation, the target vehicle is a target number of vehicles around the autonomous vehicle, the target vehicle being a distance from the autonomous vehicle that is less than distances of other vehicles in the first driving scenario from the autonomous vehicle.
In one possible implementation, the apparatus further includes:
and the scene recognition module is configured to call the scene recognition model and recognize the scene category of any driving scene.
In one possible implementation, the scene recognition module is configured to acquire a scene sequence of any driving scene; and calling the scene recognition model, and carrying out scene classification based on the scene sequence of the driving scene to obtain a scene category label of the driving scene.
In another aspect, a computer device is provided, which includes a processor and a memory, where at least one program code is stored, and the program code is loaded by the processor and executed to implement the operations performed in the scene recognition method in any one of the above possible implementation manners.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the program code is loaded and executed by a processor to implement the operations performed in the scene recognition method in any one of the above possible implementation manners.
In another aspect, a computer program product is provided, where the computer program product includes at least one program code, and the program code is loaded and executed by a processor to implement the operations performed in the scene recognition method in any one of the above possible implementation manners.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
in the embodiment of the application, the category of the driving scene is not identified through a manually designed condition, but the scene identification model is trained through the sample scene data, so that the scene identification model learns the scene characteristics of different driving scenes, and thus different driving scenes can be identified, the limitation of human experience can be eliminated, and the accuracy of scene identification is improved. And when training the scene recognition model, the scene feature extraction network and the scene classification network in the scene recognition model are trained in two stages, in the first training stage, the scene feature extraction network is trained through the data of the first driving scene without labeling, so that the scene recognition model can extract the scene features, in the second training stage, the scene classification network is trained through the data of the second driving scene with the scene class label to be labeled, so that the scene recognition model can perform scene classification based on the scene features, the training effect of the scene recognition model can be ensured, and the labeling amount of the training data can be reduced.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic illustration of an implementation environment provided by an embodiment of the present application;
fig. 2 is a flowchart of a scene recognition method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a target area in which an autonomous vehicle is located according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a target image provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a target image provided by an embodiment of the present application;
FIG. 6 is a schematic diagram of a target image provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of a training process of a scene recognition model according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing procedure of a first training phase provided by an embodiment of the present application;
FIG. 9 is a diagram illustrating a data processing procedure of a second training phase according to an embodiment of the present disclosure;
fig. 10 is a block diagram of a scene recognition apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," "third," "fourth," and the like as used herein may be used herein to describe various concepts, but these concepts are not limited by these terms unless otherwise specified. These terms are only used to distinguish one concept from another. For example, a first scene sequence may be referred to as a scene sequence, and similarly, a second scene sequence may be referred to as a first scene sequence, without departing from the scope of the present application.
As used herein, the terms "at least one," "a plurality," "each," and "any," at least one of which includes one, two, or more than two, and a plurality of which includes two or more than two, each of which refers to each of the corresponding plurality, and any of which refers to any of the plurality. For example, the plurality of time instants includes 3 time instants, each of the 3 time instants refers to each of the 3 time instants, and any one of the 3 time instants refers to any one of the 3 time instants, which may be a first time instant, a second time instant, or a third time instant.
Fig. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present application. Referring to fig. 1, the implementation environment includes a terminal 101 and a server 102. The terminal 101 and the server 102 are connected via a wireless or wired network. Optionally, the terminal 101 is a computer, a mobile phone, a tablet computer, or other terminal. Optionally, the server 102 is a background server or a cloud server providing services such as cloud computing and cloud storage.
Optionally, the terminal 101 has installed thereon a target application served by the server 102, and the terminal 101 can implement functions such as data transmission, message interaction, and the like through the target application. Optionally, the target application is a target application in an operating system of the terminal 101, or a target application provided by a third party. The target application has a scene recognition function, namely, the target application can recognize the category of the driving scene. Alternatively, of course, the target application can also have other functions, such as performing simulation tests based on driving scenarios, and the like. Optionally, the target application is a simulation application or other applications, which is not limited in this embodiment of the application.
In the embodiment of the application, the terminal 101 or the server 102 is configured to train a scene recognition model, and perform scene recognition through the trained scene recognition model. For example, after the terminal 101 or the server 102 obtains a scene recognition model by training, the terminal 101 or the server 102 shares the scene recognition model, so that both the terminal 101 and the server 102 can perform scene recognition by the scene recognition model. Or after the server 102 trains the scene recognition model, the scene recognition model is not shared with the terminal 101, when the terminal 101 needs to determine the scene type of a certain driving scene, the scene sequence of the driving scene is uploaded to the server 102, the server 102 performs scene recognition based on the scene sequence, and the recognized scene type is returned to the terminal 101.
It should be noted that the embodiment of the present application is described by taking an example in which the implementation environment includes only the terminal 101 and the server 102, and in other embodiments, the implementation environment includes only the terminal 101 or the server 102. Training of the scene recognition model and scene recognition are realized by the terminal 101 or the server 102.
The scene recognition method provided by the application can be applied to an automatic driving scene, for example, after the server obtains the scene recognition model through the training of the method provided by the application, the scene recognition model is sent to a terminal in the automatic driving vehicle, in the driving process of the automatic driving vehicle, the terminal calls the stored scene recognition model to recognize the current driving scene in real time, and the driving strategy is determined based on the scene recognition result. For another example, after the server trains the scene recognition model, the calling interface of the scene recognition model is provided for the terminal in the automatic driving vehicle, in the driving process of the automatic driving vehicle, the terminal calls the scene recognition model for scene recognition by calling the calling interface provided by the server in real time, and the driving strategy is determined based on the scene recognition result. Of course, the scene recognition method provided by the embodiment of the present application can also be applied to other scenes, and the embodiment of the present application does not limit this.
Fig. 2 is a flowchart of a scene identification method according to an embodiment of the present application. Referring to fig. 2, the embodiment includes:
201. and calling a scene feature extraction network and a scene prediction network by the server, and predicting the scene based on a first scene sequence of the first driving scene to obtain a second scene sequence of the first driving scene.
The scene feature extraction network is used for extracting scene features in scene data of a driving scene. The scene prediction network is used for predicting the scene data corresponding to the driving scene at the future moment based on the scene characteristics of the scene data corresponding to the driving scene at the historical moment.
The driving scene includes scene elements, and the scene elements are objects constituting the driving scene. The scene elements include dynamic scene elements and static scene elements. The dynamic scene element refers to an element that can move in the scene, for example, a pedestrian, a vehicle, and the like on the road. Static scene elements refer to elements that cannot move in the driving scene, such as roadblocks, trees, etc. The scene element has corresponding state data indicating a state of the scene element. For example, for a scene element such as a vehicle, the state data includes data of speed, acceleration, direction, position, and the like.
The driving scenario corresponds to a time period, and at different times in the time period, the state data of the scenario elements in the driving scenario may change, for example, the speed of a certain vehicle in the driving scenario is 50 km/h at the previous time, and the speed of the vehicle becomes 60 km/h at the current time. As another example, at the previous time, another vehicle in the driving scene and the autonomous vehicle respectively travel in two lanes of the dual lane, while at the present time, the other vehicle merges from the front of the autonomous vehicle into the lane in which the autonomous vehicle is located.
In the embodiment of the present application, a sequence of scene data corresponding to a driving scene at least one time within a time period is referred to as a scene sequence corresponding to the driving scene in the time period. Correspondingly, the scene sequence corresponding to the driving scene in a certain time period comprises: and scene data corresponding to the driving scene at least one moment in the time period. The first scene sequence includes sample scene data corresponding to the first driving scene at least one time within the first time period. For example, the first scene sequence includes sample scene data corresponding to the first driving scene at three moments in time within the first time period. The second scene sequence comprises predicted scene data corresponding to the first driving scene at least one moment in a second time period. For example, the second scene sequence includes predicted scene data corresponding to the first driving scene at two times within the second time period. Wherein the first time period is earlier than the second time period. The sample scene data is actual scene data of the first driving scene, and the predicted scene data is the scene data of the first driving scene predicted by the scene feature extraction network and the scene prediction network, and is not the actual scene data of the first driving scene.
In a possible implementation manner, in the first scene sequence, the sample scene data corresponding to any time includes: first state data corresponding to the autonomous vehicle at the time in the first driving scenario, second state data corresponding to target vehicles around the autonomous vehicle at the time, and relative position data of the target vehicle at the time, the relative position data being referenced to the autonomous vehicle.
Optionally, the first status data comprises a speed, acceleration, direction, position, etc. of the autonomous vehicle at the time. Optionally, the second state data includes speed, acceleration, direction, position, etc. of the target vehicle at the time. Wherein the speed includes a lateral speed and a longitudinal speed. The acceleration includes lateral acceleration and longitudinal acceleration. The lateral direction is a direction perpendicular to the vehicle traveling direction, and the longitudinal direction is the vehicle traveling direction. Alternatively, the relative position data of the target vehicle may be in any form, for example, if the front left, front right, left, rear right, or right of the autonomous vehicle is used as the position of the target vehicle with reference to the autonomous vehicle, the position of the target vehicle is the relative position data with reference to the autonomous vehicle. For another example, the position of the autonomous vehicle is taken as the center position of the dial, and one of the directions from 1 o 'clock to 12 o' clock of the autonomous vehicle is taken as the position of the target vehicle, so that the position of the target vehicle is the relative position data with the autonomous vehicle as a reference.
In one possible implementation, the area where the autonomous vehicle is located is divided into a plurality of grids, and relative position data is determined based on grid identification, that is, for any time, the server divides the target area where the autonomous vehicle is located into 9 grids, so that the autonomous vehicle is located in the middle of the target area; determining grid identifications of 9 grids; determining a first grid identifier of a grid where the automatic driving vehicle is located and a second grid identifier of the grid where the target vehicle is located; the first grid identification and the second grid identification are determined as the relative position data.
Referring to fig. 3, the area where the autonomous vehicle is located is divided into 9 grids, wherein a first grid of the grid where the autonomous vehicle is located is identified as numeral 5, and when a second grid of the grid where a target vehicle is located is identified as 6, 5 and 6 are determined as relative position data of the target vehicle with the autonomous vehicle as a reference. Wherein, at each moment, the server determines the grid identifications of the 9 grids in an unchanged manner. For example, with continued reference to fig. 3, the server determines the grid id of the middle grid of the target area as 1 and the grid id of the adjacent grid of the middle grid in the lane direction as 2, i.e., determines the grid ids of the 9 grids in the manner shown in fig. 3, and then determines the grid ids in the manner shown in fig. 3 for the 9 grids in the target area where the autonomous vehicle is located at each time. Since the determination manner of the grid identifications of the 9 grids in the target area is unchanged at each moment, the grid identifications corresponding to the autonomous vehicle and the target vehicle can indicate the relative positions of the target vehicle with the autonomous vehicle as a reference.
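As a concrete illustration of this grid encoding, the following Python sketch assigns grid identifications from the longitudinal and lateral offsets between the two vehicles. The 3×3 layout with the autonomous vehicle in the cell identified as 5, the cell sizes, and the function names are assumptions made for illustration only and are not prescribed by this embodiment.

```python
# Minimal sketch of the 3x3 grid encoding described above (illustrative only).
# Grid identifiers are assigned in a fixed order so that the middle cell,
# where the autonomous vehicle sits, always receives the same identifier.

def grid_id(ego_xy, target_xy, cell_length=10.0, cell_width=3.5):
    """Return the grid identification (1..9) of `target_xy` relative to `ego_xy`.

    The target area is a 3x3 grid centred on the autonomous vehicle;
    rows advance along the lane direction, columns across it.
    Cell sizes are assumed values.
    """
    dx = target_xy[0] - ego_xy[0]          # longitudinal offset (lane direction)
    dy = target_xy[1] - ego_xy[1]          # lateral offset
    row = min(max(round(dx / cell_length), -1), 1)   # -1: behind, 0: beside, 1: ahead
    col = min(max(round(dy / cell_width), -1), 1)    # -1: left,   0: same,   1: right
    return (row + 1) * 3 + (col + 1) + 1   # identifications 1..9, ego cell = 5


# The relative position data for one target vehicle is then the pair
# (ego grid identification, target grid identification), e.g. (5, 6).
ego = (0.0, 0.0)
target = (2.0, 3.4)
relative_position = (grid_id(ego, ego), grid_id(ego, target))
```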
With continued reference to fig. 3, the grid identified as 5, in which the autonomous vehicle is located, has position coordinates corresponding to four coordinate values (ego_start_s, ego_end_s, ego_start_l, ego_end_l). Taking these as a reference, the position coordinates of a target vehicle in each grid of the target area are defined by the following formula (1):

    P_i^t = (s_start^{i,t}, s_end^{i,t}, l_start^{i,t}, l_end^{i,t})        (1)

where P_i^t denotes the position coordinates of the i-th target vehicle at time t; s_start^{i,t} and s_end^{i,t} denote its first and second coordinate values, and l_start^{i,t} and l_end^{i,t} its third and fourth coordinate values, each obtained from the corresponding reference values ego_start_s, ego_end_s, ego_start_l and ego_end_l (the first, second, third and fourth coordinate values of the autonomous vehicle) according to the grid identification of the grid in which the target vehicle is located.
In the embodiment of the application, the target area where the target vehicle is located is divided into 9 grids, the grid identifications of the automatic driving vehicle and the grids where the target vehicle is located are used for representing the relative position data of the target vehicle, which is based on the automatic driving vehicle as a reference, the representation form of the relative position data is simple and easy to implement, and the relative position relationship between the automatic driving vehicle and the target vehicle can be clearly indicated.
In one possible implementation, the target vehicle is a target number of vehicles around the autonomous vehicle that are less in distance from the autonomous vehicle than other vehicles in the first driving scenario. Alternatively, the target number is any number, such as 8.
In the embodiment of the application, since the category of the driving scene depends on the interactive behaviors of the autonomous vehicle and the scene elements in the driving scene, and the interactive behaviors of the vehicles closer to the autonomous vehicle and the autonomous vehicle in the driving scene are more, the target number of vehicles closer to the autonomous vehicle in the first driving scene are determined as the target vehicles, and the determined scene category of the driving scene is more accurate based on the state data and the relative position data of the target vehicles.
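One possible way of selecting these target vehicles is sketched below; the dictionary layout of the vehicle records, the Euclidean distance metric and the default target number of 8 are assumptions for illustration.

```python
import math

def select_target_vehicles(ego_xy, vehicles, target_number=8):
    """Return the `target_number` vehicles closest to the autonomous vehicle.

    `vehicles` is a list of dicts, each carrying at least an (x, y) position;
    this structure is assumed for illustration.
    """
    def distance(v):
        return math.hypot(v["x"] - ego_xy[0], v["y"] - ego_xy[1])
    return sorted(vehicles, key=distance)[:target_number]
```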
In a possible implementation manner, the sample scene data corresponding to any time further includes a target image, and the acquiring process of the target image includes: the server draws a target image corresponding to the time based on the first state data, the second state data and the relative position data corresponding to the time, the target image being used for representing the postures of the automatic driving vehicle and the target vehicle and the relative position relation of the automatic driving vehicle and the target vehicle at the time.
Alternatively, the target image is an image of any form. Optionally, the target image is a top view. Referring to fig. 4, fig. 4 is a schematic view of a target image. The target image comprises three vehicles, wherein the automatic driving vehicle runs on one lane of the two lanes, one target vehicle runs on the other lane of the two lanes, and the target vehicle in front of the automatic driving vehicle spans the two lanes to indicate that the target vehicle is changing lanes. Optionally, the target image is a bird's eye view. Referring to fig. 5, fig. 5 is a schematic view of a target image. The target image comprises 5 vehicles, wherein 4 target vehicles are parked in the garage in parallel, and the automatic driving vehicle backs up and enters the garage. Referring to fig. 6, fig. 6 is a schematic view of a target image. The target image comprises two vehicles, the automatic driving vehicle and the target vehicle run on the same lane, the automatic driving vehicle runs behind the target vehicle, the direction of the automatic driving vehicle deviates to the other lane, and the automatic driving vehicle is indicated to be in lane changing and overtaking.
In the embodiment of the application, the target image is drawn through the state data and the relative position data of the automatic driving vehicle and the target vehicle at any moment, the posture of the automatic driving vehicle and the target vehicle and the relative position relation of the automatic driving vehicle and the target vehicle are represented by the target image, then the target image is used as one item of data in the sample scene data, the data form and the data content of the sample scene data are enriched, and the scene recognition accuracy can be improved.
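A minimal sketch of how such a target image could be rendered is given below, assuming PIL is used for drawing and that each vehicle record carries a position, heading and bounding-box size; the image size, scale and grey levels are likewise assumed.

```python
from PIL import Image, ImageDraw
import math

def draw_target_image(ego, targets, size=256, scale=4.0):
    """Render a schematic top view of the scene at one moment (illustrative sketch).

    `ego` and each entry of `targets` are assumed to be dicts with position
    (x, y) in metres, heading in radians, and length/width in metres.
    The image only encodes poses and relative positions, as described above.
    """
    img = Image.new("L", (size, size), 0)
    draw = ImageDraw.Draw(img)

    def rectangle_corners(v):
        # Centre of the vehicle in pixel coordinates, relative to the ego vehicle.
        cx = size / 2 + (v["x"] - ego["x"]) * scale
        cy = size / 2 - (v["y"] - ego["y"]) * scale
        c, s = math.cos(v["heading"]), math.sin(v["heading"])
        hl, hw = v["length"] / 2 * scale, v["width"] / 2 * scale
        return [(cx + c * dx - s * dy, cy - (s * dx + c * dy))
                for dx, dy in [(hl, hw), (hl, -hw), (-hl, -hw), (-hl, hw)]]

    draw.polygon(rectangle_corners(ego), fill=255)   # autonomous vehicle
    for v in targets:                                # surrounding target vehicles
        draw.polygon(rectangle_corners(v), fill=128)
    return img
```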
Alternatively, let x_i denote a first scene sequence, image denote the target image, state denote the state data and relative position data of the target vehicles around the autonomous vehicle, and action denote the state data of the autonomous vehicle. Then x_i = {(image, {state, action})_1, (image, {state, action})_2, ..., (image, {state, action})_t}, where the first scene sequence includes sample scene data corresponding to t moments: (image, {state, action})_1 denotes the sample scene data corresponding to the first moment of the first time period, (image, {state, action})_2 the sample scene data corresponding to the second moment, and (image, {state, action})_t the sample scene data corresponding to the t-th moment. Optionally, with n denoting the number of first driving scenes, the server trains the scene recognition model through the scene sequences of a plurality of first driving scenes, and the training set D_1 can be represented as D_1 = {X_1, ..., X_n}.
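The per-moment sample structure described by this notation can be illustrated, under assumed container types, as follows:

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class MomentSample:
    """Sample scene data for one moment: (image, {state, action})."""
    image: Any                 # target image drawn for this moment
    state: Dict[str, float]    # state and relative position data of the target vehicles
    action: Dict[str, float]   # state data of the autonomous vehicle

# A scene sequence x_i is the ordered list of per-moment samples over the
# first time period, and the training set D1 collects the sequences of the
# n first driving scenes. Names follow the notation above; the container
# types are assumptions made for illustration.
SceneSequence = List[MomentSample]
TrainingSet = List[SceneSequence]
```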
In the embodiment of the application, the scene sequence used for training the scene recognition model includes sample scene data corresponding to a plurality of moments with a temporal order, and the sample scene data include the state data of the autonomous vehicle, the state data of the target vehicles around it, and the relative position data of the target vehicles with respect to the autonomous vehicle. The scene features extracted by the scene recognition model therefore take into account the behaviours of the autonomous vehicle and the surrounding target vehicles and their position changes over these moments, and can accurately reflect their interaction. Performing scene recognition according to such scene features thus ensures the accuracy of scene recognition.
In a possible implementation manner, the method for obtaining a second scene sequence of a first driving scene by a server invoking a scene feature extraction network and a scene prediction network and performing scene prediction based on the first scene sequence of the first driving scene includes: the server calls a scene feature extraction network to extract features of the first scene sequence to obtain first scene features; and calling a scene prediction network, and performing scene prediction based on the first scene characteristics to obtain a second scene sequence.
Optionally, the scene feature extraction network is an encoder, and the scene prediction network is a decoder. The structures of the encoder and the decoder are arbitrary. For example, the encoder and the decoder are both LSTM (Long Short-Term Memory) networks, or both MLPs (Multi-Layer Perceptrons), or the encoder is a Transformer (a neural network model) and the decoder is an LSTM, which is not limited in this embodiment.
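As a non-limiting sketch of one of these options — an LSTM encoder as the scene feature extraction network h_θ and an LSTM decoder as the scene prediction network f — the following PyTorch code is illustrative only; the feature dimension, the decoding scheme that repeats the scene feature at every future step, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class SceneFeatureExtractor(nn.Module):
    """LSTM encoder h_theta: maps a scene sequence to a scene feature z_t."""
    def __init__(self, input_dim, feature_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, feature_dim, batch_first=True)

    def forward(self, scene_sequence):        # (batch, T, input_dim)
        _, (h_n, _) = self.lstm(scene_sequence)
        return h_n[-1]                        # (batch, feature_dim) scene feature

class ScenePredictionNetwork(nn.Module):
    """LSTM decoder f: predicts the scene data of future moments from z_t."""
    def __init__(self, feature_dim, output_dim, horizon):
        super().__init__()
        self.horizon = horizon
        self.lstm = nn.LSTM(feature_dim, feature_dim, batch_first=True)
        self.proj = nn.Linear(feature_dim, output_dim)

    def forward(self, scene_feature):         # (batch, feature_dim)
        # Feed the scene feature at every future step (an assumed decoding scheme).
        steps = scene_feature.unsqueeze(1).repeat(1, self.horizon, 1)
        out, _ = self.lstm(steps)
        return self.proj(out)                 # (batch, horizon, output_dim): second scene sequence
```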
202. The server trains a scene feature extraction network and a scene prediction network based on the second scene sequence and a third scene sequence of the first driving scene.
The third scene sequence includes sample scene data corresponding to at least one time instant within the second time period.
In one possible implementation manner, the training of the scene feature extraction network and the scene prediction network by the server based on the second scene sequence and the third scene sequence of the first driving scene includes: the server determines a first loss value based on the second scene sequence and the third scene sequence, wherein the first loss value is used for representing the similarity between the second scene sequence and the third scene sequence; based on the first loss value, a scene feature extraction network and a scene prediction network are trained.
The first loss value is negatively correlated with the similarity between the second scene sequence and the third scene sequence: the smaller the first loss value, the greater the similarity between the second scene sequence and the third scene sequence, that is, the higher the prediction accuracy of the scene feature extraction network and the scene prediction network. Correspondingly, the server trains the scene feature extraction network and the scene prediction network based on the first loss value as follows: the server adjusts the model parameters of the scene feature extraction network and the scene prediction network so that the first loss value determined based on the adjusted networks becomes smaller. Optionally, when the first loss value is smaller than a reference loss value, the server determines that training of the scene feature extraction network and the scene prediction network is complete.
Alternatively, the first loss value is determined by the following formula (2).
    Loss = Σ_t ( ‖a_t − â_t‖_F + ‖s_t − ŝ_t‖_F )        (2)

where Loss is the first loss value, a_t is the state data of the autonomous vehicle in the third scene sequence, â_t is the predicted state data of the autonomous vehicle in the second scene sequence, s_t is the state data and relative position data of the target vehicle in the third scene sequence, ŝ_t is the predicted state data and relative position data of the target vehicle in the second scene sequence, and ‖·‖_F denotes the norm.
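A single training step of this first stage could look as follows; the mean-squared-error form used here stands in for formula (2) and, together with the optimizer choice, is an assumption for illustration.

```python
import torch.nn.functional as F

def first_stage_step(encoder, predictor, optimizer, first_seq, third_seq):
    """One training step of the first stage (illustrative sketch).

    `first_seq` is the first scene sequence (batch, T, input_dim); `third_seq`
    holds the actual scene data of the second time period
    (batch, horizon, output_dim). The MSE form of the first loss is assumed.
    """
    scene_feature = encoder(first_seq)                  # z_t
    second_seq = predictor(scene_feature)               # predicted scene data
    first_loss = F.mse_loss(second_seq, third_seq)      # smaller = more similar
    optimizer.zero_grad()
    first_loss.backward()
    optimizer.step()
    return first_loss.item()
```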
It should be noted that there may be one or more first driving scenes. When there are multiple first driving scenes, their scene data differ, and the server trains the scene feature extraction network and the scene prediction network sequentially on the scene data of each of the multiple first driving scenes, which improves the accuracy of the two networks. The server trains the scene feature extraction network and the scene prediction network with the scene data of each first driving scene in the same manner.
203. And the server calls the trained scene feature extraction network and the scene classification network, and carries out scene classification based on the scene sequence of the second driving scene to obtain a prediction class label.
The scene classification network is used for determining the scene category of the driving scene based on the scene characteristics of the scene data corresponding to the driving scene.
The prediction category label is used to represent a scene category of the second driving scene. The scene type is a scene type predicted by the scene feature extraction network and the scene classification network, and may be the same as or different from a scene type to which the second driving scene actually belongs.
Optionally, the prediction category label includes the probability that the second driving scene belongs to each of a plurality of scene categories, where the scene category with the highest probability is the scene category to which the second driving scene belongs. For example, suppose there are three scene categories, namely "parking", "narrow road passing" and "lane change overtaking", and the probability corresponding to "parking" is 0.2, the probability corresponding to "narrow road passing" is 0.5, and the probability corresponding to "lane change overtaking" is 0.3; then the scene category to which the second driving scene belongs is "narrow road passing". These scene categories are merely exemplary; in practice the scene categories can be set according to actual situations, for example including meeting, giving way, passing through a traffic light, turning, and the like, which is not limited in the embodiment of the present application.
In a possible implementation manner, the server invokes the trained scene feature extraction network and the scene classification network, and performs scene classification based on the scene sequence of the second driving scene to obtain the prediction category label, including: the server calls the trained scene feature extraction network to extract features of a scene sequence of a second driving scene to obtain second scene features; and calling a scene classification network, and carrying out scene classification based on the second scene characteristics to obtain a prediction class label.
Optionally, the scene classification network is a decoder. The structure of the decoder is arbitrary. For example, the decoders are LSTM, MLP, etc., which are not limited in this application.
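For illustration, a small MLP realisation of the scene classification network g_φ is sketched below; the hidden width and the number of scene categories are assumed values.

```python
import torch.nn as nn

class SceneClassificationNetwork(nn.Module):
    """MLP decoder g_phi: maps a scene feature to scene-category scores."""
    def __init__(self, feature_dim=128, num_categories=3, hidden_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feature_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_categories),
        )

    def forward(self, scene_feature):
        # Logits; a softmax over them gives the prediction category label.
        return self.mlp(scene_feature)
```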
204. The server trains a scene classification network based on the scene class labels and the prediction class labels of the second driving scene.
The scene category label is used to represent the scene category to which the second driving scene actually belongs. Optionally, the scene category label includes the probability that the second driving scene belongs to each of a plurality of scene categories, where the scene category with the highest probability is the scene category to which the second driving scene belongs. For example, with the three scene categories "parking", "narrow road passing" and "lane change overtaking", if the probability corresponding to "parking" is 0, the probability corresponding to "narrow road passing" is 1, and the probability corresponding to "lane change overtaking" is 0, then the scene category to which the second driving scene belongs is "narrow road passing". Optionally, the scene category label is manually annotated.
In one possible implementation, the training, by the server, the scene classification network based on the scene class label and the prediction class label of the second driving scene includes: the server determines a second loss value based on the scene category label and the prediction category label, wherein the second loss value is used for representing the similarity between the scene category label and the prediction category label; based on the second loss value, training a scene classification network.
The second loss value is negatively correlated with the similarity between the scene category label and the prediction category label: the smaller the second loss value, the greater the similarity between the scene category label and the prediction category label, that is, the higher the accuracy of the scene feature extraction network and the scene classification network. Correspondingly, the server trains the scene classification network based on the second loss value as follows: the server adjusts the model parameters of the scene classification network so that the second loss value determined based on the adjusted scene classification network becomes smaller. Optionally, when the second loss value is smaller than a reference loss value, the server determines that training of the scene classification network is complete.
Because the scene feature extraction network has learned to accurately extract scene features through training on the scene data of the first driving scene, it does not need to be trained again when training on the scene data of the second driving scene; only the scene classification network needs to be trained.
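A single training step of this second stage could then look as follows, with the trained scene feature extraction network kept fixed; the cross-entropy form of the second loss and the assumption that the optimizer holds only the classifier's parameters are illustrative choices.

```python
import torch
import torch.nn.functional as F

def second_stage_step(encoder, classifier, optimizer, scene_seq, scene_label):
    """One training step of the second stage (illustrative sketch).

    `scene_label` holds the annotated scene category indices of the second
    driving scenes; `optimizer` is assumed to hold only the classifier's
    parameters, so the feature extractor is not updated.
    """
    with torch.no_grad():                      # feature extractor is not retrained
        scene_feature = encoder(scene_seq)
    logits = classifier(scene_feature)         # prediction category label (as logits)
    second_loss = F.cross_entropy(logits, scene_label)
    optimizer.zero_grad()
    second_loss.backward()
    optimizer.step()
    return second_loss.item()
```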
205. The server acquires a scene recognition model, wherein the scene recognition model comprises a trained scene feature extraction network and a trained scene classification network.
Through the training process, the scene feature extraction network can accurately extract the scene features in the scene data of the driving scene, and the scene classification network can accurately determine the scene category of the driving scene based on the scene features, so that the scene identification model formed by the scene feature extraction network and the scene classification network can accurately identify the scene category of the driving scene.
206. And the server calls the scene recognition model to recognize the scene category of any driving scene.
In one possible implementation manner, the server invokes a scene recognition model to recognize a scene category of any driving scene, including: the method comprises the steps that a server obtains a scene sequence of any driving scene; and calling a scene recognition model, and carrying out scene classification based on the scene sequence of the driving scene to obtain a scene category label of the driving scene.
Wherein the scene sequence of any driving scene comprises: and scene data corresponding to the driving scene at least one moment in any time period. The scene data corresponding to any time comprises state data corresponding to the automatic driving vehicle at the time in the driving scene, state data corresponding to target vehicles around the automatic driving vehicle at the time, and relative position data of the target vehicles by taking the automatic driving vehicle as a reference.
Optionally, the server invokes a scene recognition model, performs scene classification based on a scene sequence of the driving scene, and obtains a scene category label of the driving scene, including: the server calls a scene feature extraction network in the scene recognition model to extract features of a scene sequence of the driving scene to obtain scene features; and calling a scene classification network in the scene recognition model, and carrying out scene classification based on the scene characteristics to obtain a prediction class label. The prediction category label is used to represent a scene category of the driving scene identified by the scene identification model.
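Put together, inference with the scene recognition model amounts to the following sketch; the category names and tensor shapes are assumptions for illustration.

```python
import torch

def recognize_scene(encoder, classifier, scene_sequence, category_names):
    """Recognize the scene category of any driving scene (illustrative sketch).

    `scene_sequence` is the scene sequence of the driving scene, shaped
    (1, T, input_dim); `category_names` is an assumed list such as
    ["parking", "narrow road passing", "lane change overtaking"].
    """
    with torch.no_grad():
        scene_feature = encoder(scene_sequence)
        probabilities = torch.softmax(classifier(scene_feature), dim=-1)
    category_index = int(probabilities.argmax(dim=-1))
    return category_names[category_index], probabilities.squeeze(0).tolist()
```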
The scene recognition model is obtained by training based on a first training sample and a second training sample, wherein the first training sample comprises a first scene sequence and a third scene sequence of a first driving scene, the first scene sequence comprises sample scene data corresponding to at least one moment in a first time period, the third scene sequence comprises sample scene data corresponding to at least one moment in a second time period, and the first time period is earlier than the second time period. The second training sample includes a scene sequence of the second driving scenario and a scene category label of the second driving scenario.
It should be noted that step 206 is performed after the scene recognition model has been obtained through steps 201-205.
Referring to fig. 7, fig. 7 is a schematic diagram of a training process of a scene recognition model. The training process of the scene recognition model is divided into two stages, in the first training stage, a first scene sequence of a first driving scene without a scene category label is input into a scene feature extraction network, the scene feature extraction network and the scene prediction network are called, and scene prediction is carried out based on the first scene sequence to obtain a second scene sequence of the first driving scene. Then, a scene feature extraction network and a scene prediction network are trained based on the first scene sequence and the second scene sequence. In the second training stage, a scene sequence of the second driving scene marked with a scene class label is input into the trained scene feature extraction network, the trained scene feature extraction network and the trained scene classification network are called, scene classification is carried out based on the scene sequence to obtain a prediction class label, then the scene classification network is trained based on the scene class label and the prediction class label of the second driving scene, and the trained scene feature extraction network and the trained scene classification network form a trained scene recognition model.
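The two training stages of fig. 7 can be orchestrated, for illustration, with the step functions sketched above; the optimizer, epoch count and learning rate are assumed.

```python
import torch

def train_scene_recognition_model(encoder, predictor, classifier,
                                  stage1_batches, stage2_batches,
                                  epochs=10, lr=1e-3):
    """Two-stage training of the scene recognition model (illustrative sketch).

    `stage1_batches` yields (first_seq, third_seq) pairs of unlabelled first
    driving scenes; `stage2_batches` yields (scene_seq, scene_label) pairs of
    labelled second driving scenes. Uses first_stage_step / second_stage_step
    from the sketches above.
    """
    # First training stage: scene prediction trains the feature extractor.
    opt1 = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=lr)
    for _ in range(epochs):
        for first_seq, third_seq in stage1_batches:
            first_stage_step(encoder, predictor, opt1, first_seq, third_seq)

    # Second training stage: only the scene classification network is updated.
    opt2 = torch.optim.Adam(classifier.parameters(), lr=lr)
    for _ in range(epochs):
        for scene_seq, scene_label in stage2_batches:
            second_stage_step(encoder, classifier, opt2, scene_seq, scene_label)

    return encoder, classifier   # together they form the scene recognition model
```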
Referring to fig. 8, fig. 8 is a schematic diagram of the data processing procedure of the scene feature extraction network and the scene prediction network in the first training stage. In fig. 8, h_θ denotes the scene feature extraction network, z_t denotes the scene feature, and f denotes the scene prediction network; the outputs of the scene prediction network are the predicted state data of the autonomous vehicle in the second scene sequence and the predicted state data and relative position data of the target vehicle in the second scene sequence. x_i denotes the first scene sequence corresponding to the first driving scene within the first time period; the duration of the first time period is T, t denotes the end time of the first time period, and t-T denotes the start time of the first time period. The scene feature extraction network performs feature extraction on the first scene sequence to obtain the scene feature, and the scene prediction network performs scene prediction based on the scene feature to obtain the second scene sequence. The second scene sequence includes the predicted scene data corresponding to at least one time within the second time period, namely the predicted state data of the autonomous vehicle and the predicted state data and relative position data of the target vehicle.
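The data flow of fig. 8, in which a first scene sequence over the window [t-T, t] is encoded into the scene feature z_t and then decoded into the predicted ego-vehicle states and the predicted target-vehicle states and relative positions, might look as follows; the tensor sizes and the split into two prediction heads are illustrative assumptions rather than details given in this application.

```python
# Forward pass of fig. 8 (illustrative shapes; the feature sizes and the split
# into ego / target prediction heads are assumptions, not from the patent).
import torch
import torch.nn as nn

B, T_IN, T_OUT = 8, 20, 10      # batch, length of first time period, length of second
EGO, OBJ, FEAT = 4, 6, 128      # assumed ego-state, target-state(+position), feature sizes

encoder = nn.GRU(EGO + OBJ, FEAT, batch_first=True)     # h_theta
ego_head = nn.Linear(FEAT, T_OUT * EGO)                 # part of f: predicted ego states
obj_head = nn.Linear(FEAT, T_OUT * OBJ)                 # part of f: target states + relative positions

x = torch.randn(B, T_IN, EGO + OBJ)    # first scene sequence over [t-T, t]
_, h_n = encoder(x)
z_t = h_n[-1]                          # scene feature z_t: (B, FEAT)

ego_pred = ego_head(z_t).view(B, T_OUT, EGO)   # predicted ego states over the second period
obj_pred = obj_head(z_t).view(B, T_OUT, OBJ)   # predicted target states and relative positions
print(z_t.shape, ego_pred.shape, obj_pred.shape)
```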
referring to fig. 9, fig. 9 is a schematic diagram of a data processing procedure of the scene feature extraction network and the scene classification network in the training process of the second stage. Wherein h isθExtracting networks for scene features, ztAs a scene feature, gφScene classification network, xiAnd the scene sequence corresponds to the second driving scene in any time period, the time length corresponding to the time period is T, T represents the ending time of the time period, and T-T represents the starting time of the time period. And the scene classification network performs scene classification based on the scene characteristics to obtain a prediction category label.
In the embodiment of the application, the category of a driving scene is not recognized through manually designed conditions. Instead, the scene recognition model is trained with sample scene data so that it learns the scene features of different driving scenes and can therefore distinguish between them; this removes the limitation of human experience and improves the accuracy of scene recognition. Moreover, when the scene recognition model is trained, the scene feature extraction network and the scene classification network are trained in two stages. In the first training stage, the scene feature extraction network is trained with unlabeled data of the first driving scene, so that the model learns to extract scene features; in the second training stage, the scene classification network is trained with data of the second driving scene annotated with a scene category label, so that the model learns to perform scene classification based on the scene features. This both ensures the training effect of the scene recognition model and reduces the amount of training data that needs to be annotated.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present application, and are not described herein again.
Fig. 10 is a block diagram of a scene recognition apparatus according to an embodiment of the present application. Referring to fig. 10, the embodiment includes:
a scene prediction module 1001 configured to invoke a scene feature extraction network and a scene prediction network, perform scene prediction based on a first scene sequence of a first driving scene to obtain a second scene sequence of the first driving scene, where the first scene sequence includes sample scene data corresponding to at least one time within a first time period, the second scene sequence includes predicted scene data corresponding to at least one time within a second time period, and the first time period is earlier than the second time period;
a first training module 1002 configured to train a scene feature extraction network and a scene prediction network based on the second scene sequence and a third scene sequence of the first driving scene, the third scene sequence including sample scene data corresponding to at least one time within the second time period;
the scene classification module 1003 is configured to call the trained scene feature extraction network and scene classification network, and perform scene classification based on a scene sequence of the second driving scene to obtain a prediction category label;
a second training module 1004 configured to train a scene classification network based on the scene class labels and the predicted class labels of the second driving scene;
a model obtaining module 1005 configured to obtain a scene recognition model, where the scene recognition model includes a trained scene feature extraction network and a trained scene classification network.
In a possible implementation manner, the scene prediction module 1001 is configured to invoke a scene feature extraction network, perform feature extraction on the first scene sequence, and obtain a first scene feature; and calling a scene prediction network, and performing scene prediction based on the first scene characteristics to obtain a second scene sequence.
In one possible implementation, the first training module 1002 is configured to determine a first loss value based on the second scene sequence and the third scene sequence, where the first loss value is used to represent a similarity between the second scene sequence and the third scene sequence; based on the first loss value, a scene feature extraction network and a scene prediction network are trained.
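A common realization of such a similarity-based loss between the predicted and observed future sequences is the mean-squared error; this is only one possible choice, since the embodiment does not fix the loss form.

```python
# Assumed first loss: mean-squared error between the second (predicted) and
# third (observed) scene sequences; lower values mean higher similarity.
import torch

def first_loss(second_seq: torch.Tensor, third_seq: torch.Tensor) -> torch.Tensor:
    return torch.mean((second_seq - third_seq) ** 2)
```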
In a possible implementation manner, the scene classification module 1003 is configured to invoke a trained scene feature extraction network, and perform feature extraction on a scene sequence of a second driving scene to obtain a second scene feature; and calling a scene classification network, and carrying out scene classification based on the second scene characteristics to obtain a prediction class label.
In one possible implementation, the second training module 1004 is configured to determine a second loss value based on the scene class label and the prediction class label, where the second loss value is used to represent a similarity between the scene class label and the prediction class label; based on the second loss value, training a scene classification network.
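For the second loss, the cross-entropy between the predicted class scores and the annotated scene category label is a standard choice; again an assumption rather than something this embodiment prescribes.

```python
# Assumed second loss: cross-entropy between predicted class logits and the
# annotated scene category label of the second driving scene.
import torch
import torch.nn.functional as F

def second_loss(pred_logits: torch.Tensor, scene_label: torch.Tensor) -> torch.Tensor:
    return F.cross_entropy(pred_logits, scene_label)
```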
In a possible implementation manner, in the first scene sequence, the sample scene data corresponding to any time includes: first state data corresponding to the autonomous vehicle at a time in a first driving scenario, second state data corresponding to target vehicles around the autonomous vehicle at the time, and relative position data of the target vehicles with the autonomous vehicle as a reference.
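The per-moment sample scene data described above can be pictured as a small record; the field names and dimensions below are hypothetical and chosen only to make the layout concrete.

```python
# Illustrative layout of the sample scene data at one moment; all field names
# and sizes are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SampleSceneData:
    ego_state: Tuple[float, ...]               # first state data of the autonomous vehicle
    target_states: List[Tuple[float, ...]]     # second state data, one entry per target vehicle
    relative_positions: List[Tuple[int, int]]  # (ego grid id, target grid id) per target vehicle
```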
In a possible implementation manner, the sample scene data corresponding to any time further includes a target image, and the apparatus further includes: and the image acquisition module is configured to draw a target image corresponding to the time based on the first state data, the second state data and the relative position data corresponding to the time, wherein the target image is used for representing the postures of the automatic driving vehicle and the target vehicle and the relative position relation of the automatic driving vehicle and the target vehicle at the time.
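Rendering the target image from these data could be done, for example, by rasterizing the vehicle footprints onto an ego-centred bird's-eye-view canvas; the PIL-based sketch below, including the canvas size, scale, and vehicle footprint, is one hypothetical realization rather than the drawing procedure of this application.

```python
# Sketch of rendering a target image (bird's-eye view) from vehicle poses.
# The canvas size, scale, and rectangle footprint are illustrative assumptions.
import math
from PIL import Image, ImageDraw

def draw_target_image(ego_pose, target_poses, size=256, scale=4.0):
    """ego_pose / target_poses: (x, y, heading) in metres/radians, ego-centred."""
    img = Image.new("L", (size, size), 0)
    draw = ImageDraw.Draw(img)

    def box(x, y, heading, length=4.5, width=2.0):
        cx, cy = size / 2 + x * scale, size / 2 - y * scale   # image coords, ego at centre
        c, s = math.cos(heading), math.sin(heading)
        pts = []
        for dx, dy in [(length / 2, width / 2), (length / 2, -width / 2),
                       (-length / 2, -width / 2), (-length / 2, width / 2)]:
            pts.append((cx + (dx * c - dy * s) * scale, cy - (dx * s + dy * c) * scale))
        return pts

    draw.polygon(box(*ego_pose), fill=255)          # autonomous (ego) vehicle
    for pose in target_poses:
        draw.polygon(box(*pose), fill=128)          # surrounding target vehicles
    return img

# Example: ego vehicle at the origin, one target vehicle 10 m ahead.
image = draw_target_image((0.0, 0.0, 0.0), [(10.0, 0.0, 0.0)])
```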
In one possible implementation, the apparatus further includes:
the position data acquisition module is configured to divide a target area where the automatic driving vehicle is located into 9 grids at any moment so that the automatic driving vehicle is located in the middle of the target area; determining grid identifications of 9 grids; determining a first grid identifier of a grid where the automatic driving vehicle is located and a second grid identifier of the grid where the target vehicle is located; the first grid identification and the second grid identification are determined as relative position data.
In one possible implementation, the target vehicle is a target number of vehicles around the autonomous vehicle, the target vehicle being a distance from the autonomous vehicle that is less than distances of other vehicles in the first driving scenario from the autonomous vehicle.
In one possible implementation, the apparatus further includes:
and the scene identification module is configured to call a scene identification model and identify the scene category of any driving scene.
In one possible implementation, the scene recognition module is configured to acquire a scene sequence of any driving scene; and calling a scene recognition model, and carrying out scene classification based on a scene sequence of the driving scene to obtain a scene category label of the driving scene.
In the embodiment of the application, the category of a driving scene is not recognized through manually designed conditions. Instead, the scene recognition model is trained with sample scene data so that it learns the scene features of different driving scenes and can therefore distinguish between them; this removes the limitation of human experience and improves the accuracy of scene recognition. Moreover, when the scene recognition model is trained, the scene feature extraction network and the scene classification network are trained in two stages. In the first training stage, the scene feature extraction network is trained with unlabeled data of the first driving scene, so that the model learns to extract scene features; in the second training stage, the scene classification network is trained with data of the second driving scene annotated with a scene category label, so that the model learns to perform scene classification based on the scene features. This both ensures the training effect of the scene recognition model and reduces the amount of training data that needs to be annotated.
It should be noted that: in the scene recognition apparatus provided in the above embodiment, only the division of the functional modules is illustrated when performing the scene recognition, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the computer device may be divided into different functional modules to complete all or part of the functions described above. In addition, the scene recognition device and the scene recognition method provided by the embodiments belong to the same concept, and specific implementation processes thereof are detailed in the method embodiments and are not described herein again.
The embodiment of the present application further provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one program code, and the at least one program code is loaded and executed by the processor, so as to implement the operations executed in the scene recognition method according to the foregoing embodiment.
Optionally, the computer device is provided as a terminal. Fig. 11 shows a block diagram of a terminal 1100 according to an exemplary embodiment of the present application. The terminal 1100 may be a smart phone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a notebook computer, or a desktop computer. Terminal 1100 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and so forth.
The terminal 1100 includes: a processor 1101 and a memory 1102.
Processor 1101 may include one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1101 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1101 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1101 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and rendering content that the display screen needs to display. In some embodiments, the processor 1101 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
Memory 1102 may include one or more computer-readable storage media, which may be non-transitory. Memory 1102 can also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1102 is used to store at least one program code for execution by the processor 1101 to implement the scene recognition methods provided by the method embodiments herein.
In some embodiments, the terminal 1100 may further include: a peripheral interface 1103 and at least one peripheral. The processor 1101, memory 1102 and peripheral interface 1103 may be connected by a bus or signal lines. Various peripheral devices may be connected to the peripheral interface 1103 by buses, signal lines, or circuit boards. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1104, display screen 1105, camera assembly 1106, audio circuitry 1107, positioning assembly 1108, and power supply 1109.
The peripheral interface 1103 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1101 and the memory 1102. In some embodiments, the processor 1101, memory 1102, and peripheral interface 1103 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1101, the memory 1102 and the peripheral device interface 1103 may be implemented on separate chips or circuit boards, which is not limited by this embodiment.
The Radio Frequency circuit 1104 is used to receive and transmit RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1104 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1104 converts an electric signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1104 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1104 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1104 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1105 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1105 is a touch display screen, the display screen 1105 also has the ability to capture touch signals on or over the surface of the display screen 1105. The touch signal may be input to the processor 1101 as a control signal for processing. At this point, the display screen 1105 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display 1105 may be one, providing the front panel of terminal 1100; in other embodiments, the display screens 1105 can be at least two, respectively disposed on different surfaces of the terminal 1100 or in a folded design; in other embodiments, display 1105 can be a flexible display disposed on a curved surface or on a folded surface of terminal 1100. Even further, the display screen 1105 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display screen 1105 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
Camera assembly 1106 is used to capture images or video. Optionally, camera assembly 1106 includes a front camera and a rear camera. The front camera is arranged on the front panel of the terminal, and the rear camera is arranged on the back of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1106 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1107 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 1101 for processing or inputting the electric signals to the radio frequency circuit 1104 to achieve voice communication. For stereo capture or noise reduction purposes, multiple microphones may be provided, each at a different location of terminal 1100. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1101 or the radio frequency circuit 1104 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1107 may also include a headphone jack.
Positioning component 1108 is used to locate the current geographic position of terminal 1100 for purposes of navigation or LBS (Location Based Service). The positioning component 1108 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
Power supply 1109 is configured to provide power to various components within terminal 1100. The power supply 1109 may be alternating current, direct current, disposable or rechargeable. When the power supply 1109 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1100 can also include one or more sensors 1110. The one or more sensors 1110 include, but are not limited to: acceleration sensor 1111, gyro sensor 1112, pressure sensor 1113, fingerprint sensor 1114, optical sensor 1115, and proximity sensor 1116.
Acceleration sensor 1111 may detect acceleration levels in three coordinate axes of a coordinate system established with terminal 1100. For example, the acceleration sensor 1111 may be configured to detect components of the gravitational acceleration in three coordinate axes. The processor 1101 may control the display screen 1105 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1111. The acceleration sensor 1111 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1112 may detect a body direction and a rotation angle of the terminal 1100, and the gyro sensor 1112 may cooperate with the acceleration sensor 1111 to acquire a 3D motion of the user with respect to the terminal 1100. From the data collected by gyroscope sensor 1112, processor 1101 may implement the following functions: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensor 1113 may be disposed on a side bezel of terminal 1100 and/or underlying display screen 1105. When the pressure sensor 1113 is disposed on the side frame of the terminal 1100, the holding signal of the terminal 1100 from the user can be detected, and the processor 1101 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 1113. When the pressure sensor 1113 is disposed at the lower layer of the display screen 1105, the processor 1101 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1105. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1114 is configured to collect a fingerprint of the user, and the processor 1101 identifies the user according to the fingerprint collected by the fingerprint sensor 1114, or the fingerprint sensor 1114 identifies the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the user is authorized by the processor 1101 to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. Fingerprint sensor 1114 may be disposed on the front, back, or side of terminal 1100. When a physical button or vendor Logo is provided on the terminal 1100, the fingerprint sensor 1114 may be integrated with the physical button or vendor Logo.
Optical sensor 1115 is used to collect ambient light intensity. In one embodiment, the processor 1101 may control the display brightness of the display screen 1105 based on the ambient light intensity collected by the optical sensor 1115. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1105 is increased; when the ambient light intensity is low, the display brightness of the display screen 1105 is reduced. In another embodiment, processor 1101 may also dynamically adjust the shooting parameters of camera assembly 1106 based on the ambient light intensity collected by optical sensor 1115.
A proximity sensor 1116, also referred to as a distance sensor, is provided on the front panel of terminal 1100. The proximity sensor 1116 is used to detect the distance between the user and the front face of terminal 1100. In one embodiment, when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal 1100 gradually decreases, the processor 1101 controls the display screen 1105 to switch from the screen-on state to the screen-off state; when the proximity sensor 1116 detects that the distance between the user and the front face of the terminal 1100 gradually increases, the processor 1101 controls the display screen 1105 to switch from the screen-off state to the screen-on state.
Those skilled in the art will appreciate that the configuration shown in fig. 11 does not constitute a limitation of terminal 1100, and may include more or fewer components than those shown, or may combine certain components, or may employ a different arrangement of components.
Optionally, the computer device is provided as a server. Fig. 12 is a schematic structural diagram of a server 1200 according to an embodiment of the present application, where the server 1200 may generate a relatively large difference due to a difference in configuration or performance, and may include one or more processors (CPUs) 1201 and one or more memories 1202, where at least one program code is stored in the memory 1202, and the at least one program code is loaded and executed by the processors 1201 to implement the scene recognition method provided by each method embodiment. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface, so as to perform input/output, and the server may also include other components for implementing the functions of the device, which are not described herein again.
The embodiment of the present application further provides a computer-readable storage medium, where at least one program code is stored in the computer-readable storage medium, and the at least one program code is loaded and executed by a processor to implement the operations performed in the scene recognition method according to the foregoing embodiment.
The embodiment of the present application further provides a computer program, where at least one program code is stored in the computer program, and the at least one program code is loaded and executed by a processor, so as to implement the operations executed in the scene identification method according to the foregoing embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (14)

1. A method for scene recognition, the method comprising:
calling a scene feature extraction network and a scene prediction network, and performing scene prediction based on a first scene sequence of a first driving scene to obtain a second scene sequence of the first driving scene, wherein the first scene sequence comprises sample scene data corresponding to at least one moment in a first time period, the second scene sequence comprises predicted scene data corresponding to at least one moment in a second time period, and the first time period is earlier than the second time period;
training the scene feature extraction network and the scene prediction network based on the second scene sequence and a third scene sequence of the first driving scene, the third scene sequence including sample scene data corresponding to at least one moment in the second time period;
calling the trained scene feature extraction network and scene classification network, and carrying out scene classification based on a scene sequence of a second driving scene to obtain a prediction class label;
training the scene classification network based on the scene class label and the prediction class label of the second driving scene;
and acquiring a scene recognition model, wherein the scene recognition model comprises the trained scene feature extraction network and the trained scene classification network.
2. The method of claim 1, wherein invoking the scene feature extraction network and the scene prediction network to perform scene prediction based on a first scene sequence of a first driving scene to obtain a second scene sequence of the first driving scene comprises:
calling the scene feature extraction network, and performing feature extraction on the first scene sequence to obtain a first scene feature;
and calling the scene prediction network, and performing scene prediction based on the first scene characteristics to obtain the second scene sequence.
3. The method of claim 1, wherein training the scene feature extraction network and the scene prediction network based on the second scene sequence and a third scene sequence of the first driving scene comprises:
determining a first loss value based on the second scene sequence and the third scene sequence, wherein the first loss value is used for representing the similarity between the second scene sequence and the third scene sequence;
training the scene feature extraction network and the scene prediction network based on the first loss value.
4. The method of claim 1, wherein the invoking the trained scene feature extraction network and scene classification network to perform scene classification based on a scene sequence of a second driving scene to obtain a prediction class label comprises:
calling the trained scene feature extraction network, and performing feature extraction on the scene sequence of the second driving scene to obtain second scene features;
and calling the scene classification network, and carrying out scene classification based on the second scene characteristics to obtain the prediction class label.
5. The method of claim 1, wherein training the scene classification network based on the scene class label and the prediction class label for the second driving scene comprises:
determining a second loss value based on the scene category label and the prediction category label, wherein the second loss value is used for representing the similarity between the scene category label and the prediction category label;
training the scene classification network based on the second loss value.
6. The method of claim 1, wherein the sample scene data corresponding to any time in the first scene sequence comprises: first state data corresponding to the time of the autonomous vehicle in the first driving scene, second state data corresponding to target vehicles around the autonomous vehicle at the time, and relative position data of the target vehicles with the autonomous vehicle as a reference.
7. The method of claim 6, wherein the sample scene data corresponding to the any one time instance further comprises a target image, the method further comprising:
drawing a target image corresponding to the time based on the first state data, the second state data and the relative position data corresponding to the time, wherein the target image is used for representing the postures of the automatic driving vehicle and the target vehicle and the relative position relation of the automatic driving vehicle and the target vehicle at the time.
8. The method of claim 6, further comprising:
for any moment, dividing a target area where the automatic driving vehicle is located into 9 grids so that the automatic driving vehicle is located in the middle grid of the target area;
determining grid identifications of the 9 grids;
determining a first grid identification of a grid where the autonomous vehicle is located and a second grid identification of a grid where the target vehicle is located;
determining the first grid identification and the second grid identification as the relative position data.
9. The method of claim 6, wherein the target vehicle is a target number of vehicles around the autonomous vehicle, the target vehicle being a distance from the autonomous vehicle that is less than distances of other vehicles in the first driving scenario from the autonomous vehicle.
10. The method of claim 1, further comprising:
and calling the scene recognition model to recognize the scene category of any driving scene.
11. The method of claim 10, wherein said invoking said scene recognition model to recognize a scene category for any driving scene comprises:
acquiring a scene sequence of any driving scene;
and calling the scene recognition model, and carrying out scene classification based on the scene sequence of the driving scene to obtain a scene category label of the driving scene.
12. A scene recognition apparatus, characterized in that the apparatus comprises:
a scene prediction module, configured to invoke a scene feature extraction network and a scene prediction network, perform scene prediction based on a first scene sequence of a first driving scene to obtain a second scene sequence of the first driving scene, where the first scene sequence includes sample scene data corresponding to at least one time within a first time period, the second scene sequence includes predicted scene data corresponding to at least one time within a second time period, and the first time period is earlier than the second time period;
a first training module configured to train the scene feature extraction network and the scene prediction network based on the second scene sequence and a third scene sequence of the first driving scene, the third scene sequence including sample scene data corresponding to at least one time within the second time period;
the scene classification module is configured to call the trained scene feature extraction network and scene classification network, and perform scene classification based on a scene sequence of a second driving scene to obtain a prediction category label;
a second training module configured to train the scene classification network based on the scene class label and the prediction class label of the second driving scene;
a model obtaining module configured to obtain a scene recognition model, where the scene recognition model includes the trained scene feature extraction network and the trained scene classification network.
13. A computer device, characterized in that the computer device comprises a processor and a memory, in which at least one program code is stored, which is loaded and executed by the processor to implement the operations performed by the scene recognition method according to any of the claims 1 to 11.
14. A computer-readable storage medium, having at least one program code stored therein, the program code being loaded and executed by a processor to perform operations performed by the scene recognition method according to any one of claims 1 to 11.
CN202110674250.3A 2021-06-17 2021-06-17 Scene recognition method, device, equipment and storage medium Active CN113239901B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110674250.3A CN113239901B (en) 2021-06-17 2021-06-17 Scene recognition method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113239901A true CN113239901A (en) 2021-08-10
CN113239901B CN113239901B (en) 2022-09-27

Family

ID=77140261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110674250.3A Active CN113239901B (en) 2021-06-17 2021-06-17 Scene recognition method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113239901B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520238A (en) * 2018-04-10 2018-09-11 东华大学 A kind of scene prediction method of the night vision image based on depth prediction coding network
CN109255364A (en) * 2018-07-12 2019-01-22 杭州电子科技大学 A kind of scene recognition method generating confrontation network based on depth convolution
CN109614517A (en) * 2018-12-04 2019-04-12 广州市百果园信息技术有限公司 Classification method, device, equipment and the storage medium of video
CN111898173A (en) * 2019-05-06 2020-11-06 达索系统公司 Empirical learning in virtual worlds
CN112109717A (en) * 2019-06-19 2020-12-22 商汤集团有限公司 Intelligent driving control method and device and electronic equipment
CN112837344A (en) * 2019-12-18 2021-05-25 沈阳理工大学 Target tracking method for generating twin network based on conditional confrontation
CN111241943A (en) * 2019-12-31 2020-06-05 浙江大学 Scene recognition and loopback detection method based on background target detection and triple loss in automatic driving scene
CN111242044A (en) * 2020-01-15 2020-06-05 东华大学 Night unmanned vehicle scene prediction method based on ConvLSTM dual-channel coding network
CN111612820A (en) * 2020-05-15 2020-09-01 北京百度网讯科技有限公司 Multi-target tracking method, and training method and device of feature extraction model
CN111797762A (en) * 2020-07-02 2020-10-20 北京灵汐科技有限公司 Scene recognition method and system
CN112183577A (en) * 2020-08-31 2021-01-05 华为技术有限公司 Training method of semi-supervised learning model, image processing method and equipment
CN112487907A (en) * 2020-11-23 2021-03-12 北京理工大学 Dangerous scene identification method and system based on graph classification

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
WILLIAM LOTTER ET AL: "DEEP PREDICTIVE CODING NETWORKS FOR VIDEO PREDICTION AND UNSUPERVISED LEARNING", ARXIV:1605.08104V5 *
LIU ZHICHAO: "Exploration of Video Frame Prediction Algorithms", Computer Knowledge and Technology *

Also Published As

Publication number Publication date
CN113239901B (en) 2022-09-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant