WO2022099510A1 - Object identification method and apparatus, computer device, and storage medium - Google Patents


Info

Publication number: WO2022099510A1
Application number: PCT/CN2020/128125
Authority: WIPO (PCT)
Prior art keywords: point cloud, image, feature, target, features
Other languages: French (fr), Chinese (zh)
Inventor: 张磊杰
Original Assignee: 深圳元戎启行科技有限公司
Application filed by 深圳元戎启行科技有限公司
Priority applications: CN202080092994.8A (CN115004259B); PCT/CN2020/128125
Publication of WO2022099510A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition

Description

  • The present application relates to an object recognition method and apparatus, a computer device, and a storage medium.
  • Self-driving cars are intelligent vehicles that achieve unmanned driving through a computer system.
  • The computer system automatically and safely controls the vehicle without active human operation.
  • While an autonomous vehicle is driving, it must detect obstacles along the way and avoid them in time.
  • The inventor has realized that current obstacle identification methods cannot identify obstacles accurately, resulting in a low obstacle avoidance capability and thus low safety for the autonomous vehicle.
  • Accordingly, an object recognition method, an apparatus, a computer device, and a storage medium are provided.
  • An object recognition method includes: acquiring a current scene image and a current scene point cloud corresponding to a target moving object; performing image feature extraction on the current scene image to obtain initial image features, and performing point cloud feature extraction on the current scene point cloud to obtain initial point cloud features; acquiring a target image position corresponding to the current scene image, and fusing the initial image features based on the point cloud features, among the initial point cloud features, that correspond to the target image position, to obtain target image features; acquiring a target point cloud position corresponding to the current scene point cloud, and fusing the initial point cloud features based on the image features, among the initial image features, that correspond to the target point cloud position, to obtain target point cloud features; determining an object position corresponding to a scene object based on the target image features and the target point cloud features; and controlling the target moving object to move based on the position corresponding to the scene object.
  • An object recognition device includes:
  • a current scene image acquisition module, configured to acquire a current scene image and a current scene point cloud corresponding to a target moving object;
  • an initial point cloud feature obtaining module, configured to perform image feature extraction on the current scene image to obtain initial image features, and to perform point cloud feature extraction on the current scene point cloud to obtain initial point cloud features;
  • a target image feature obtaining module, configured to obtain a target image position corresponding to the current scene image, and to fuse the initial image features based on the point cloud features, among the initial point cloud features, that correspond to the target image position, to obtain target image features;
  • a target point cloud feature obtaining module, configured to obtain a target point cloud position corresponding to the current scene point cloud, and to fuse the initial point cloud features based on the image features, among the initial image features, that correspond to the target point cloud position, to obtain target point cloud features;
  • a position determination module, configured to determine an object position corresponding to a scene object based on the target image features and the target point cloud features; and
  • a motion control module, configured to control the target moving object to move based on the position corresponding to the scene object.
  • A computer device includes a memory and one or more processors. The memory stores computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the above object recognition method, including controlling the target moving object to move based on the position corresponding to the scene object.
  • One or more computer storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above object recognition method, including controlling the target moving object to move based on the position corresponding to the scene object.
  • FIG. 1 is an application scenario diagram of an object recognition method in accordance with one or more embodiments.
  • FIG. 2 is a schematic flowchart of an object recognition method in accordance with one or more embodiments.
  • FIG. 3 is a schematic flowchart of the steps for obtaining target point cloud features in accordance with one or more embodiments.
  • FIG. 4 is a schematic diagram of an object recognition system in accordance with one or more embodiments.
  • FIG. 5 is a block diagram of an apparatus for object recognition in accordance with one or more embodiments.
  • FIG. 6 is a block diagram of a computer device in accordance with one or more embodiments.
  • the object recognition method provided by this application can be applied to the application environment shown in FIG. 1 .
  • the application environment includes a terminal 102 and a server 104, and a point cloud collection device and an image collection device are installed in the terminal 102.
  • the point cloud collection device is used to collect point cloud data, such as the point cloud of the current scene.
  • the image acquisition device is used to acquire images, such as the current scene image.
  • The terminal 102 can transmit the collected current scene image and current scene point cloud to the server 104. The server 104 can obtain the current scene image and the current scene point cloud corresponding to the terminal 102 (the target moving object), perform image feature extraction on the current scene image to obtain initial image features, and perform point cloud feature extraction on the current scene point cloud to obtain initial point cloud features. The server 104 can then obtain the target image position corresponding to the current scene image and fuse the initial image features based on the point cloud features, among the initial point cloud features, that correspond to the target image position, to obtain target image features; obtain the target point cloud position corresponding to the current scene point cloud and fuse the initial point cloud features based on the image features, among the initial image features, that correspond to the target point cloud position, to obtain target point cloud features; determine the object position corresponding to the scene object based on the target image features and the target point cloud features; and control the terminal 102 to move based on the position corresponding to the scene object.
  • the terminal 102 may be, but is not limited to, self-driving cars and mobile robots.
  • the server 104 can be implemented by an independent server or a server cluster composed of multiple servers.
  • the point cloud collection device can be any device that can collect point cloud data, and it can be but not limited to lidar.
  • the image acquisition device may be any device that can acquire image data, and may be, but not limited to, a camera.
  • the above application scenario is only an example, and does not constitute a limitation on the object recognition method provided by the embodiment of the present application.
  • the object recognition method provided by the embodiment of the present application can also be applied to other application scenarios.
  • Alternatively, the above object recognition method may be performed by the terminal 102.
  • In one embodiment, an object recognition method is provided. The method is described here as applied to the server 104 in FIG. 1 as an example, and includes the following steps:
  • S202: Acquire a current scene image and a current scene point cloud corresponding to the target moving object.
  • A moving object refers to an object in a state of motion. It can be a living object, such as, but not limited to, a human or an animal, or an inanimate object, such as, but not limited to, a vehicle or a drone, for example, an autonomous vehicle.
  • the target moving object refers to the moving object whose movement is to be controlled according to the scene image and the scene point cloud.
  • the target moving object is, for example, the terminal 102 in FIG. 1 .
  • the scene image refers to the image corresponding to the scene where the moving object is located.
  • the scene image may reflect the environment where the moving object is located, for example, the scene image may include one or more of lanes, vehicles, pedestrians or obstacles in the environment.
  • The scene image may be acquired by an image acquisition device built into the moving object, for example, by a camera installed in an autonomous vehicle. It may also be acquired by an image acquisition device that is external to the moving object and associated with it, for example, a device connected to the moving object through a cable or a network, such as a camera on the road where the autonomous vehicle is located that is connected to the vehicle via a network.
  • the current scene image refers to an image corresponding to the current scene where the target moving object is located at the current time.
  • the current scene refers to the scene where the target moving object is located at the current time.
  • the external image acquisition device can transmit the acquired scene image to the moving object.
  • A point cloud refers to a collection of three-dimensional data points in a three-dimensional coordinate system, for example, the collection of three-dimensional data points corresponding to the surface of an object in a three-dimensional coordinate system. A point cloud can thus represent the shape of an object's outer surface.
  • a three-dimensional data point refers to a point in a three-dimensional space, and the three-dimensional data point includes three-dimensional coordinates, and the three-dimensional coordinates may include, for example, an X coordinate, a Y coordinate, and a Z coordinate.
  • the three-dimensional data points may also include at least one of RGB color, grayscale value, or time.
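  • As a concrete illustration of this data layout, the following is a minimal sketch of a three-dimensional data point carrying coordinates plus the optional attributes listed above; the field and class names are illustrative assumptions, not taken from the application:

      from dataclasses import dataclass
      from typing import List, Optional, Tuple

      @dataclass
      class Point3D:
          # Mandatory three-dimensional coordinates.
          x: float
          y: float
          z: float
          # Optional attributes a sensor may attach to each point.
          rgb: Optional[Tuple[int, int, int]] = None  # RGB color
          gray: Optional[float] = None                # grayscale value
          time: Optional[float] = None                # capture time

      # A point cloud is a collection of such three-dimensional data points.
      PointCloud = List[Point3D]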
  • the scene point cloud refers to a collection of 3D data points corresponding to the scene.
  • The point cloud can be obtained by scanning with a lidar. A lidar is an active sensor: it emits a laser beam, the beam bounces off the surface of an object, and the bounced laser signal is collected to obtain the point cloud of the object's surface.
  • the scene point cloud refers to the point cloud corresponding to the scene where the moving object is located.
  • The scene point cloud can be collected by a point cloud collection device built into the moving object, for example, by scanning with a lidar installed in an autonomous vehicle. It can also be collected by a point cloud collection device that is external to the moving object and associated with it, for example, a device connected to the moving object through a cable or a network, such as a lidar on the road where the autonomous vehicle is located that is connected to the vehicle via a network.
  • the current scene point cloud refers to the point cloud corresponding to the current scene where the target moving object is located at the current time.
  • the external point cloud acquisition device can transmit the scanned scene point cloud to the moving object.
  • the target moving object can collect the current scene in real time through an image acquisition device to obtain an image of the current scene, and can collect the current scene in real time through a point cloud acquisition device to obtain a point cloud of the current scene.
  • The target moving object can send the collected current scene image and current scene point cloud to the server. The server can determine the positions of obstacles on the running path of the target moving object according to the current scene image and the current scene point cloud, and transmit the positions to the target moving object, so that the target moving object can avoid the obstacles while moving.
  • S204: Perform image feature extraction on the current scene image to obtain initial image features, and perform point cloud feature extraction on the current scene point cloud to obtain initial point cloud features.
  • The image feature (Image Feature) is used to reflect the characteristics of an image.
  • The point cloud feature (Point Feature) is used to reflect the characteristics of a point cloud.
  • Image features have strong representation ability for slender objects such as pedestrians.
  • the point cloud feature can be represented in the form of a vector, and the point cloud feature can also be called a point cloud feature vector, and the point cloud feature vector can be, for example, (a1, b1, c1).
  • Point cloud features can also be called point features.
  • Point cloud features have lossless representation ability for point cloud information.
  • the image feature may be represented in the form of a vector, and the image feature may also be referred to as an image feature vector, and the image feature vector may be (a2, b2, c2), for example.
  • the initial image features refer to image features obtained by feature extraction from the current scene image.
  • the initial point cloud feature refers to the point cloud feature obtained by feature extraction from the current scene point cloud.
  • the server may obtain the object recognition model, and the object recognition model may include an image feature extraction layer and a point cloud feature extraction layer.
  • the server may input the current scene image into the image feature extraction layer, and the image feature extraction layer performs feature extraction on the current scene image, such as convolution, to obtain image features.
  • The server can obtain the initial image features according to the image features output by the image feature extraction layer; for example, the image features output by the image feature extraction layer can be used as the initial image features.
  • the server can input the current scene point cloud into the point cloud feature extraction layer, and the point cloud feature extraction layer performs feature extraction on the current scene point cloud, such as convolution, to obtain point cloud features.
  • The server can obtain the initial point cloud features according to the point cloud features output by the point cloud feature extraction layer; for example, the point cloud features output by the point cloud feature extraction layer can be used as the initial point cloud features.
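  • As an illustration of how two such extraction layers might look, here is a minimal sketch using PyTorch-style modules; the layer sizes, class names, and the PointNet-style shared MLP for points are assumptions for illustration, not details given in the application:

      import torch.nn as nn

      class ImageFeatureExtraction(nn.Module):
          # Convolutional image feature extraction layer (illustrative sizes).
          def __init__(self, out_channels: int = 64):
              super().__init__()
              self.conv = nn.Sequential(
                  nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                  nn.Conv2d(32, out_channels, kernel_size=3, padding=1), nn.ReLU(),
              )

          def forward(self, image):    # image: (B, 3, H, W)
              return self.conv(image)  # initial image features: (B, C, H, W)

      class PointFeatureExtraction(nn.Module):
          # Per-point feature extraction via a shared MLP (an assumption).
          def __init__(self, out_channels: int = 64):
              super().__init__()
              self.mlp = nn.Sequential(
                  nn.Linear(3, 32), nn.ReLU(),
                  nn.Linear(32, out_channels), nn.ReLU(),
              )

          def forward(self, points):   # points: (B, N, 3)
              return self.mlp(points)  # initial point cloud features: (B, N, C)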
  • the image feature extraction layer and the point cloud feature extraction layer are jointly trained.
  • During joint training, the server can input a scene image into the image feature extraction layer and a scene point cloud into the point cloud feature extraction layer, and obtain the predicted image features output by the image feature extraction layer and the predicted point cloud features output by the point cloud feature extraction layer. The server can also obtain the standard image features corresponding to the scene image and the standard point cloud features corresponding to the scene point cloud.
  • The standard image features refer to the real image features, and the standard point cloud features refer to the real point cloud features.
  • The first loss value is determined according to the predicted image features; for example, the first loss value is obtained according to the difference between the predicted image features and the standard image features.
  • The second loss value is determined according to the predicted point cloud features; for example, the second loss value is obtained according to the difference between the predicted point cloud features and the standard point cloud features.
  • the total loss value is determined according to the first loss value and the second loss value, and the total loss value may include the first loss value and the second loss value, for example, may be the result of adding the first loss value and the second loss value.
  • the server can use the total loss value to adjust the parameters of the image feature extraction layer and the parameters of the point cloud feature extraction layer to obtain the image feature extraction layer after training and the point cloud feature extraction layer after training.
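  • A minimal sketch of one joint training step, reusing the illustrative modules from the sketch above; adding the two loss values follows the example in the description, while mean-squared error as the difference measure is an assumption:

      import torch
      import torch.nn.functional as F

      image_net = ImageFeatureExtraction()   # from the sketch above
      point_net = PointFeatureExtraction()   # from the sketch above
      optimizer = torch.optim.Adam(
          list(image_net.parameters()) + list(point_net.parameters()), lr=1e-3)

      def joint_training_step(scene_image, scene_points,
                              std_image_feat, std_point_feat):
          pred_image_feat = image_net(scene_image)   # predicted image features
          pred_point_feat = point_net(scene_points)  # predicted point cloud features
          loss1 = F.mse_loss(pred_image_feat, std_image_feat)  # first loss value
          loss2 = F.mse_loss(pred_point_feat, std_point_feat)  # second loss value
          total = loss1 + loss2   # total loss value: first plus second
          optimizer.zero_grad()
          total.backward()
          optimizer.step()        # adjusts the parameters of both layers
          return total.item()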
  • S206: Acquire the target image position corresponding to the current scene image, and fuse the initial image features based on the point cloud features, among the initial point cloud features, that correspond to the target image position, to obtain the target image features.
  • the image position refers to the position of the image in the image coordinate system, and may include the corresponding positions of each pixel in the image in the image coordinate system.
  • the image coordinate system refers to the coordinate system adopted by the image acquired by the image acquisition device, and the coordinates of each pixel in the image can be obtained according to the image coordinate system.
  • the target image position refers to the position of each pixel in the current scene image in the image coordinate system.
  • the image position may be determined according to the parameters of the image acquisition device, for example, the parameters of the image acquisition device may be camera parameters, and the camera parameters may include external parameters of the camera and internal parameters of the camera.
  • the image coordinate system is a two-dimensional coordinate system, and the coordinates in the image coordinate system include abscissa and ordinate.
  • the point cloud feature corresponding to the target image position refers to the point cloud feature at the position in the point cloud coordinate system corresponding to the target image position in the initial point cloud feature.
  • the position in the point cloud coordinate system corresponding to the target image position may or may not overlap with the position corresponding to the initial point cloud feature.
  • the server can fuse the point cloud features corresponding to the overlapping positions with the initial image features to obtain the target image features.
  • For example, if the target image position is position A, the position in the point cloud coordinate system corresponding to position A is position B, the position of the initial point cloud features in the point cloud coordinate system is position C, and the overlapping part of position C and position B is position D, then the point cloud features corresponding to position D can be fused into the initial image features.
  • the fusion process refers to establishing an association relationship between different features at the same position in the same coordinate system, for example, establishing an association relationship between the image feature a corresponding to the position A and the point cloud feature b.
  • the fusion process may also be to obtain fusion features including the different features according to different features at the same position in the same coordinate system, for example, according to the image feature a corresponding to the position A and the point cloud feature b to obtain the fusion feature including a and b.
  • the fusion features can be represented in vector form.
  • The server may obtain the position in the point cloud coordinate system corresponding to the target image position, and fuse the initial image features according to the point cloud features, among the initial point cloud features, at the position in the point cloud coordinate system corresponding to the target image position, to obtain the target image features.
  • The object recognition model may also include an image spatial-domain fusion layer. The server may input the initial point cloud features and the initial image features into the image spatial-domain fusion layer; the layer may determine the coincident positions between the positions of the initial point cloud features and the positions of the initial image features, extract the point cloud features at the coincident positions from the initial point cloud features, and fuse them into the initial image features to obtain the target image features.
  • S208: Acquire the target point cloud position corresponding to the current scene point cloud, and fuse the initial point cloud features based on the image features, among the initial image features, that correspond to the target point cloud position, to obtain the target point cloud features.
  • The point cloud position refers to the position of the point cloud in the point cloud coordinate system, and may include the positions, in the point cloud coordinate system, of each three-dimensional data point in the point cloud.
  • the coordinates corresponding to each 3D data point in the point cloud can be obtained according to the point cloud coordinate system.
  • the target point cloud position refers to the point cloud position corresponding to each 3D data point in the current scene point cloud.
  • the position of the point cloud may be determined according to the parameters of the point cloud collection device, and the parameters of the point cloud collection device may be, for example, the parameters of the laser radar.
  • the point cloud coordinate system is a three-dimensional coordinate system, and the coordinates in the point cloud coordinate system may include X coordinate, Y coordinate and Z coordinate. Of course, the point cloud coordinate system can also be other types of three-dimensional coordinate systems, which are not limited here.
  • the image feature corresponding to the target point cloud position refers to the image feature at the position in the image coordinate system corresponding to the target point cloud position in the initial image feature.
  • the position in the image coordinate system corresponding to the target point cloud position may or may not overlap with the position corresponding to the initial image feature.
  • the server can fuse the image features corresponding to the overlapping positions with the initial point cloud features to obtain the target point cloud features.
  • The server may obtain the position in the image coordinate system corresponding to the target point cloud position, and fuse the initial point cloud features according to the image features, among the initial image features, at the position in the image coordinate system corresponding to the target point cloud position, to obtain the target point cloud features.
  • The object recognition model may also include a point cloud spatial-domain fusion layer. The server may input the initial point cloud features and the initial image features into the point cloud spatial-domain fusion layer; the layer may determine the coincident positions between the positions of the initial point cloud features and the positions of the initial image features, extract the image features at the coincident positions from the initial image features, and fuse them into the initial point cloud features to obtain the target point cloud features.
  • S210: Determine the object position corresponding to the scene object based on the target image features and the target point cloud features.
  • A scene object refers to an object in the scene where the target moving object is located. A scene object may be a living object, such as a person or an animal, or an inanimate object, such as a vehicle, a tree, or a stone.
  • the object position may include at least one of the position of the scene object in the current scene image or the position of the scene object in the current scene point cloud.
  • the scene objects in the current scene image and the scene objects in the current scene point cloud may be the same, or there may be differences.
  • the server may perform calculation according to the position of the target image feature and the position of the target point cloud feature to obtain the position of each scene object.
  • the server may perform time series fusion of target image features obtained from different video frames to obtain fused target image features, and perform image task learning according to the fused target image features.
  • Temporal fusion refers to concatenating image features of different frames, or concatenating point cloud features of different frames, or concatenating voxel features of different frames.
  • the server can fuse the target point cloud features obtained from different scene point clouds in time series, obtain the fused target point cloud features, and perform point cloud task learning according to the fused target point cloud features.
  • The server can further fuse the fused target image features and the fused target point cloud features to obtain secondary-fused target image features and secondary-fused target point cloud features, use the secondary-fused target image features to perform image task learning, and use the secondary-fused target point cloud features to perform point cloud task learning.
  • the target moving object is controlled to move based on the position corresponding to the scene object.
  • The server can transmit the position corresponding to the scene object to the target moving object. The target moving object can determine a movement route that avoids the scene object according to the position corresponding to the scene object and move according to that route, thereby avoiding the scene object and ensuring safe movement.
  • In the above object recognition method, the current scene image and the current scene point cloud corresponding to the target moving object are obtained; image feature extraction is performed on the current scene image to obtain initial image features, and point cloud feature extraction is performed on the current scene point cloud to obtain initial point cloud features; the target image position corresponding to the current scene image is obtained, and the initial image features are fused based on the point cloud features, among the initial point cloud features, that correspond to the target image position, to obtain target image features; the target point cloud position corresponding to the current scene point cloud is obtained, and the initial point cloud features are fused based on the image features, among the initial image features, that correspond to the target point cloud position, to obtain target point cloud features; the object position corresponding to the scene object is determined based on the target image features and the target point cloud features; and the target moving object is controlled to move based on the position corresponding to the scene object. The position of the scene object can thus be obtained accurately, so that the target moving object can avoid the scene object and move safely.
  • In one embodiment, obtaining the target image position corresponding to the current scene image and fusing the initial image features based on the point cloud features, among the initial point cloud features, that correspond to the target image position to obtain the target image features includes: converting the target point cloud position into a position in the image coordinate system according to the coordinate conversion relationship between the point cloud coordinate system and the image coordinate system, to obtain a first conversion position; and obtaining a first coincident position between the first conversion position and the target image position, and fusing the point cloud features, among the initial point cloud features, that correspond to the first coincident position into the image features, among the initial image features, that correspond to the first coincident position, to obtain the target image features.
  • the coordinate transformation relationship between the point cloud coordinate system and the image coordinate system refers to the transformation relationship between the coordinates in the point cloud coordinate system and the coordinates in the image coordinate system.
  • the object corresponding to the coordinates before transformation in the point cloud coordinate system is the same as the object corresponding to the coordinates after transformation in the image coordinate system.
  • the coordinate transformation relationship between the point cloud coordinate system and the image coordinate system is referred to as the first transformation relationship.
  • the coordinates of the position represented by the coordinates in the point cloud coordinate system in the image coordinate system can be determined through the first transformation relationship, that is, the image position corresponding to the target point cloud location in the image coordinate system can be determined through the first transformation relationship.
  • (x1, y1, z1) in the point cloud coordinate system can be converted into coordinates (x2, y2) in the image coordinate system through the first conversion relationship.
  • converting coordinates in one coordinate system to coordinates in another coordinate system can be called the process of physical space projection.
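  • A minimal sketch of this physical-space projection under a pinhole camera model; the calibration matrices are assumed inputs (for example, from lidar-camera calibration), not parameters given in the application:

      import numpy as np

      def project_points_to_image(points_3d, extrinsic, intrinsic):
          """Convert (x1, y1, z1) in the point cloud coordinate system into
          (x2, y2) in the image coordinate system (the first conversion
          relationship).

          points_3d: (N, 3) array in the point cloud coordinate system.
          extrinsic: (4, 4) lidar-to-camera transform (assumed calibration).
          intrinsic: (3, 3) camera intrinsic matrix.
          """
          # Homogeneous coordinates, then into the camera coordinate system.
          ones = np.ones((points_3d.shape[0], 1))
          pts_cam = (extrinsic @ np.hstack([points_3d, ones]).T)[:3]  # (3, N)
          # Perspective projection onto the image plane.
          pts_img = intrinsic @ pts_cam                               # (3, N)
          uv = pts_img[:2] / pts_img[2]                               # divide by depth
          return uv.T, pts_cam[2]  # pixel coordinates (N, 2) and depths (N,)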
  • the first transformation position refers to the position corresponding to the target point cloud position in the image coordinate system, and the first transformation position is the position in the two-dimensional coordinate system.
  • the first conversion position may include the two-dimensional coordinates in the image coordinate system of the three-dimensional coordinates of all or part of the three-dimensional data points corresponding to the target point cloud position.
  • the first coincident position refers to a position where the first conversion position coincides with the target image position.
  • the point cloud feature corresponding to the first coincidence position refers to the point cloud feature corresponding to the position of the first coincidence position in the point cloud coordinate system.
  • For example, if the first conversion position includes (x1, y1), (x2, y2) and (x3, y3), and the target image position includes (x2, y2), (x3, y3) and (x4, y4), then the first coincident position includes (x2, y2) and (x3, y3). If the position of (x2, y2) in the point cloud coordinate system is (x1, y1, z1), and the position of (x3, y3) in the point cloud coordinate system is (x2, y2, z2), then the point cloud features corresponding to the first coincident position include the point cloud features corresponding to (x1, y1, z1) and the point cloud features corresponding to (x2, y2, z2).
  • The server may splice (concatenate) the point cloud features corresponding to the first coincident position with the image features corresponding to the first coincident position to obtain the target image features.
  • For example, if the image feature corresponding to the first coincident position is vector A and the point cloud feature corresponding to the first coincident position is vector B, the server can splice vector B onto vector A to obtain a spliced vector.
  • The target image features can be obtained from the spliced vector; for example, the spliced vector can be used as the target image features, or the spliced vector can be processed to obtain the target image features.
  • Alternatively, the server may convert the target image position into a position in the point cloud coordinate system according to the coordinate conversion relationship between the image coordinate system and the point cloud coordinate system to obtain the point cloud position corresponding to the target image position, extract the corresponding point cloud features from the initial point cloud features according to that point cloud position, and fuse them into the initial image features to obtain the target image features.
  • In this embodiment, the target point cloud position is converted into a position in the image coordinate system to obtain the first conversion position; the first coincident position between the first conversion position and the target image position is obtained; and the point cloud features, among the initial point cloud features, that correspond to the first coincident position are fused into the image features, among the initial image features, that correspond to the first coincident position, to obtain the target image features. The target image features thus include both image features and point cloud features, which improves the richness of the features in the target image features and the representation ability of the target image features.
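  • A minimal sketch of this image-side fusion, building on the projection helper sketched earlier; rounding projected points to integer pixel positions as the coincidence test, and channel-wise concatenation as the fusion, are assumed implementation details:

      import numpy as np

      def fuse_point_features_into_image(image_feat, points_3d, point_feat,
                                         extrinsic, intrinsic):
          """image_feat: (C_img, H, W) initial image features.
          point_feat: (N, C_pt) initial point cloud features.
          Returns target image features of shape (C_img + C_pt, H, W)."""
          C_img, H, W = image_feat.shape
          N, C_pt = point_feat.shape
          fused_extra = np.zeros((C_pt, H, W), dtype=image_feat.dtype)

          uv, depth = project_points_to_image(points_3d, extrinsic, intrinsic)
          cols = np.round(uv[:, 0]).astype(int)  # first conversion position (u)
          rows = np.round(uv[:, 1]).astype(int)  # first conversion position (v)
          # First coincident positions: projected points that land inside the
          # image (and in front of the camera).
          valid = (depth > 0) & (cols >= 0) & (cols < W) & (rows >= 0) & (rows < H)
          fused_extra[:, rows[valid], cols[valid]] = point_feat[valid].T

          # Splice the point cloud features onto the image features channel-wise.
          return np.concatenate([image_feat, fused_extra], axis=0)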
  • In one embodiment, obtaining the target point cloud position corresponding to the current scene point cloud and fusing the initial point cloud features based on the image features, among the initial image features, that correspond to the target point cloud position to obtain the target point cloud features includes: converting the target image position into a position in the point cloud coordinate system according to the coordinate conversion relationship between the image coordinate system and the point cloud coordinate system, to obtain a second conversion position; and obtaining a second coincident position between the second conversion position and the target point cloud position, and fusing the image features, among the initial image features, that correspond to the second coincident position into the point cloud features, among the initial point cloud features, that correspond to the second coincident position, to obtain the target point cloud features.
  • the coordinate transformation relationship between the image coordinate system and the point cloud coordinate system refers to the transformation relationship between the coordinates in the image coordinate system and the coordinates in the point cloud coordinate system.
  • the object corresponding to the coordinates before transformation in the image coordinate system is the same as the object corresponding to the coordinates after transformation in the point cloud coordinate system.
  • In the following description, the coordinate transformation relationship between the image coordinate system and the point cloud coordinate system is referred to as the second transformation relationship.
  • the coordinates of the position represented by the coordinates in the image coordinate system in the point cloud coordinate system can be determined through the second conversion relationship.
  • the second conversion position refers to the position corresponding to the target image position in the point cloud coordinate system, and the second conversion position is the position in the three-dimensional coordinate system.
  • the second conversion position may include the three-dimensional coordinates of all or part of the two-dimensional coordinates corresponding to the target image position in the point cloud coordinate system.
  • the second coincident position refers to the position where the second transformation position coincides with the target point cloud position.
  • the image feature corresponding to the second coincident position refers to the image feature corresponding to the two-dimensional coordinates in the image coordinate system corresponding to the second coincident position.
  • the target point cloud feature is a feature obtained by fusing the image feature corresponding to the second coincident position into the point cloud feature corresponding to the second coincident position in the initial point cloud feature.
  • the server may perform feature fusion between the image feature corresponding to the second coincident position and the point cloud feature corresponding to the second coincident position to obtain the target point cloud feature.
  • Feature fusion may include one or more of arithmetic operations, combination or concatenation of features. Arithmetic operations may include one or more of addition, subtraction, multiplication or division.
  • the server may obtain the target point cloud feature by splicing the image feature corresponding to the second coincident position to the point cloud feature corresponding to the second coincident position.
  • For example, if the point cloud feature corresponding to the second coincident position is vector C and the image feature corresponding to the second coincident position is vector D, the server can splice vector D onto vector C to obtain a spliced vector, and obtain the target point cloud features from the spliced vector.
  • For example, the spliced vector can be used as the target point cloud features, or the spliced vector can be processed to obtain the target point cloud features.
  • Alternatively, the server may convert the target point cloud position into a position in the image coordinate system according to the coordinate transformation relationship between the point cloud coordinate system and the image coordinate system to obtain the image position corresponding to the target point cloud position, extract the corresponding image features from the initial image features according to that image position, and fuse them into the initial point cloud features to obtain the target point cloud features.
  • The image features at exactly the same position as that image position can be extracted from the initial image features, or the image features at a position whose difference from that image position is smaller than a position difference threshold can be extracted from the initial image features, and fused into the initial point cloud features to obtain the target point cloud features.
  • the position difference threshold can be set as required, or can be preset.
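  • A small sketch of this threshold-based matching; using Euclidean pixel distance as the position difference is an assumption:

      import numpy as np

      def match_within_threshold(projected_uv, image_positions, threshold):
          """For each projected point position, find the index of the nearest
          image feature position, or -1 if the difference exceeds the
          position difference threshold.

          projected_uv: (N, 2) projected point positions in the image plane.
          image_positions: (M, 2) positions where initial image features live.
          """
          matches = np.full(projected_uv.shape[0], -1, dtype=int)
          for i, uv in enumerate(projected_uv):
              d = np.linalg.norm(image_positions - uv, axis=1)  # pixel distances
              j = int(np.argmin(d))
              if d[j] < threshold:  # position difference below the threshold
                  matches[i] = j
          return matches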
  • In this embodiment, the target image position is converted into a position in the point cloud coordinate system to obtain the second conversion position; the second coincident position between the second conversion position and the target point cloud position is obtained; and the image features, among the initial image features, that correspond to the second coincident position are fused into the point cloud features, among the initial point cloud features, that correspond to the second coincident position, to obtain the target point cloud features. The target point cloud features thus include both image features and point cloud features, which improves the feature richness of the target point cloud features and their representation ability.
  • In one embodiment, obtaining the second coincident position between the second conversion position and the target point cloud position and fusing the image features, among the initial image features, that correspond to the second coincident position into the point cloud features, among the initial point cloud features, that correspond to the second coincident position to obtain the target point cloud features includes: voxelizing the current scene point cloud to obtain a voxelization result; performing voxel feature extraction according to the voxelization result to obtain initial voxel features; fusing the image features, among the initial image features, that correspond to the second coincident position into the point cloud features, among the initial point cloud features, that correspond to the second coincident position, to obtain intermediate point cloud features; and performing steps S308 and S310 below.
  • A voxel is an abbreviation of volume element (Volume Pixel).
  • Voxelization refers to dividing a point cloud into multiple voxels according to a given voxel size.
  • the dimensions of each voxel in the X, Y and Z axis directions may be, for example, w, h and e, respectively.
  • the voxels obtained by segmentation include empty voxels and non-empty voxels, empty voxels do not include points in the point cloud, and non-empty voxels include points in the point cloud.
  • the voxelization result may include at least one of the number of voxels obtained after voxelization, the position information of the voxels, or the size of the voxels.
  • a voxel feature is a feature used to represent a voxel.
  • Voxel features can accelerate the convergence of the network model and reduce its complexity.
  • The server can sample the same number of points from each voxel according to the number of points included in the voxel in the voxelization result to obtain the sampling points corresponding to the voxel, and perform feature extraction according to the sampling points corresponding to the voxel to obtain the initial voxel features corresponding to the voxel.
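  • A minimal sketch of voxelization plus fixed-count point sampling as described above; the voxel size (w, h, e) and the sample count are illustrative assumptions:

      import numpy as np

      def voxelize(points_3d, voxel_size=(0.2, 0.2, 0.4), samples_per_voxel=32):
          """Divide a point cloud into voxels of size (w, h, e) and sample the
          same number of points from each non-empty voxel.

          Returns a dict mapping voxel index (ix, iy, iz) to an array of shape
          (samples_per_voxel, 3)."""
          rng = np.random.default_rng(0)
          indices = np.floor(points_3d / np.asarray(voxel_size)).astype(int)
          voxels = {}
          for idx in np.unique(indices, axis=0):
              in_voxel = points_3d[(indices == idx).all(axis=1)]  # non-empty voxel
              # Sample with replacement so every voxel yields the same count.
              picks = rng.choice(len(in_voxel), size=samples_per_voxel, replace=True)
              voxels[tuple(idx)] = in_voxel[picks]
          return voxels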
  • the voxel feature recognition model refers to a model that extracts voxel features.
  • the object recognition model further includes a voxel feature extraction layer, and the voxel feature extraction layer may be obtained by joint training with the image feature extraction layer and the point cloud feature extraction layer.
  • the server can input the scene point cloud into the voxel feature extraction layer, and obtain the voxel feature output by the voxel feature extraction layer.
  • the intermediate point cloud feature is a feature obtained by fusing the image feature corresponding to the second coincident position into the point cloud feature corresponding to the second coincident position in the initial point cloud feature.
  • S308: Obtain the target voxel position corresponding to the current scene point cloud, and convert the target voxel position into a position in the point cloud coordinate system according to the coordinate transformation relationship between the voxel coordinate system and the point cloud coordinate system, to obtain a third conversion position.
  • the voxel position refers to the position of the voxel in the voxel coordinate system.
  • the target voxel position refers to the position of the voxel corresponding to the current scene point cloud in the voxel coordinate system.
  • the target voxel position may include the respective positions of each voxel corresponding to the current scene point cloud in the voxel coordinate system.
  • the coordinates of the voxel can be obtained according to the voxel coordinate system.
  • the coordinate conversion relationship between the voxel coordinate system and the point cloud coordinate system refers to the conversion relationship between the coordinates in the voxel coordinate system and the coordinates in the point cloud coordinate system.
  • the voxel coordinate system is a three-dimensional coordinate system. In the following description, the coordinate conversion relationship between the voxel coordinate system and the point cloud coordinate system is referred to as the third conversion relationship.
  • the third transformation position refers to the corresponding position of the target voxel position in the point cloud coordinate system.
  • the third transformed position is the position in the point cloud coordinate system.
  • S310: Obtain a third coincident position between the third conversion position and the target voxel position, and fuse the voxel features, among the initial voxel features, that correspond to the third coincident position into the point cloud features, among the intermediate point cloud features, that correspond to the third conversion position, to obtain the target point cloud features.
  • the third coincident position refers to the coincidence position of the third conversion position and the target voxel position.
  • the voxel feature corresponding to the third coincident position refers to the voxel feature at the corresponding position of the third coincident position in the voxel coordinate system.
  • the server may perform feature fusion between the voxel feature corresponding to the third coincident position in the initial voxel feature and the point cloud feature corresponding to the third transformation position in the intermediate point cloud feature to obtain the target point cloud feature.
  • In this embodiment, the current scene point cloud is voxelized to obtain the voxelization result; voxel feature extraction is performed according to the voxelization result to obtain the initial voxel features; the second coincident position between the second conversion position and the target point cloud position is obtained; the image features, among the initial image features, that correspond to the second coincident position are fused into the point cloud features, among the initial point cloud features, that correspond to the second coincident position, to obtain the intermediate point cloud features; the target voxel position corresponding to the current scene point cloud is obtained; the target voxel position is converted into a position in the point cloud coordinate system to obtain the third conversion position; the third coincident position between the third conversion position and the target voxel position is obtained; and the voxel features, among the initial voxel features, that correspond to the third coincident position are fused into the point cloud features, among the intermediate point cloud features, that correspond to the third conversion position, to obtain the target point cloud features. Since the intermediate point cloud features include point cloud features and image features, the target point cloud features include image features, point cloud features, and voxel features, which improves the feature richness of the target point cloud features.
  • In one embodiment, the method further includes: voxelizing the current scene point cloud to obtain a voxelization result; performing voxel feature extraction according to the voxelization result to obtain initial voxel features; obtaining the target voxel position corresponding to the current scene point cloud; converting the target image position into a position in the voxel coordinate system according to the coordinate conversion relationship between the image coordinate system and the voxel coordinate system, to obtain a fourth conversion position; obtaining a fourth coincident position between the fourth conversion position and the target voxel position; and fusing the image features, among the initial image features, that correspond to the fourth coincident position into the voxel features, among the initial voxel features, that correspond to the fourth coincident position, to obtain target voxel features.
  • the coordinate conversion relationship between the image coordinate system and the voxel coordinate system refers to the conversion relationship of converting coordinates in the image coordinate system into coordinates in the voxel coordinate system.
  • the object corresponding to the coordinates before transformation in the image coordinate system is the same as the object corresponding to the coordinates after transformation in the voxel coordinate system.
  • In the following description, the coordinate conversion relationship between the image coordinate system and the voxel coordinate system is referred to as the fourth conversion relationship.
  • the coordinates of the position represented by the coordinates in the image coordinate system in the voxel coordinate system can be determined through the fourth conversion relationship.
  • the fourth transformation position refers to the position corresponding to the target image position in the voxel coordinate system, and the fourth transformation position is the position in the three-dimensional coordinate system.
  • the fourth conversion position may include the three-dimensional coordinates in the voxel coordinate system of all or part of the two-dimensional coordinates corresponding to the target image position.
  • the fourth coincident position refers to a position where the fourth transformation position coincides with the target voxel position.
  • the image feature corresponding to the fourth overlapping position refers to the image feature corresponding to the two-dimensional coordinates in the image coordinate system corresponding to the fourth overlapping position.
  • the target voxel feature is a feature obtained by fusing the image feature corresponding to the fourth coincident position into the voxel feature corresponding to the fourth coincident position in the initial voxel feature.
  • the server may perform feature fusion between the image feature corresponding to the fourth coincident position and the voxel feature corresponding to the fourth coincident position to obtain the target voxel feature.
  • Alternatively, the server may convert the target voxel position into a position in the image coordinate system according to the coordinate transformation relationship between the voxel coordinate system and the image coordinate system to obtain the image position corresponding to the target voxel position, extract the corresponding image features from the initial image features according to that image position, and fuse them into the initial voxel features to obtain the target voxel features.
  • For example, the center position of each voxel can be projected into the image coordinate system to obtain a center image position, and the image features, among the initial image features, at positions whose difference from the center image position is less than a difference threshold can be extracted and fused into the initial voxel features to obtain the target voxel features.
  • the difference threshold can be set as required or preset.
  • The object recognition model may further include a voxel spatial fusion layer. The server may input the image features and the voxel features into the voxel spatial fusion layer; the layer may determine the coincident positions between the positions of the image features and the positions of the voxel features, extract the image features at the coincident positions from the image features, and fuse them into the voxel features to obtain the target voxel features.
  • The object recognition model can also include a point and voxel fusion layer. The server can input the target voxel features and the intermediate point cloud features into the point and voxel fusion layer; the layer may determine the coincident positions between the positions of the target voxel features and the positions of the intermediate point cloud features, extract the voxel features at the coincident positions from the target voxel features, and fuse them into the intermediate point cloud features to obtain the target point cloud features.
  • the point and voxel fusion layer can also be referred to as a point cloud and voxel fusion layer.
  • In this embodiment, the current scene point cloud is voxelized to obtain the voxelization result; voxel feature extraction is performed according to the voxelization result to obtain the initial voxel features; the target voxel position corresponding to the current scene point cloud is obtained; the target image position is converted into a position in the voxel coordinate system to obtain the fourth conversion position; the fourth coincident position between the fourth conversion position and the target voxel position is obtained; and the image features, among the initial image features, that correspond to the fourth coincident position are fused into the voxel features, among the initial voxel features, that correspond to the fourth coincident position, to obtain the target voxel features. The target voxel features thus include both voxel features and image features, which improves the representation ability of the target voxel features and the richness of the features.
  • In one embodiment, determining the object position corresponding to the scene object based on the target image features and the target point cloud features includes: acquiring the associated scene image corresponding to the current scene image and the associated scene point cloud corresponding to the current scene point cloud; acquiring the associated image features corresponding to the associated scene image and the associated point cloud features corresponding to the associated scene point cloud; performing feature fusion on the target image features and the associated image features in chronological order to obtain target image time-series features; performing feature fusion on the target point cloud features and the associated point cloud features in chronological order to obtain target point cloud time-series features; and determining the object position corresponding to the scene object based on the target image time-series features and the target point cloud time-series features.
  • the associated scene image refers to an image associated with the current scene image.
  • the associated scene image may be a forward frame collected before the current moment or a backward frame collected later by the image capture device that obtained the current scene image.
  • The forward frame can be used directly as the associated scene image, or the current scene image and the forward frame can be checked for coincident detected objects: if a coincident detected object exists between the current scene image and the forward frame, the forward frame is used as the associated scene image of the current scene image. For example, if vehicle A exists in the current scene image and vehicle A also exists in the forward frame, the forward frame may be used as the associated scene image of the current scene image.
  • the current scene image and the associated scene image may be different video frames in the same video, for example, may be different video frames in the video captured by the image capturing device.
  • the associated scene image may be a video frame captured before or after the current scene image.
  • the method of obtaining the associated image features may refer to the obtaining method of the target image features.
  • the associated scene point cloud refers to the point cloud associated with the current scene point cloud.
  • the associated scene point cloud may be the scene point cloud collected before or after the current moment by the point cloud collection device that collected the current scene point cloud.
  • the method of obtaining the associated point cloud features can refer to the obtaining method of the target point cloud features.
  • The server may combine the target image features and the associated image features according to the chronological order of the associated scene image and the current scene image to obtain combined image features, in which features that are earlier in time are arranged before features that are later in time.
  • The server may obtain the target image time-series features according to the combined image features; for example, the combined image features may be used as the target image time-series features, or the combined image features may be processed to obtain the target image time-series features.
  • The server may combine the target point cloud features and the associated point cloud features according to the chronological order of the associated scene point cloud and the current scene point cloud to obtain combined point cloud features, in which the earlier point cloud features are arranged before the later ones. The server can then obtain the target point cloud time-series features according to the combined point cloud features; for example, the combined point cloud features can be used as the target point cloud time-series features, or the combined point cloud features can be processed to obtain the target point cloud time-series features.
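  • A minimal sketch of this time-series fusion by chronological concatenation; the per-frame feature shape and the concatenation axis are assumptions:

      import numpy as np

      def temporal_fusion(feats_by_time):
          """feats_by_time: list of (timestamp, features) pairs, where each
          features array has shape (N, C). Features that are earlier in time
          are arranged before features that are later in time, then joined."""
          ordered = [f for _, f in sorted(feats_by_time, key=lambda tf: tf[0])]
          return np.concatenate(ordered, axis=-1)  # combined time-series features

      # e.g. fusing associated (earlier) and target (current) image features:
      # target_image_ts = temporal_fusion([(t0, assoc_feat), (t1, target_feat)])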
  • The server may also obtain the associated voxel features corresponding to the associated scene point cloud, and perform feature fusion on the target voxel features and the associated voxel features in chronological order to obtain target voxel time-series features.
  • The server may then perform feature fusion among the target image time-series features, the target point cloud time-series features, and the target voxel time-series features to obtain secondary-fusion image features, secondary-fusion voxel features, and secondary-fusion point cloud features.
  • The feature fusion among the target image time-series features, the target point cloud time-series features, and the target voxel time-series features can refer to the feature fusion method among the initial image features, the initial point cloud features, and the initial voxel features.
  • The server can use the secondary-fusion image features, the secondary-fusion voxel features, and the secondary-fusion point cloud features to perform image task learning, voxel task learning, and point cloud task learning, respectively.
  • The image features may include position information of an object. The server may obtain the position of the object in the target image features as a first position and the position of the object in the associated image features as a second position, and determine the motion state of the object according to the first position and the second position. For example, whether the object has changed lanes or turned can be determined according to the relative relationship between the first position and the second position, and the movement speed of the object can be determined according to the difference between the first position and the second position.
  • Similarly, the point cloud features and the voxel features may also include position information of the object, so the point cloud features and the voxel features may likewise be used to determine the motion state of the object, as sketched below.
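  • A small sketch of such a motion-state estimate from two frames; treating the positions as planar coordinates and dividing by the frame interval is an assumed concretization:

      import numpy as np

      def motion_state(first_position, second_position, dt):
          """Estimate speed and heading from the object's position in the
          target features (first position) and in the associated features
          (second position), dt seconds apart."""
          delta = np.asarray(second_position) - np.asarray(first_position)
          speed = float(np.linalg.norm(delta)) / dt                    # movement speed
          heading = float(np.degrees(np.arctan2(delta[1], delta[0])))  # direction
          return speed, heading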
  • In this embodiment, the associated scene image corresponding to the current scene image and the associated scene point cloud corresponding to the current scene point cloud are obtained; the associated image features corresponding to the associated scene image and the associated point cloud features corresponding to the associated scene point cloud are obtained; the target image features and the associated image features are fused in chronological order to obtain the target image time-series features; the target point cloud features and the associated point cloud features are fused in chronological order to obtain the target point cloud time-series features; and the object position corresponding to the scene object is determined based on the target image time-series features and the target point cloud time-series features. The target image time-series features thus include image features of different scene images, and the target point cloud time-series features include point cloud features of different scene point clouds, which improves the accuracy of the scene object position.
  • determining the object position corresponding to the scene object based on the target image feature and the target point cloud feature includes: determining a combined position between the target image feature and the target point cloud feature to obtain the target combined position; and taking the target combined position as the object position corresponding to the scene object.
  • the combined position may be a combination of the position corresponding to the target image feature and the position corresponding to the target point cloud feature.
  • the server can represent the position corresponding to the target image feature and the position corresponding to the target point cloud feature with coordinates in the same coordinate system; for example, using coordinates in the image coordinate system, it can obtain the first feature position corresponding to the target image feature and the second feature position corresponding to the target point cloud feature, and calculate the combination of the first feature position and the second feature position to obtain the object position corresponding to the scene object.
  • the combined position between the target image feature and the target point cloud feature is determined to obtain the target combined position, and the target combined position is used as the object position corresponding to the scene object, which improves the accuracy of the object position; a sketch of such a combination follows.
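A minimal sketch, assuming "combining" means taking the union of the two position sets once both are expressed in the image coordinate system; the application does not fix the exact combination rule, so this is one plausible reading.

```python
import numpy as np

def combine_positions(first_feature_pos, second_feature_pos):
    """Combine the position sets from the two modalities, expressed in
    the same (image) coordinate system; the union serves as the object
    position corresponding to the scene object."""
    combined = np.vstack([first_feature_pos, second_feature_pos])
    return np.unique(combined, axis=0)

first_feature_pos = np.array([[120, 80], [121, 80]])   # from target image feature
second_feature_pos = np.array([[121, 80], [122, 81]])  # from projected point cloud feature
object_position = combine_positions(first_feature_pos, second_feature_pos)
```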
  • the server may perform task learning using at least one of target image features, target point cloud features, or target voxel features.
  • Tasks can include low-level tasks and high-level tasks; low-level tasks can include point-level semantic segmentation and scene flow estimation, voxel-level semantic segmentation and scene flow estimation, and pixel-level semantic segmentation and scene flow estimation.
  • High-level tasks can include object detection, scene recognition, and instance segmentation.
  • an object recognition system mainly includes a first multi-sensor feature extraction (Multi-Sensor Feature Extraction) module, a temporal fusion (Temporal Fusion) module, a second multi-sensor feature extraction module, an Image View Tasks learning module, a Voxel Tasks learning module and a Point Tasks learning module.
  • each module can be implemented by one or more neural network models.
  • the multi-sensor feature extraction module supports the fusion method of a single sensor and multiple sensors, that is, the input can be the data collected by a single sensor, or the data collected by multiple sensors separately.
  • the sensor may be, for example, at least one of an image acquisition device or a point cloud acquisition device.
  • the multi-sensor feature extraction module includes an Image Feature Extraction module, a Point Feature Extraction module, a Voxel Feature Extraction module, an Image Spatial Fusion module, a Point Cloud Spatial Fusion module, a Voxel Spatial Fusion module, and a Point-Voxel Fusion module.
  • the image spatial domain fusion module is used to fuse point cloud features into image features
  • the point cloud spatial domain fusion module is used to fuse image features into point features
  • the voxel spatial domain fusion module is used to fuse image features into voxel features.
  • the point-voxel fusion module is used to fuse point features into voxel features and voxel features into point features.
  • the temporal fusion module is used to fuse the features obtained from different frames, that is, to concatenate them along the feature dimension.
  • the temporal fusion module thus fuses the preceding and following temporal information of the features.
  • the image features can be concatenated in the pixel dimension, or the two features can be correlated.
  • for point cloud features, similarly to FlowNet3D, feature extraction can be performed over each point's neighborhood, analogous to a correlation operation; a sketch of such neighborhood aggregation follows. Operations on voxel features are similar to those on image features, except that voxel features deal with three-dimensional data.
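The following sketch shows a simplified stand-in for a FlowNet3D-style neighborhood feature operation: for each point, the features of all points within a radius are aggregated by max-pooling. The radius, pooling choice, and shapes are illustrative assumptions.

```python
import numpy as np

def group_point_features(points, feats, radius=1.0):
    """For each point, max-pool the features of all points within
    `radius` of it, a crude neighborhood feature extraction."""
    n = points.shape[0]
    out = np.empty_like(feats)
    for i in range(n):
        d = np.linalg.norm(points - points[i], axis=1)
        neighbors = feats[d <= radius]     # always includes the point itself
        out[i] = neighbors.max(axis=0)     # max-pool over the neighborhood
    return out

pts = np.random.rand(100, 3) * 10   # N x 3 point coordinates
f = np.random.rand(100, 16)         # N x C per-point features
fused = group_point_features(pts, f)
```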
  • multi-sensor multi-task fusion can be performed by an object recognition system, which mainly includes the following steps (a sketch of the overall flow follows the steps):
  • Step 1: input the images and point clouds of consecutive frames;
  • Step 2: input the image and point cloud at each moment into the multi-sensor feature extraction module;
  • Step 3: the multi-sensor feature extraction module outputs the image features, point features and voxel features at each moment respectively;
  • Step 4: the image features, point features and voxel features output by the multi-sensor feature extraction module are each fused in time series, yielding three time series features: image time series features, point time series features and voxel time series features;
  • Step 5: input the three time series features obtained in Step 4 into the multi-sensor feature extraction module and perform feature fusion again to obtain the final image features, final point features and final voxel features;
  • Step 6: based on the final image feature (Final ImageView Feature), the final point feature (Final Point Feature) and the final voxel feature (Final Voxel Feature), perform task learning at the image level, point level and voxel level.
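A hypothetical end-to-end sketch of steps 1-6. Every callable here (`extract`, `temporal_fuse`, `refuse`, and the task heads) is a placeholder for a neural network module; the signatures are assumptions made for illustration.

```python
def run_pipeline(frames, extract, temporal_fuse, refuse, heads):
    """frames: list of (image, point_cloud) pairs from consecutive moments.
    extract(image, pc) -> (image_feat, point_feat, voxel_feat)      # steps 2-3
    temporal_fuse(list_of_feats) -> one time series feature         # step 4
    refuse(img_ts, pt_ts, vox_ts) -> three final fused features     # step 5
    heads: dict of task heads for image/point/voxel task learning   # step 6
    """
    per_moment = [extract(img, pc) for img, pc in frames]   # step 1 input
    img_ts = temporal_fuse([m[0] for m in per_moment])
    pt_ts = temporal_fuse([m[1] for m in per_moment])
    vox_ts = temporal_fuse([m[2] for m in per_moment])
    final_img, final_pt, final_vox = refuse(img_ts, pt_ts, vox_ts)
    return (heads["image"](final_img),
            heads["point"](final_pt),
            heads["voxel"](final_vox))
```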
  • the multi-sensor feature extraction module can reselect its feature input according to which sensors are valid, that is, it can select the valid sensors.
  • the data collected by the valid sensors is used as the input data of the multi-sensor feature extraction module; for example, if the camera fails, the data collected by the lidar can still be used for the point tasks and voxel tasks. A camera failure may be a malfunction of the camera; a valid sensor is a properly functioning sensor. A sketch of this selection follows.
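A minimal sketch of the valid-sensor selection, assuming each sensor reports a validity flag alongside its data; the dictionary layout is an illustrative assumption.

```python
def select_inputs(sensors):
    """Keep only the data from sensors that are currently functioning.

    sensors: dict mapping sensor name to (is_valid, data). If the camera
    has failed, the lidar data alone still feeds the point and voxel tasks.
    """
    return {name: data for name, (is_valid, data) in sensors.items() if is_valid}

inputs = select_inputs({
    "camera": (False, None),          # camera malfunction
    "lidar": (True, "point_cloud"),   # placeholder payload
})
# inputs == {"lidar": "point_cloud"} -> only point/voxel tasks are learned
```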
  • the effectiveness of task learning is improved because the tasks cover the range from low level to high level.
  • all the tasks can be trained jointly to improve the performance of the target task.
  • in deep learning, inference refers to applying the capability learned during training to new data.
  • the inference phase can be understood as the phase in which the trained model is used.
  • the object recognition system and object recognition method proposed above can be applied to autonomous driving perception algorithms.
  • tasks such as object detection, semantic segmentation, and scene flow estimation can be achieved.
  • the results of scene flow estimation and semantic segmentation can be used as clues for non-deep learning object detection methods based on point clouds, such as the cost term of clustering in cluster-based object detection.
  • although the steps in the flowcharts of FIGS. 2-4 are shown in sequence according to the arrows, these steps are not necessarily executed in the sequence shown by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily executed or completed at the same time, but may be executed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
  • an object recognition apparatus includes: a current scene image acquisition module 502, an initial point cloud feature obtaining module 504, a target image feature obtaining module 506, a target point cloud feature obtaining module 508, a position determination module 510 and a motion control module 512, wherein:
  • the current scene image acquisition module 502 is configured to acquire the current scene image and the current scene point cloud corresponding to the target moving object.
  • the initial point cloud feature obtaining module 504 is configured to perform image feature extraction on the current scene image to obtain initial image features, and perform point cloud feature extraction on the current scene point cloud to obtain initial point cloud features.
  • the target image feature obtaining module 506 is used to obtain the target image position corresponding to the current scene image, and based on the point cloud features corresponding to the target image position among the initial point cloud features, fuse the initial image features to obtain the target image features.
  • the target point cloud feature obtaining module 508 is used to obtain the target point cloud position corresponding to the current scene point cloud, and based on the image features corresponding to the target point cloud position among the initial image features, fuse the initial point cloud features to obtain the target point cloud features.
  • the position determination module 510 is configured to determine the object position corresponding to the scene object based on the target image feature and the target point cloud feature.
  • the motion control module 512 is configured to control the target moving object to move based on the position corresponding to the scene object.
  • the target image feature obtaining module 506 includes:
  • the first conversion position obtaining unit is used for converting the target point cloud position into the position in the image coordinate system according to the coordinate conversion relationship between the point cloud coordinate system and the image coordinate system, so as to obtain the first conversion position.
  • the target image feature obtaining unit is used to obtain the first coincidence position of the first conversion position and the target image position, and fuse the point cloud feature corresponding to the first coincidence position in the initial point cloud feature into the image feature corresponding to the first coincidence position in the initial image feature, to obtain the target image feature.
  • the target point cloud feature obtaining module 508 includes:
  • the second conversion position obtaining unit is used for converting the target image position into a position in the point cloud coordinate system according to the coordinate conversion relationship between the image coordinate system and the point cloud coordinate system to obtain the second conversion position.
  • the target point cloud feature obtaining unit is used to obtain the second coincidence position between the second conversion position and the target point cloud position, and fuse the image feature corresponding to the second coincidence position in the initial image feature into the point cloud feature corresponding to the second coincidence position in the initial point cloud feature, to obtain the target point cloud feature.
  • the target point cloud feature obtaining unit is further configured to: voxelize the current scene point cloud to obtain a voxelization result; perform voxel feature extraction according to the voxelization result to obtain an initial voxel feature; obtain the second coincidence position between the second conversion position and the target point cloud position, and fuse the image features corresponding to the second coincidence position in the initial image features into the point cloud features corresponding to the second coincidence position in the initial point cloud features, to obtain an intermediate point cloud feature;
  • obtain the target voxel position corresponding to the current scene point cloud, and convert the target voxel position into a position in the point cloud coordinate system according to the coordinate conversion relationship between the voxel coordinate system and the point cloud coordinate system, to obtain a third conversion position; and obtain the third coincidence position of the third conversion position and the target point cloud position, and fuse the voxel features corresponding to the third coincidence position in the initial voxel features into the point cloud features corresponding to the third coincidence position in the intermediate point cloud features, to obtain the target point cloud feature.
  • the apparatus further includes:
  • the voxelization result obtaining module is used to voxelize the point cloud of the current scene to obtain the voxelization result.
  • the initial voxel feature obtaining module is used to extract the voxel feature according to the voxelization result to obtain the initial voxel feature.
  • the fourth conversion position obtaining module is used to obtain the target voxel position corresponding to the current scene point cloud, and convert the target image position into a position in the voxel coordinate system according to the coordinate conversion relationship between the image coordinate system and the voxel coordinate system, to obtain the fourth conversion position.
  • the target voxel feature obtaining module is used to obtain the fourth coincidence position between the fourth conversion position and the target voxel position, and fuse the image features corresponding to the fourth coincidence position in the initial image features into the voxel features corresponding to the fourth coincidence position in the initial voxel features, to obtain the target voxel features; a sketch of a simple voxelization step follows.
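For the voxelization step referenced above, here is a minimal sketch using a common simple scheme: each 3-D point is binned into a grid cell and the points in each occupied cell are averaged. The voxel size and the centroid-as-feature choice are assumptions for illustration.

```python
import numpy as np

def voxelize(points, voxel_size=0.2):
    """Assign each 3-D point to a voxel grid cell and average the points
    in each occupied cell."""
    indices = np.floor(points / voxel_size).astype(np.int64)  # cell index per point
    voxels = {}
    for idx, pt in zip(map(tuple, indices), points):
        voxels.setdefault(idx, []).append(pt)
    # One centroid (a crude initial voxel feature) per occupied cell.
    return {idx: np.mean(pts, axis=0) for idx, pts in voxels.items()}

cloud = np.random.rand(1000, 3) * 5.0    # current scene point cloud
voxel_result = voxelize(cloud)
```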
  • the location determination module 510 includes:
  • the associated scene image acquisition unit is configured to acquire the associated scene image corresponding to the current scene image and the associated scene point cloud corresponding to the current scene point cloud.
  • the associated image feature acquisition unit is configured to acquire associated image features corresponding to the associated scene images and associated point cloud features corresponding to the associated scene point clouds.
  • the target image time sequence feature obtaining unit is used to perform feature fusion on the target image feature and the associated image feature according to the time sequence to obtain the target image time sequence feature.
  • the target point cloud time sequence feature obtaining unit is used to perform feature fusion on the target point cloud feature and the associated point cloud feature according to the time sequence to obtain the target point cloud time sequence feature.
  • the position determination unit is used for determining the position of the object corresponding to the scene object based on the time sequence feature of the target image and the time sequence feature of the target point cloud.
  • the location determination module 510 includes:
  • the target combined position obtaining unit is used to determine the combined position between the target image feature and the target point cloud feature to obtain the target combined position.
  • the object position obtaining unit is used for taking the target combined position as the object position corresponding to the scene object.
  • Each module in the above-mentioned object recognition device may be implemented in whole or in part by software, hardware and combinations thereof.
  • the above modules can be embedded in or independent of the processor in the computer device in the form of hardware, or stored in the memory in the computer device in the form of software, so that the processor can call and execute the operations corresponding to the above modules.
  • a computer device is provided, and the computer device may be a server, and its internal structure diagram may be as shown in FIG. 6 .
  • the computer device includes a processor, memory, a network interface, and a database connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions and a database.
  • the internal memory provides an environment for the execution of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer device is used to store data such as the current scene image, the current scene point cloud, point cloud features, image features, and voxel features.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instructions when executed by a processor, implement a method of object recognition.
  • FIG. 6 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer equipment to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • a computer device includes a memory and one or more processors, where computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processors, cause the one or more processors to perform the steps of the above object identification method.
  • One or more computer storage media storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the above object identification method.
  • the computer storage medium is a readable storage medium, and the readable storage medium may be non-volatile or volatile.
  • Nonvolatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in various forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

Abstract

An object identification method, comprising: acquiring a current scene image and a current scene point cloud which correspond to a target movement object (S202); performing image feature extraction on the current scene image to obtain initial image features, and performing point cloud feature extraction on the current scene point cloud to obtain initial point cloud features (S204); acquiring a target image position corresponding to the current scene image, and performing fusion processing on the initial image features on the basis of the point cloud feature, corresponding to the target image position, among the initial point cloud features, so as to obtain a target image feature (S206); acquiring a target point cloud position corresponding to the current scene point cloud, and performing fusion processing on the initial point cloud features on the basis of the image feature, corresponding to the target point cloud position, among the initial image features, so as to obtain a target point cloud feature (S208); determining, on the basis of the target image feature and the target point cloud feature, an object position corresponding to a scene object (S210); and controlling, on the basis of the position corresponding to the scene object, the target movement object to move (S212).

Description

Object recognition method, apparatus, computer equipment and storage medium

Technical Field

The present application relates to an object recognition method, apparatus, computer equipment and storage medium.

Background Art
With the development of artificial intelligence, self-driving cars have appeared. Self-driving cars are intelligent cars that realize unmanned driving through computer systems; relying on artificial intelligence, visual computing, radar, monitoring devices and the Global Positioning System working in concert, the computer system automatically and safely controls the driving of the car without active human operation. While an autonomous vehicle is driving, it is necessary to detect obstacles along the way and avoid them in time.
However, the inventor realizes that current methods for identifying obstacles cannot always identify obstacles accurately, resulting in a low obstacle avoidance capability of the autonomous vehicle and thus low safety of the autonomous vehicle.
SUMMARY OF THE INVENTION

According to various embodiments disclosed in the present application, an object recognition method, apparatus, computer device and storage medium are provided.
An object recognition method includes:

acquiring the current scene image and the current scene point cloud corresponding to a target moving object;

performing image feature extraction on the current scene image to obtain initial image features, and performing point cloud feature extraction on the current scene point cloud to obtain initial point cloud features;

acquiring the target image position corresponding to the current scene image, and performing fusion processing on the initial image features based on the point cloud features corresponding to the target image position among the initial point cloud features, to obtain target image features;

acquiring the target point cloud position corresponding to the current scene point cloud, and performing fusion processing on the initial point cloud features based on the image features corresponding to the target point cloud position among the initial image features, to obtain target point cloud features;

determining the object position corresponding to a scene object based on the target image features and the target point cloud features; and

controlling the target moving object to move based on the position corresponding to the scene object.
An object recognition device includes:

a current scene image acquisition module, used to acquire the current scene image and the current scene point cloud corresponding to a target moving object;

an initial point cloud feature obtaining module, used to perform image feature extraction on the current scene image to obtain initial image features, and perform point cloud feature extraction on the current scene point cloud to obtain initial point cloud features;

a target image feature obtaining module, used to acquire the target image position corresponding to the current scene image, and perform fusion processing on the initial image features based on the point cloud features corresponding to the target image position among the initial point cloud features, to obtain target image features;

a target point cloud feature obtaining module, used to acquire the target point cloud position corresponding to the current scene point cloud, and perform fusion processing on the initial point cloud features based on the image features corresponding to the target point cloud position among the initial image features, to obtain target point cloud features;

a position determination module, used to determine the object position corresponding to a scene object based on the target image features and the target point cloud features; and

a motion control module, used to control the target moving object to move based on the position corresponding to the scene object.
A computer device includes a memory and one or more processors, where computer-readable instructions are stored in the memory; when executed by the processors, the computer-readable instructions cause the one or more processors to perform the following steps:

acquiring the current scene image and the current scene point cloud corresponding to a target moving object;

performing image feature extraction on the current scene image to obtain initial image features, and performing point cloud feature extraction on the current scene point cloud to obtain initial point cloud features;

acquiring the target image position corresponding to the current scene image, and performing fusion processing on the initial image features based on the point cloud features corresponding to the target image position among the initial point cloud features, to obtain target image features;

acquiring the target point cloud position corresponding to the current scene point cloud, and performing fusion processing on the initial point cloud features based on the image features corresponding to the target point cloud position among the initial image features, to obtain target point cloud features;

determining the object position corresponding to a scene object based on the target image features and the target point cloud features; and

controlling the target moving object to move based on the position corresponding to the scene object.
One or more computer storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:

acquiring the current scene image and the current scene point cloud corresponding to a target moving object;

performing image feature extraction on the current scene image to obtain initial image features, and performing point cloud feature extraction on the current scene point cloud to obtain initial point cloud features;

acquiring the target image position corresponding to the current scene image, and performing fusion processing on the initial image features based on the point cloud features corresponding to the target image position among the initial point cloud features, to obtain target image features;

acquiring the target point cloud position corresponding to the current scene point cloud, and performing fusion processing on the initial point cloud features based on the image features corresponding to the target point cloud position among the initial image features, to obtain target point cloud features;

determining the object position corresponding to a scene object based on the target image features and the target point cloud features; and

controlling the target moving object to move based on the position corresponding to the scene object.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below. Other features and advantages of the present application will become apparent from the description, drawings, and claims.
Description of Drawings

In order to illustrate the technical solutions in the embodiments of the present application more clearly, the following briefly introduces the drawings required in the embodiments. Obviously, the drawings in the following description are only some embodiments of the present application; for those of ordinary skill in the art, other drawings can also be obtained from these drawings without any creative effort.
FIG. 1 is an application scenario diagram of an object recognition method according to one or more embodiments;

FIG. 2 is a schematic flowchart of an object recognition method according to one or more embodiments;

FIG. 3 is a schematic flowchart of the steps for obtaining target point cloud features according to one or more embodiments;

FIG. 4 is a schematic diagram of an object recognition system according to one or more embodiments;

FIG. 5 is a block diagram of an object recognition apparatus according to one or more embodiments;

FIG. 6 is a block diagram of a computer device according to one or more embodiments.
Detailed Description

In order to make the technical solutions and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present application, not to limit it.
The object recognition method provided by this application can be applied to the application environment shown in FIG. 1. The application environment includes a terminal 102 and a server 104; a point cloud collection device and an image collection device are installed in the terminal 102. The point cloud collection device is used to collect point cloud data, such as the current scene point cloud. The image collection device is used to collect images, such as the current scene image. The terminal 102 can transmit the collected current scene image and current scene point cloud to the server 104. The server 104 can acquire the current scene image and the current scene point cloud corresponding to the terminal 102, perform image feature extraction on the current scene image to obtain initial image features, perform point cloud feature extraction on the current scene point cloud to obtain initial point cloud features, acquire the target image position corresponding to the current scene image, perform fusion (Fusion) processing on the initial image features based on the point cloud features corresponding to the target image position among the initial point cloud features to obtain target image features, acquire the target point cloud position corresponding to the current scene point cloud, perform fusion processing on the initial point cloud features based on the image features corresponding to the target point cloud position among the initial image features to obtain target point cloud features, determine the object position corresponding to the scene object based on the target image features and the target point cloud features, and control the terminal 102 to move based on the position corresponding to the scene object. The terminal 102 may be, but is not limited to, a self-driving car or a mobile robot. The server 104 can be implemented as an independent server or as a server cluster composed of multiple servers. The point cloud collection device can be any device that can collect point cloud data, such as, but not limited to, a lidar. The image collection device can be any device that can collect image data, such as, but not limited to, a camera.
It can be understood that the above application scenario is only an example and does not limit the object recognition method provided by the embodiments of the present application; the method can also be applied in other application scenarios. For example, the above object recognition method may be performed by the terminal 102.
In some embodiments, as shown in FIG. 2, an object recognition method is provided. The method is described by taking its application to the server 104 in FIG. 1 as an example, and includes the following steps:

S202: acquire the current scene image and the current scene point cloud corresponding to the target moving object.
Specifically, a moving object refers to an object in a state of motion. It can be a living object, such as, but not limited to, a person or an animal, or an inanimate object, such as, but not limited to, a vehicle or a drone, for example an autonomous vehicle. The target moving object refers to the moving object whose motion is to be controlled according to the scene image and the scene point cloud. The target moving object is, for example, the terminal 102 in FIG. 1.
The scene image refers to the image corresponding to the scene where the moving object is located. The scene image can reflect the environment of the moving object; for example, it may include one or more of the lanes, vehicles, pedestrians or obstacles in the environment. The scene image may be collected by an image collection device built into the moving object, for example a camera installed in an autonomous vehicle, or by an image collection device external to and associated with the moving object, for example a device connected to the moving object through a cable or a network, such as a camera on the road where the autonomous vehicle is located that is connected to the vehicle through a network. The current scene image refers to the image corresponding to the current scene where the target moving object is located at the current time. The current scene refers to the scene where the target moving object is located at the current time. The external image collection device can transmit the collected scene image to the moving object.
A point cloud refers to a set of three-dimensional data points in a three-dimensional coordinate system, for example the set of three-dimensional data points corresponding to the surface of an object in a three-dimensional coordinate system; a point cloud can represent the shape of an object's outer surface. A three-dimensional data point refers to a point in three-dimensional space and includes three-dimensional coordinates, which may include, for example, an X coordinate, a Y coordinate and a Z coordinate. A three-dimensional data point may also include at least one of RGB color, grayscale value or time. The scene point cloud refers to the set of three-dimensional data points corresponding to the scene. A point cloud can be obtained by lidar scanning. A lidar is an active sensor: it emits a laser beam, and after the beam hits the surface of an object it is reflected; the reflected laser signal is collected to obtain the point cloud of the object.
The scene point cloud refers to the point cloud corresponding to the scene where the moving object is located. The scene point cloud may be collected by a point cloud collection device built into the moving object, for example obtained by scanning with a lidar installed in an autonomous vehicle, or by a point cloud collection device external to and associated with the moving object, for example a device connected to the moving object through a cable or a network, such as a lidar on the road where the autonomous vehicle is located that is connected to the vehicle through a network. The current scene point cloud refers to the point cloud corresponding to the current scene where the target moving object is located at the current time. The external point cloud collection device can transmit the scanned scene point cloud to the moving object.
In some embodiments, the target moving object can collect the current scene in real time through the image collection device to obtain the current scene image, and collect the current scene in real time through the point cloud collection device to obtain the current scene point cloud. The target moving object can send the collected current scene image and current scene point cloud to the server; the server can determine the positions of obstacles on the path of the target moving object according to the current scene image and the current scene point cloud, and can transmit the obstacle positions to the target moving object, so that the target moving object can avoid the obstacles while moving.
S204: perform image feature extraction on the current scene image to obtain initial image features, and perform point cloud feature extraction on the current scene point cloud to obtain initial point cloud features.
Specifically, an image feature (Image Feature) reflects the characteristics of an image, and a point cloud feature (Point Feature) reflects the characteristics of a point cloud. Image features have a strong ability to represent slender objects such as pedestrians. A point cloud feature can be represented in vector form and may also be called a point cloud feature vector, for example (a1, b1, c1); a point cloud feature may also be called a point feature. Point cloud features can represent the information of a point cloud losslessly. An image feature can likewise be represented in vector form and may also be called an image feature vector, for example (a2, b2, c2). The initial image features refer to the image features obtained by feature extraction from the current scene image. The initial point cloud features refer to the point cloud features obtained by feature extraction from the current scene point cloud.
In some embodiments, the server may obtain an object recognition model, which may include an image feature extraction layer and a point cloud feature extraction layer. The server may input the current scene image into the image feature extraction layer, which performs feature extraction on the current scene image, for example by convolution, to obtain image features. The server can obtain the initial image features from the image features output by the image feature extraction layer; for example, the output image features can be used directly as the initial image features. The server may input the current scene point cloud into the point cloud feature extraction layer, which performs feature extraction on the current scene point cloud, for example by convolution, to obtain point cloud features. The server can obtain the initial point cloud features from the point cloud features output by the point cloud feature extraction layer; for example, the output point cloud features can be used directly as the initial point cloud features.
In some embodiments, the image feature extraction layer and the point cloud feature extraction layer are obtained by joint training. Specifically, the server can input a scene image into the image feature extraction layer and a scene point cloud into the point cloud feature extraction layer, obtain the predicted image features output by the image feature extraction layer and the predicted point cloud features output by the point cloud feature extraction layer, and acquire the standard image features corresponding to the scene image (the standard image features refer to the real image features) and the standard point cloud features corresponding to the scene point cloud (the standard point cloud features refer to the real point cloud features). A first loss value is determined from the predicted image features, for example from the difference between the predicted image features and the standard image features. A second loss value is determined from the predicted point cloud features, for example from the difference between the predicted point cloud features and the standard point cloud features. A total loss value is determined from the first loss value and the second loss value; the total loss value may include both, and may for example be their sum. The server can use the total loss value to adjust the parameters of the image feature extraction layer and of the point cloud feature extraction layer, obtaining the trained image feature extraction layer and the trained point cloud feature extraction layer. A sketch of such joint training follows.
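A minimal PyTorch sketch of the joint training step described above. The stand-in layers (a single convolution and a single linear layer), the MSE loss, and all tensor shapes are assumptions for illustration; the real extraction layers would be full networks.

```python
import torch
import torch.nn as nn

# Stand-in extraction layers; real layers are deeper networks.
image_layer = nn.Conv2d(3, 16, kernel_size=3, padding=1)   # image feature extraction layer
point_layer = nn.Linear(3, 16)                             # point cloud feature extraction layer
optimizer = torch.optim.SGD(
    list(image_layer.parameters()) + list(point_layer.parameters()), lr=1e-3)
criterion = nn.MSELoss()

scene_image = torch.rand(1, 3, 64, 64)       # training scene image
scene_points = torch.rand(1024, 3)           # training scene point cloud
std_image_feat = torch.rand(1, 16, 64, 64)   # standard (real) image features
std_point_feat = torch.rand(1024, 16)        # standard (real) point cloud features

pred_image_feat = image_layer(scene_image)
pred_point_feat = point_layer(scene_points)
loss1 = criterion(pred_image_feat, std_image_feat)   # first loss value
loss2 = criterion(pred_point_feat, std_point_feat)   # second loss value
total_loss = loss1 + loss2                           # total loss value
optimizer.zero_grad()
total_loss.backward()                                # adjusts both layers jointly
optimizer.step()
```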
S206: acquire the target image position corresponding to the current scene image, and based on the point cloud features corresponding to the target image position among the initial point cloud features, perform fusion processing on the initial image features to obtain the target image features.
Specifically, the image position refers to the position of the image in the image coordinate system, and may include the position of each pixel of the image in the image coordinate system. The image coordinate system refers to the coordinate system adopted by the images collected by the image collection device; the coordinates of each pixel in the image can be obtained from it. The target image position refers to the position of each pixel of the current scene image in the image coordinate system. The image position may be determined according to the parameters of the image collection device; these may, for example, be camera parameters, including the camera's extrinsic and intrinsic parameters. The image coordinate system is a two-dimensional coordinate system, and coordinates in it include an abscissa and an ordinate.
The point cloud features corresponding to the target image position refer to the point cloud features, among the initial point cloud features, at the position in the point cloud coordinate system corresponding to the target image position. The position in the point cloud coordinate system corresponding to the target image position may or may not overlap with the positions corresponding to the initial point cloud features. The server can fuse the point cloud features corresponding to the overlapping positions with the initial image features to obtain the target image features. For example, if the target image position is position A, the corresponding position in the point cloud coordinate system is position B, the position of the initial point cloud features in the point cloud coordinate system is position C, and the overlap of position C and position B is position D, then the point cloud features corresponding to position D can be fused into the initial image features.
Fusion processing refers to establishing an association between different features at the same position in the same coordinate system, for example establishing an association between the image feature a and the point cloud feature b corresponding to position A. Fusion processing may also mean obtaining, from different features at the same position in the same coordinate system, a fused feature containing those features, for example obtaining a fused feature including a and b from the image feature a and the point cloud feature b corresponding to position A. A fused feature can be represented in vector form.
In some embodiments, the server may obtain the position in the point cloud coordinate system corresponding to the target image position, and fuse the initial image features with the point cloud features, among the initial point cloud features, at that position, to obtain the target image features. Specifically, the object recognition model may also include an image spatial fusion layer; the server may input the initial point cloud features and the initial image features into the image spatial fusion layer, which can determine the coincident positions between the positions of the initial point cloud features and the positions of the initial image features, extract the point cloud features at the coincident positions from the initial point cloud features, and fuse them into the initial image features to obtain the target image features. A sketch of this fusion follows.
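A minimal sketch of fusing projected point cloud features into an image feature map at the coincident pixel positions. Channel concatenation is one plausible fusion scheme among several the text allows (association or combined features); the shapes and the zero-fill for pixels without a projected point are illustrative assumptions.

```python
import numpy as np

def image_spatial_fusion(image_feat, point_feats, projected_uv):
    """Fuse projected point cloud features into the image feature map.

    image_feat: (C_img, H, W) initial image features.
    point_feats: (N, C_pt) initial point cloud features.
    projected_uv: (N, 2) integer pixel coordinates of each point after
    projection into the image coordinate system.
    Returns a (C_img + C_pt, H, W) fused map; pixels with no projected
    point keep zeros in the appended channels.
    """
    c_img, h, w = image_feat.shape
    extra = np.zeros((point_feats.shape[1], h, w), dtype=image_feat.dtype)
    for (u, v), f in zip(projected_uv, point_feats):
        if 0 <= v < h and 0 <= u < w:      # keep only coincident positions
            extra[:, v, u] = f
    return np.concatenate([image_feat, extra], axis=0)

fused = image_spatial_fusion(np.random.rand(8, 32, 32),
                             np.random.rand(50, 4),
                             np.random.randint(0, 32, size=(50, 2)))
```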
S208: acquire the target point cloud position corresponding to the current scene point cloud, and based on the image features corresponding to the target point cloud position among the initial image features, perform fusion processing on the initial point cloud features to obtain the target point cloud features.
Specifically, the point cloud position refers to the position of the point cloud in the point cloud coordinate system, and may include the position of each three-dimensional data point of the point cloud in the point cloud coordinate system. The coordinates of each three-dimensional data point in the point cloud can be obtained from the point cloud coordinate system. The target point cloud position refers to the point cloud position of each three-dimensional data point in the current scene point cloud. The point cloud position may be determined according to the parameters of the point cloud collection device, for example the parameters of the lidar. The point cloud coordinate system is a three-dimensional coordinate system, and coordinates in it may include an X coordinate, a Y coordinate and a Z coordinate. Of course, the point cloud coordinate system can also be another type of three-dimensional coordinate system, which is not limited here.
The image features corresponding to the target point cloud position refer to the image features, among the initial image features, at the position in the image coordinate system corresponding to the target point cloud position. The position in the image coordinate system corresponding to the target point cloud position may or may not overlap with the positions corresponding to the initial image features. The server can fuse the image features corresponding to the overlapping positions with the initial point cloud features to obtain the target point cloud features.
In some embodiments, the server may obtain the position in the image coordinate system corresponding to the target point cloud position, and fuse the initial point cloud features with the image features, among the initial image features, at that position, to obtain the target point cloud features. Specifically, the object recognition model may also include a point cloud spatial fusion layer; the server may input the initial point cloud features and the initial image features into the point cloud spatial fusion layer, which can determine the coincident positions between the positions of the initial point cloud features and the positions of the initial image features, extract the image features at the coincident positions from the initial image features, and fuse them into the initial point cloud features to obtain the target point cloud features.
S210: determine the object position corresponding to the scene object based on the target image features and the target point cloud features.
Specifically, a scene object refers to an object in the scene where the target moving object is located. A scene object can be a living object, such as a person or an animal, or an inanimate object, such as a vehicle, a tree or a stone. There can be multiple scene objects. The object position may include at least one of the position of the scene object in the current scene image or its position in the current scene point cloud. The scene objects in the current scene image and those in the current scene point cloud may be the same or may differ.
In some embodiments, the server may compute the position of each scene object from the position of the target image features and the position of the target point cloud features.
In some embodiments, the server may perform time series fusion of the target image features obtained from different video frames to obtain fused target image features, and perform image task learning based on them. Time series fusion refers to concatenating image features of different frames, concatenating point cloud features of different frames, or concatenating voxel features of different frames. The server may likewise perform time series fusion of the target point cloud features obtained from different scene point clouds to obtain fused target point cloud features, and perform point cloud task learning based on them. The server may further fuse the fused target image features with the fused target point cloud features to obtain secondarily fused target image features and secondarily fused target point cloud features, using the former for image task learning and the latter for point cloud task learning.
S212: control the target moving object to move based on the position corresponding to the scene object.
Specifically, the server can transmit the position corresponding to the scene object to the target moving object; the target moving object can determine a motion route that avoids the scene object according to that position and move along the route, thereby avoiding the scene object and ensuring safe motion.
In the above object recognition method, the current scene image and the current scene point cloud corresponding to the target moving object are acquired; image feature extraction is performed on the current scene image to obtain initial image features, and point cloud feature extraction is performed on the current scene point cloud to obtain initial point cloud features; the target image position corresponding to the current scene image is acquired, and the initial image features are fused based on the point cloud features corresponding to the target image position among the initial point cloud features to obtain the target image features; the target point cloud position corresponding to the current scene point cloud is acquired, and the initial point cloud features are fused based on the image features corresponding to the target point cloud position among the initial image features to obtain the target point cloud features; the object position corresponding to the scene object is determined based on the target image features and the target point cloud features; and the target moving object is controlled to move based on the position corresponding to the scene object. The position of the scene object is thus obtained accurately, so that the target moving object can move while avoiding the scene object, which improves the safety of the target moving object during motion.
In some embodiments, acquiring the target image position corresponding to the current scene image and fusing the initial image features based on the point cloud features corresponding to the target image position among the initial point cloud features to obtain the target image features includes: converting the target point cloud position into a position in the image coordinate system according to the coordinate transformation relationship between the point cloud coordinate system and the image coordinate system, to obtain a first converted position; and acquiring a first coincident position between the first converted position and the target image position, and fusing the point cloud features corresponding to the first coincident position among the initial point cloud features into the image features corresponding to the first coincident position among the initial image features, to obtain the target image features.
Specifically, the coordinate transformation relationship between the point cloud coordinate system and the image coordinate system refers to the transformation that converts coordinates in the point cloud coordinate system into coordinates in the image coordinate system. The object corresponding to the coordinates before transformation in the point cloud coordinate system is the same as the object corresponding to the coordinates after transformation in the image coordinate system. In the following description, the coordinate transformation relationship between the point cloud coordinate system and the image coordinate system is referred to as the first transformation relationship. Through the first transformation relationship, the coordinates in the image coordinate system of the position represented by coordinates in the point cloud coordinate system can be determined; that is, the image position corresponding to the target point cloud position in the image coordinate system can be determined through the first transformation relationship. For example, coordinates (x1, y1, z1) in the point cloud coordinate system can be converted into coordinates (x2, y2) in the image coordinate system through the first transformation relationship. Converting coordinates in one coordinate system into coordinates in another coordinate system may be referred to as physical space projection.
The first converted position refers to the position corresponding to the target point cloud position in the image coordinate system, and is a position in a two-dimensional coordinate system. The first converted position may include the two-dimensional coordinates, in the image coordinate system, of the three-dimensional coordinates of all or some of the three-dimensional data points corresponding to the target point cloud position. The first coincident position refers to a position where the first converted position coincides with the target image position. The point cloud features corresponding to the first coincident position refer to the point cloud features corresponding to the position of the first coincident position in the point cloud coordinate system. For example, if the first converted position includes (x1, y1), (x2, y2), and (x3, y3), and the target image position includes (x2, y2), (x3, y3), and (x4, y4), then the first coincident position includes (x2, y2) and (x3, y3). If the position of (x2, y2) in the point cloud coordinate system is (x1, y1, z1) and the position of (x3, y3) in the point cloud coordinate system is (x2, y2, z2), then the point cloud features corresponding to the first coincident position include the point cloud features corresponding to (x1, y1, z1) and the point cloud features corresponding to (x2, y2, z2).
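One conventional way to realize the first transformation relationship is a pinhole-camera projection. The following Python sketch is a minimal, non-limiting illustration under that assumption; the intrinsic matrix K and the extrinsics R, t are illustrative values, not parameters from the embodiment:

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 point cloud coordinates into the image plane.

    K: 3x3 camera intrinsic matrix; R (3x3) and t (3,) transform points
    from the point cloud frame into the camera frame.  Returns Nx2 pixel
    coordinates, i.e. the "first converted positions".
    """
    cam = points_3d @ R.T + t          # to camera coordinates
    uvw = cam @ K.T                    # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]    # perspective division

# Illustrative camera parameters and two lidar points.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
pts = np.array([[1.0, 2.0, 10.0], [0.5, -1.0, 8.0]])
pixels = project_points(pts, K, R, t)  # compare against target image positions
```

Comparing the returned pixel coordinates against the target image position then yields the coincident positions described above.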
In some embodiments, the server may splice the point cloud features corresponding to the first coincident position with the image features corresponding to the first coincident position to obtain the target image features. For example, the server may append the point cloud features corresponding to the first coincident position after the image features corresponding to the first coincident position to obtain the target image features. If the point cloud feature corresponding to the first coincident position is represented by a vector A and the image feature corresponding to the first coincident position is represented by a vector B, the server may splice vector B with vector A to obtain a spliced vector, and obtain the target image feature from the spliced vector; for example, the spliced vector may be used directly as the target image feature, or the spliced vector may be further processed to obtain the target image feature.
In some embodiments, the server may convert the target image position into a position in the point cloud coordinate system according to the coordinate transformation relationship between the image coordinate system and the point cloud coordinate system to obtain the point cloud position corresponding to the target image position, extract the corresponding point cloud features from the initial point cloud features according to the point cloud position, and fuse them into the initial image features to obtain the target image features.
In the above embodiment, the target point cloud position is converted into a position in the image coordinate system according to the coordinate transformation relationship between the point cloud coordinate system and the image coordinate system to obtain the first converted position; the first coincident position between the first converted position and the target image position is acquired; and the point cloud features corresponding to the first coincident position among the initial point cloud features are fused into the image features corresponding to the first coincident position among the initial image features, to obtain the target image features. The target image features therefore include both image features and point cloud features, which increases the richness of the features in the target image features and improves their representation capability.
In some embodiments, acquiring the target point cloud position corresponding to the current scene point cloud and fusing the initial point cloud features based on the image features corresponding to the target point cloud position among the initial image features to obtain the target point cloud features includes: converting the target image position into a position in the point cloud coordinate system according to the coordinate transformation relationship between the image coordinate system and the point cloud coordinate system, to obtain a second converted position; and acquiring a second coincident position between the second converted position and the target point cloud position, and fusing the image features corresponding to the second coincident position among the initial image features into the point cloud features corresponding to the second coincident position among the initial point cloud features, to obtain the target point cloud features.
Specifically, the coordinate transformation relationship between the image coordinate system and the point cloud coordinate system refers to the transformation that converts coordinates in the image coordinate system into coordinates in the point cloud coordinate system. The object corresponding to the coordinates before transformation in the image coordinate system is the same as the object corresponding to the coordinates after transformation in the point cloud coordinate system. In the following description, the coordinate transformation relationship between the image coordinate system and the point cloud coordinate system is referred to as the second transformation relationship. Through the second transformation relationship, the coordinates in the point cloud coordinate system of the position represented by coordinates in the image coordinate system can be determined.
The second converted position refers to the position corresponding to the target image position in the point cloud coordinate system, and is a position in a three-dimensional coordinate system. The second converted position may include the three-dimensional coordinates, in the point cloud coordinate system, of all or some of the two-dimensional coordinates corresponding to the target image position. The second coincident position refers to a position where the second converted position coincides with the target point cloud position. The image features corresponding to the second coincident position refer to the image features corresponding to the two-dimensional coordinates, in the image coordinate system, of the second coincident position. The target point cloud features are obtained by fusing the image features corresponding to the second coincident position into the point cloud features corresponding to the second coincident position among the initial point cloud features.
In some embodiments, the server may perform feature fusion between the image features corresponding to the second coincident position and the point cloud features corresponding to the second coincident position to obtain the target point cloud features. Feature fusion may include one or more of arithmetic operations on features, combination, or splicing. Arithmetic operations may include one or more of addition, subtraction, multiplication, or division. For example, the server may append the image features corresponding to the second coincident position after the point cloud features corresponding to the second coincident position to obtain the target point cloud features. If the point cloud feature corresponding to the second coincident position is represented by a vector C and the image feature corresponding to the second coincident position is represented by a vector D, the server may splice vector C with vector D to obtain a spliced vector, and obtain the target point cloud feature from the spliced vector; for example, the spliced vector may be used directly as the target point cloud feature, or may be further processed to obtain the target point cloud feature.
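As a non-limiting sketch of the fusion operations listed above (splicing and arithmetic operations), the following Python function fuses a point cloud feature vector with an image feature vector; the function name, mode flag, and vector values are illustrative assumptions:

```python
import numpy as np

def fuse_features(point_feat, image_feat, mode="concat"):
    """Fuse a point cloud feature vector with an image feature vector.

    "concat" appends one vector to the other (splicing); "add" assumes
    equal lengths and sums elementwise (an arithmetic operation).
    """
    if mode == "concat":
        return np.concatenate([point_feat, image_feat])
    if mode == "add":
        return point_feat + image_feat
    raise ValueError(f"unknown fusion mode: {mode}")

C = np.array([0.2, 0.7, 0.1])  # point cloud feature at the coincident position
D = np.array([0.5, 0.3, 0.9])  # image feature at the same position
fused_concat = fuse_features(C, D, "concat")  # length 6
fused_sum = fuse_features(C, D, "add")        # length 3
```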
In some embodiments, the server may convert the target point cloud position into a position in the image coordinate system according to the coordinate transformation relationship between the point cloud coordinate system and the image coordinate system to obtain the image position corresponding to the target point cloud position, extract the corresponding image features from the initial image features according to the image position, and fuse them into the initial point cloud features to obtain the target point cloud features. For example, the image features at the same position as the image position may be extracted from the initial image features, or the image features at positions whose difference from the image position is smaller than a position difference threshold may be extracted from the initial image features, and fused into the initial point cloud features to obtain the target point cloud features. The position difference threshold may be set as required, or may be preset.
In the above embodiment, the target image position is converted into a position in the point cloud coordinate system according to the coordinate transformation relationship between the image coordinate system and the point cloud coordinate system to obtain the second converted position; the second coincident position between the second converted position and the target point cloud position is acquired; and the image features corresponding to the second coincident position among the initial image features are fused into the point cloud features corresponding to the second coincident position among the initial point cloud features, to obtain the target point cloud features. The target point cloud features therefore include both image features and point cloud features, which increases the richness of the features in the target point cloud features and improves their representation capability.
In some embodiments, as shown in FIG. 3, acquiring the second coincident position between the second converted position and the target point cloud position, and fusing the image features corresponding to the second coincident position among the initial image features into the point cloud features corresponding to the second coincident position among the initial point cloud features to obtain the target point cloud features includes:
S302: Voxelize the current scene point cloud to obtain a voxelization result.
Specifically, voxel is short for volume pixel (Volume Pixel). Voxelization refers to dividing a point cloud into multiple voxels according to a given voxel size. For example, the dimensions of each voxel in the X-, Y-, and Z-axis directions may be w, h, and e, respectively. The voxels obtained by division include empty voxels and non-empty voxels: an empty voxel contains no points of the point cloud, and a non-empty voxel contains points of the point cloud. The voxelization result may include at least one of the number of voxels obtained after voxelization, the position information of the voxels, or the size of the voxels.
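A minimal, non-limiting voxelization sketch in Python follows; it assigns each point to a voxel index given per-axis voxel sizes (w, h, e) and keeps only the non-empty voxels. All names and sizes are illustrative assumptions:

```python
import numpy as np

def voxelize(points, voxel_size):
    """Divide an Nx3 point cloud into voxels of per-axis size (w, h, e).

    Returns a dict mapping voxel index (ix, iy, iz) -> array of the
    points falling inside that voxel; only non-empty voxels appear.
    """
    indices = np.floor(points / voxel_size).astype(np.int64)
    voxels = {}
    for idx, pt in zip(map(tuple, indices), points):
        voxels.setdefault(idx, []).append(pt)
    return {k: np.asarray(v) for k, v in voxels.items()}

points = np.random.rand(1000, 3) * 10.0          # synthetic scene point cloud
voxels = voxelize(points, np.array([0.5, 0.5, 0.5]))
print(len(voxels), "non-empty voxels")
```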
S304: Perform voxel feature extraction according to the voxelization result to obtain initial voxel features.
Specifically, a voxel feature (Voxel Feature) is a feature used to represent a voxel. Voxel features can accelerate the convergence of a network model and reduce its complexity. According to the number of points contained inside each voxel in the voxelization result, the server may sample the same number of points from inside the voxel to obtain the sampling points corresponding to the voxel, and perform feature extraction according to those sampling points to obtain the initial voxel features corresponding to the voxel. For example, the center coordinates of the point cloud formed by the sampling points in each voxel may be computed, the points in the voxel may be center-normalized with respect to those center coordinates to obtain a data matrix, and the data matrix may be input into a trained voxel feature recognition model to obtain the initial voxel features. A voxel feature recognition model refers to a model that extracts voxel features.
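The sampling and center-normalization step described above might be sketched as follows; this is a non-limiting illustration in which the sample count, seed, and function name are assumptions (the resulting matrix would then be fed to a trained voxel feature recognition model):

```python
import numpy as np

def voxel_input(points, num_samples=32):
    """Build a per-voxel data matrix by sampling and center-normalizing.

    Samples a fixed number of points from one voxel (with replacement
    when the voxel holds fewer points), then subtracts the centroid of
    the sampled points.  Returns an array of shape (num_samples, 3).
    """
    rng = np.random.default_rng(0)
    idx = rng.choice(len(points), size=num_samples,
                     replace=len(points) < num_samples)
    sampled = points[idx]
    centroid = sampled.mean(axis=0)
    return sampled - centroid

voxel_points = np.random.rand(10, 3)   # points falling inside one voxel
matrix = voxel_input(voxel_points)     # input to the voxel feature model
```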
In some embodiments, the object recognition model further includes a voxel feature extraction layer, which may be obtained through joint training with the image feature extraction layer and the point cloud feature extraction layer. The server may input the scene point cloud into the voxel feature extraction layer to obtain the voxel features output by the voxel feature extraction layer.
S306: Acquire the second coincident position between the second converted position and the target point cloud position, and fuse the image features corresponding to the second coincident position among the initial image features into the point cloud features corresponding to the second coincident position among the initial point cloud features, to obtain intermediate point cloud features.
Specifically, the intermediate point cloud features are obtained by fusing the image features corresponding to the second coincident position into the point cloud features corresponding to the second coincident position among the initial point cloud features.
S308: Acquire the target voxel position corresponding to the current scene point cloud, and convert the target voxel position into a position in the point cloud coordinate system according to the coordinate transformation relationship between the voxel coordinate system and the point cloud coordinate system, to obtain a third converted position.
Specifically, a voxel position refers to the position of a voxel in the voxel coordinate system. The target voxel position refers to the position, in the voxel coordinate system, of the voxels corresponding to the current scene point cloud, and may include the positions of each of those voxels in the voxel coordinate system. The coordinates of a voxel can be obtained from the voxel coordinate system. The coordinate transformation relationship between the voxel coordinate system and the point cloud coordinate system refers to the transformation that converts coordinates in the voxel coordinate system into coordinates in the point cloud coordinate system. The voxel coordinate system is a three-dimensional coordinate system. In the following description, the coordinate transformation relationship between the voxel coordinate system and the point cloud coordinate system is referred to as the third transformation relationship.
The third converted position refers to the position corresponding to the target voxel position in the point cloud coordinate system, and is a position in the point cloud coordinate system.
S310: Acquire a third coincident position between the third converted position and the target voxel position, and fuse the voxel features corresponding to the third coincident position among the initial voxel features into the point cloud features corresponding to the third converted position among the intermediate point cloud features, to obtain the target point cloud features.
Specifically, the third coincident position refers to the position where the third converted position coincides with the target voxel position. The voxel features corresponding to the third coincident position refer to the voxel features at the position, in the voxel coordinate system, corresponding to the third coincident position. The server may perform feature fusion between the voxel features corresponding to the third coincident position among the initial voxel features and the point cloud features corresponding to the third converted position among the intermediate point cloud features, to obtain the target point cloud features.
In the above embodiment, the current scene point cloud is voxelized to obtain a voxelization result; voxel feature extraction is performed according to the voxelization result to obtain initial voxel features; the second coincident position between the second converted position and the target point cloud position is acquired, and the image features corresponding to the second coincident position among the initial image features are fused into the point cloud features corresponding to the second coincident position among the initial point cloud features, to obtain intermediate point cloud features; the target voxel position corresponding to the current scene point cloud is acquired and converted into a position in the point cloud coordinate system according to the coordinate transformation relationship between the voxel coordinate system and the point cloud coordinate system, to obtain a third converted position; and the third coincident position between the third converted position and the target voxel position is acquired, and the voxel features corresponding to the third coincident position among the initial voxel features are fused into the point cloud features corresponding to the third converted position among the intermediate point cloud features, to obtain the target point cloud features. The intermediate point cloud features thus include point cloud features and image features, so that the target point cloud features include image features, point cloud features, and voxel features, which increases the richness of the features in the target point cloud features and improves their representation capability. Combining the ease of learning offered by voxel features with the lossless information retention of point cloud features achieves complementary advantages.
In some embodiments, the method further includes: voxelizing the current scene point cloud to obtain a voxelization result; performing voxel feature extraction according to the voxelization result to obtain initial voxel features; acquiring the target voxel position corresponding to the current scene point cloud, and converting the target image position into a position in the voxel coordinate system according to the coordinate transformation relationship between the image coordinate system and the voxel coordinate system, to obtain a fourth converted position; and acquiring a fourth coincident position between the fourth converted position and the voxel position, and fusing the image features corresponding to the fourth coincident position among the initial image features into the voxel features corresponding to the fourth coincident position among the initial voxel features, to obtain target voxel features.
Specifically, the coordinate transformation relationship between the image coordinate system and the voxel coordinate system refers to the transformation that converts coordinates in the image coordinate system into coordinates in the voxel coordinate system. The object corresponding to the coordinates before transformation in the image coordinate system is the same as the object corresponding to the coordinates after transformation in the voxel coordinate system. In the following description, the coordinate transformation relationship between the image coordinate system and the voxel coordinate system is referred to as the fourth transformation relationship. Through the fourth transformation relationship, the coordinates in the voxel coordinate system of the position represented by coordinates in the image coordinate system can be determined.
The fourth converted position refers to the position corresponding to the target image position in the voxel coordinate system, and is a position in a three-dimensional coordinate system. The fourth converted position may include the three-dimensional coordinates, in the voxel coordinate system, of all or some of the two-dimensional coordinates corresponding to the target image position. The fourth coincident position refers to a position where the fourth converted position coincides with the target voxel position. The image features corresponding to the fourth coincident position refer to the image features corresponding to the two-dimensional coordinates, in the image coordinate system, of the fourth coincident position. The target voxel features are obtained by fusing the image features corresponding to the fourth coincident position into the voxel features corresponding to the fourth coincident position among the initial voxel features.
In some embodiments, the server may perform feature fusion between the image features corresponding to the fourth coincident position and the voxel features corresponding to the fourth coincident position, to obtain the target voxel features.
In some embodiments, the server may convert the target voxel position into a position in the image coordinate system according to the coordinate transformation relationship between the voxel coordinate system and the image coordinate system to obtain the image position corresponding to the target voxel position, extract the corresponding image features from the initial image features according to the image position, and fuse them into the initial voxel features to obtain the target voxel features. For example, the center position of a voxel may be projected into the image coordinate system to obtain a center image position, and the image features at positions whose difference from the center image position is smaller than a difference threshold may be extracted from the initial image features and fused into the initial voxel features to obtain the target voxel features. The difference threshold may be set as required, or may be preset.
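As a non-limiting illustration of gathering image features around a projected voxel center within a difference threshold, the following sketch averages the image features whose pixel positions fall within an assumed threshold; all names and the threshold value are illustrative:

```python
import numpy as np

def gather_image_features(voxel_centers_2d, pixel_coords, image_feats,
                          threshold=2.0):
    """For each projected voxel center, average the image features whose
    pixel positions lie within `threshold` pixels of it.

    voxel_centers_2d: (V, 2) projected centers; pixel_coords: (P, 2)
    positions of the image features; image_feats: (P, C).  Centers with
    no nearby pixel receive a zero vector.
    """
    out = np.zeros((len(voxel_centers_2d), image_feats.shape[1]))
    for i, c in enumerate(voxel_centers_2d):
        dist = np.linalg.norm(pixel_coords - c, axis=1)
        near = dist < threshold
        if near.any():
            out[i] = image_feats[near].mean(axis=0)
    return out

centers = np.array([[10.0, 10.0], [50.0, 50.0]])
coords = np.array([[9.0, 10.0], [11.0, 9.5], [100.0, 100.0]])
feats = np.random.rand(3, 8)
gathered = gather_image_features(centers, coords, feats)  # (2, 8)
```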
In some embodiments, the object recognition model may further include a voxel spatial fusion layer. The server may input the image features and the voxel features into the voxel spatial fusion layer, which may determine the coincident positions between the positions of the image features and the positions of the voxel features, extract the image features at the coincident positions from the image features, and fuse them into the voxel features to obtain the target voxel features. The object recognition model may further include a point-voxel fusion layer. The server may input the target voxel features and the intermediate point cloud features into the point-voxel fusion layer, which may determine the coincident positions between the positions of the target voxel features and the positions of the intermediate point cloud features, extract the voxel features at the coincident positions from the target voxel features, and fuse them into the intermediate point cloud features to obtain the target point cloud features. The point-voxel fusion layer may also be referred to as a point cloud-voxel fusion layer.
In the above embodiment, the current scene point cloud is voxelized to obtain a voxelization result; voxel feature extraction is performed according to the voxelization result to obtain initial voxel features; the target voxel position corresponding to the current scene point cloud is acquired; the target image position is converted into a position in the voxel coordinate system according to the coordinate transformation relationship between the image coordinate system and the voxel coordinate system, to obtain a fourth converted position; the fourth coincident position between the fourth converted position and the voxel position is acquired; and the image features corresponding to the fourth coincident position among the initial image features are fused into the voxel features corresponding to the fourth coincident position among the initial voxel features, to obtain the target voxel features. The target voxel features thus include both voxel features and image features, which improves the representation capability of the target voxel features and the richness of the features.
In some embodiments, determining the object position corresponding to the scene object based on the target image features and the target point cloud features includes: acquiring an associated scene image corresponding to the current scene image and an associated scene point cloud corresponding to the current scene point cloud; acquiring associated image features corresponding to the associated scene image and associated point cloud features corresponding to the associated scene point cloud; performing feature fusion on the target image features and the associated image features in chronological order to obtain target image temporal features; performing feature fusion on the target point cloud features and the associated point cloud features in chronological order to obtain target point cloud temporal features; and determining the object position corresponding to the scene object based on the target image temporal features and the target point cloud temporal features.
Specifically, the associated scene image refers to an image associated with the current scene image. For example, the associated scene image may be a forward frame captured before the current moment, or a backward frame captured after it, by the image capture device that captured the current scene image. The forward frame may be used directly as the associated scene image, or coincident object detection may be performed on the current scene image and the forward frame: if a coincident detected object exists between the current scene image and the forward frame, the forward frame is used as the associated scene image of the current scene image. For example, if vehicle A exists in the current scene image and vehicle A also exists in the forward frame, the forward frame may be used as the associated scene image of the current scene image. The current scene image and the associated scene image may be different video frames of the same video, for example, different video frames of a video captured by the image capture device. The associated scene image may be a video frame captured before or after the current scene image. The associated image features may be obtained in the same way as the target image features.
The associated scene point cloud refers to a point cloud associated with the current scene point cloud. For example, the associated scene point cloud may be a scene point cloud collected before or after the current moment by the point cloud collection device that collected the current scene point cloud. The associated point cloud features may be obtained in the same way as the target point cloud features.
In some embodiments, the server may combine the target image features and the associated image features according to the chronological order of the associated scene image and the current scene image to obtain combined image features; in the combined image features, the earlier image features may be arranged before the later image features. The server may obtain the target image temporal features from the combined image features; for example, the combined image features may be used directly as the target image temporal features, or may be further processed to obtain the target image temporal features.
In some embodiments, the server may combine the target point cloud features and the associated point cloud features according to the chronological order of the associated scene point cloud and the current scene point cloud to obtain combined point cloud features; in the combined point cloud features, the earlier point cloud features may be arranged before the later point cloud features. The server may obtain the target point cloud temporal features from the combined point cloud features; for example, the combined point cloud features may be used directly as the target point cloud temporal features, or may be further processed to obtain the target point cloud temporal features.
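A minimal, non-limiting sketch of the chronological combination described above, assuming each feature carries a timestamp (the pairing scheme and values are assumptions for illustration):

```python
import numpy as np

def combine_chronologically(features_by_time):
    """Concatenate features ordered from earliest to latest.

    features_by_time: list of (timestamp, feature_vector) pairs; earlier
    features are placed before later ones, matching the ordering rule
    described above.
    """
    ordered = sorted(features_by_time, key=lambda pair: pair[0])
    return np.concatenate([feat for _, feat in ordered])

combined = combine_chronologically([
    (2.0, np.array([0.4, 0.6])),   # current frame feature
    (1.0, np.array([0.1, 0.9])),   # earlier associated frame feature
])
# result: [0.1, 0.9, 0.4, 0.6] -- earlier feature comes first
```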
In some embodiments, the server may acquire the associated voxel features corresponding to the associated scene point cloud, and perform feature fusion on the target voxel features and the associated voxel features in chronological order to obtain target voxel temporal features.
In some embodiments, the server may perform feature fusion using the target image temporal features, the target point cloud temporal features, and the target voxel temporal features, to obtain secondarily fused image features, secondarily fused voxel features, and secondarily fused point cloud features. The feature fusion among the target image temporal features, the target point cloud temporal features, and the target voxel temporal features may follow the feature fusion method among the initial image features, the initial point cloud features, and the initial voxel features. The server may use the secondarily fused image features, the secondarily fused voxel features, and the secondarily fused point cloud features to perform image task learning, voxel task learning, and point cloud task learning, respectively.
In some embodiments, the image features may include position information of an object. The server may acquire the position of the object in the target image features to obtain a first position, and the position of the object in the associated image features to obtain a second position, and determine the motion state of the object according to the first position and the second position. For example, whether the object has changed lanes or turned may be determined according to the relative relationship between the first position and the second position, and the movement speed of the object may be determined according to the difference between the first position and the second position. Of course, the point cloud features and the voxel features may also include position information of the object, and may likewise be used to determine the motion state of the object.
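By way of a non-limiting illustration, the speed estimate from the difference between the first and second positions might look as follows; the time gap dt and the lane-normal direction used as a crude lane-change cue are assumptions, not values from the embodiment:

```python
import numpy as np

def motion_state(pos_prev, pos_curr, dt, lane_normal):
    """Estimate speed and lateral displacement between two observations.

    The speed follows from the positional difference over the time gap;
    the component along an assumed lane-normal direction gives a crude
    lane-change cue.
    """
    delta = np.asarray(pos_curr) - np.asarray(pos_prev)
    speed = np.linalg.norm(delta) / dt
    lateral = float(np.dot(delta, lane_normal))
    return speed, lateral

speed, lateral = motion_state([10.0, 2.0], [14.0, 2.5], dt=0.5,
                              lane_normal=np.array([0.0, 1.0]))
# speed in position units per second; a large |lateral| suggests a lane change
```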
In this embodiment, the associated scene image corresponding to the current scene image and the associated scene point cloud corresponding to the current scene point cloud are acquired; the associated image features corresponding to the associated scene image and the associated point cloud features corresponding to the associated scene point cloud are acquired; feature fusion is performed on the target image features and the associated image features in chronological order to obtain target image temporal features; feature fusion is performed on the target point cloud features and the associated point cloud features in chronological order to obtain target point cloud temporal features; and the object position corresponding to the scene object is determined based on the target image temporal features and the target point cloud temporal features. The temporal features thus include image features of different scene images and point cloud features of different scene point clouds, which improves the accuracy of the scene object position.
In some embodiments, determining the object position corresponding to the scene object based on the target image features and the target point cloud features includes: determining the combined position between the target image features and the target point cloud features to obtain a target combined position; and using the target combined position as the object position corresponding to the scene object.
Specifically, the combined position may be the merging of the position corresponding to the target image features and the position corresponding to the target point cloud features. The server may express the position corresponding to the target image features and the position corresponding to the target point cloud features as coordinates in the same coordinate system, for example both as coordinates in the image coordinate system, to obtain a first feature position corresponding to the target image features and a second feature position corresponding to the target point cloud features, and compute the result of merging the first feature position with the second feature position to obtain the object position corresponding to the scene object. There may be multiple scene objects.
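A non-limiting sketch of merging the first feature position with the second feature position, both already expressed in image coordinates (duplicate removal is an illustrative choice, not mandated by the embodiment):

```python
import numpy as np

def combine_positions(image_positions, point_cloud_positions_2d):
    """Merge the positions supported by the image features with those
    supported by the point cloud features, both given in the same
    (image) coordinate system, removing exact duplicates.
    """
    merged = np.vstack([image_positions, point_cloud_positions_2d])
    return np.unique(merged, axis=0)

obj_pos = combine_positions(np.array([[3, 4], [5, 6]]),
                            np.array([[5, 6], [7, 8]]))
# -> [[3 4], [5 6], [7 8]]
```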
In this embodiment, the combined position between the target image features and the target point cloud features is determined to obtain the target combined position, and the target combined position is used as the object position corresponding to the scene object, which improves the accuracy of the object position.
In some embodiments, the server may perform task learning using at least one of the target image features, the target point cloud features, or the target voxel features. The tasks may include low-level tasks and high-level tasks. The low-level tasks may include point-level semantic segmentation (Semantic Segmentation) and scene flow (Scene Flow) estimation, voxel-level semantic segmentation and scene flow estimation, and pixel-level semantic segmentation and scene flow estimation. The high-level tasks may include object detection, scene recognition, and instance segmentation (Instance Segmentation).
In some embodiments, as shown in FIG. 4, an object recognition system is provided. The object recognition system mainly includes a first multi-sensor feature extraction (Multi-Sensor Feature Extraction) module, a temporal fusion (Temporal Fusion) module, a second multi-sensor feature extraction module, an image task (Image View Tasks) learning module, a voxel task (Voxel Tasks) learning module, and a point task (Point Tasks) learning module. Each module may be implemented by one or more neural network models.
The multi-sensor feature extraction module supports both single-sensor and multi-sensor fusion methods; that is, the input may be data collected by a single sensor or data collected separately by multiple sensors. A sensor may be, for example, at least one of an image capture device or a point cloud collection device. The multi-sensor feature extraction module includes an image feature extraction module (Image Feature Extraction), a point cloud feature extraction module (Point Feature Extraction), a voxel feature extraction module (Voxel Feature Extraction), an image spatial fusion module (Image Spatial Fusion), a point cloud spatial fusion module (Point Spatial Fusion), a voxel spatial fusion module (Voxel Spatial Fusion), and a point cloud-voxel fusion module (Point-Voxel Fusion). The image spatial fusion module fuses point cloud features into image features; the point cloud spatial fusion module fuses image features into point features; the voxel spatial fusion module fuses image features into voxel features; and the point cloud-voxel fusion module fuses point features into voxel features and voxel features into point features. The temporal fusion module fuses the features obtained from images of different frames, i.e., performs concatenation along the feature dimension.
The temporal fusion module fuses the temporal information of the features across frames. For image features, the features may be concatenated along the pixel (Pixel) dimension, for example a pixel-dimension concatenation (Concate), or a correlation operation may be performed on two features. For point cloud features, similar to FlowNet3D, a feature extraction operation over each point's neighborhood may be performed, which is analogous to the correlation operation. Operations on voxel features are similar to those on image features, except that voxel feature operations process three-dimensional data.
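As a non-limiting illustration, a pixel-dimension concatenation and a simplified per-pixel correlation over two (C, H, W) feature maps might be sketched as follows; the correlation here is a plain channel-wise dot product, a simplified stand-in for the correlation operators used in optical-flow-style networks:

```python
import numpy as np

def pixel_concat(f1, f2):
    """Concatenate two (C, H, W) feature maps along the channel axis."""
    return np.concatenate([f1, f2], axis=0)

def correlation(f1, f2):
    """Per-pixel correlation of two feature maps: the channel-wise dot
    product at each spatial location."""
    return (f1 * f2).sum(axis=0)   # shape (H, W)

a = np.random.rand(8, 16, 16)
b = np.random.rand(8, 16, 16)
stacked = pixel_concat(a, b)       # (16, 16, 16)
corr = correlation(a, b)           # (16, 16)
```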
In some embodiments, multi-sensor multi-task fusion may be performed by the object recognition system, mainly including the following steps (a sketch of the overall flow is given after the list):
Step 1: Input the images and point clouds of the preceding and following frames;
Step 2: Input the image and point cloud at each moment into the multi-sensor feature extraction module;
Step 3: The multi-sensor feature extraction module outputs the image features, point features, and voxel features at each moment;
Step 4: Perform temporal fusion on the image features, point features, and voxel features output by the multi-sensor feature extraction module, to obtain three kinds of temporal features, namely image temporal features, point temporal features, and voxel temporal features;
Step 5: Input the three kinds of temporal features obtained in Step 4 into the multi-sensor feature extraction module and perform feature fusion again, to obtain the final image features, the final point features, and the final voxel features;
Step 6: Based on the final image features (Final ImageView Feature), the final point features (Final Point Feature), and the final voxel features (Final Voxel Feature), perform task learning at the image level, the point level, and the voxel level.
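The six steps above can be summarized as executable pseudocode. In the following non-limiting sketch, every helper (extract_features, temporal_fuse) is a hypothetical stand-in for the FIG. 4 modules, returning random fixed-size features purely so the control flow runs:

```python
import numpy as np

def extract_features(image, cloud):
    # Stand-in for the multi-sensor feature extraction module:
    # returns (image feature, point feature, voxel feature) for one moment.
    return np.random.rand(8), np.random.rand(8), np.random.rand(8)

def temporal_fuse(features):
    # Stand-in for the temporal fusion module: feature-dimension concatenation.
    return np.concatenate(features)

def multi_sensor_multi_task(frames):
    per_frame = [extract_features(img, cloud) for img, cloud in frames]  # steps 1-3
    img_seq = temporal_fuse([f[0] for f in per_frame])                   # step 4
    pt_seq = temporal_fuse([f[1] for f in per_frame])
    vox_seq = temporal_fuse([f[2] for f in per_frame])
    # Step 5: a second fusion pass, here modeled as one more concatenation.
    final = temporal_fuse([img_seq, pt_seq, vox_seq])
    return final  # step 6 would feed this to the image/point/voxel task heads

frames = [(np.zeros((3, 4, 4)), np.zeros((16, 3))) for _ in range(2)]
features = multi_sensor_multi_task(frames)
```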
With the object recognition system proposed in the above embodiments, different feature representations are adopted, i.e., multiple kinds of features, such as image features and point cloud features, are used and fused with one another, which improves the effectiveness of feature learning. Since the features can be obtained from data collected by different types of sensors, multi-sensor fusion is achieved and the robustness of the algorithm is improved: the multi-sensor feature extraction module can select the sensor feature inputs according to the validity of the sensors, i.e., the data collected by valid sensors can be selected as the input data of the multi-sensor feature extraction module. For example, if the camera fails, the data collected by the lidar can be used for the point tasks and the voxel tasks. A camera failure may mean that the camera is not functioning normally; a valid sensor may be a sensor that is functioning normally. Since the tasks cover low-level to high-level tasks, the effectiveness of task learning is improved. During training, all tasks may be trained together to improve the performance of the target task, and, according to business needs, only the network branch corresponding to the required task may be output during the inference (Inference) stage, to reduce the amount of computation. Inference is deep learning applying the capability learned during training to actual work; the inference stage can be understood as the stage in which the trained model is used. The object recognition system and object recognition method proposed above can be applied to autonomous driving perception algorithms: for autonomous vehicles equipped with cameras and lidars, tasks such as object detection, semantic segmentation, and scene flow estimation can be achieved. The results of scene flow estimation and semantic segmentation can serve as cues for non-deep-learning object detection methods based on point clouds, such as the cost term of clustering in cluster-based object detection.
It should be understood that although the steps in the flowcharts of FIGS. 2-4 are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least some of the steps in FIGS. 2-4 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be performed at different moments, and whose execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In some embodiments, as shown in FIG. 5, an object recognition apparatus is provided, including: a current scene image acquisition module 502, an initial point cloud feature obtaining module 504, a target image feature obtaining module 506, a target point cloud feature obtaining module 508, a position determination module 510, and a motion control module 512, wherein:
The current scene image acquisition module 502 is configured to acquire the current scene image and the current scene point cloud corresponding to the target moving object.
The initial point cloud feature obtaining module 504 is configured to perform image feature extraction on the current scene image to obtain initial image features, and perform point cloud feature extraction on the current scene point cloud to obtain initial point cloud features.
The target image feature obtaining module 506 is configured to acquire the target image position corresponding to the current scene image, and fuse the initial image features based on the point cloud features corresponding to the target image position among the initial point cloud features, to obtain the target image features.
The target point cloud feature obtaining module 508 is configured to acquire the target point cloud position corresponding to the current scene point cloud, and fuse the initial point cloud features based on the image features corresponding to the target point cloud position among the initial image features, to obtain the target point cloud features.
The position determination module 510 is configured to determine the object position corresponding to the scene object based on the target image features and the target point cloud features.
The motion control module 512 is configured to control the target moving object to move based on the position corresponding to the scene object.
在一些实施例中,目标图像特征得到模块506,包括:In some embodiments, the target image feature obtaining module 506 includes:
第一转换位置得到单元,用于根据点云坐标系与图像坐标系之间的坐标转换关系,将目标点云位置转换为图像坐标系中的位置,得到第一转换位置。The first conversion position obtaining unit is used for converting the target point cloud position into the position in the image coordinate system according to the coordinate conversion relationship between the point cloud coordinate system and the image coordinate system, so as to obtain the first conversion position.
目标图像特征得到单元,用于获取第一转换位置与目标图像位置的第一重合位置,将初始点云特征中,第一重合位置对应的点云特征,融合到初始图像特征中第一重合位置对应的图像特征中,得到目标图像特征。The target image feature obtaining unit is used to obtain the first coincidence position of the first conversion position and the target image position, and fuse the point cloud feature corresponding to the first coincidence position in the initial point cloud feature into the first coincidence position in the initial image feature From the corresponding image features, the target image features are obtained.
在一些实施例中,目标点云特征得到模块508,包括:In some embodiments, the target point cloud feature obtaining module 508 includes:
第二转换位置得到单元,用于根据图像坐标系与点云坐标系之间的坐标转换关系,将目标图像位置转换为点云坐标系中的位置,得到第二转换位置。The second conversion position obtaining unit is used for converting the target image position into a position in the point cloud coordinate system according to the coordinate conversion relationship between the image coordinate system and the point cloud coordinate system to obtain the second conversion position.
目标点云特征得到单元,用于获取第二转换位置与目标点云位置的第二重合位置,将初始图像特征中,第二重合位置对应的图像特征,融合到初始点云特征中第二重合位置对应的点云特征中,得到目标点云特征。The target point cloud feature obtaining unit is used to obtain the second coincidence position between the second conversion position and the target point cloud position, and fuse the image features corresponding to the second coincidence position in the initial image feature into the second coincidence position in the initial point cloud feature. In the point cloud feature corresponding to the position, the target point cloud feature is obtained.
在一些实施例中,目标点云特征得到单元,还用于对当前场景点云进行体素化,得到体素化结果;根据体素化结果进行体素特征提取,得到初始体素特征;获取第二转换位置与目标点云位置的第二重合位置,将初始图像特征中,第二重合位置对应的图像特征,融合到初始点云特征中第二重合位置对应的点云特征中,得到中间点云特征;获取当前场景点云对应的目标体素位置,根据体素坐标系与点云坐标系之间的坐标转换关系,将目标体素位置转换为点云坐标系中的位置,得到第三转换位置;及获取第三转换位置与目标体素位置的第三重合位置,将初始体素特征中,第三重合位置对应的体素特征,融合到中间点云特征中第三转换位置对应的点云特征中,得到目标点云特征。In some embodiments, the target point cloud feature obtaining unit is further configured to voxelize the current scene point cloud to obtain a voxelization result; perform voxel feature extraction according to the voxelization result to obtain an initial voxel feature; obtain The second conversion position and the second coincidence position of the target point cloud position, the image features corresponding to the second coincidence position in the initial image features are fused into the point cloud features corresponding to the second coincidence position in the initial point cloud features, and the middle point is obtained. Point cloud feature; obtain the target voxel position corresponding to the point cloud of the current scene, and convert the target voxel position to the position in the point cloud coordinate system according to the coordinate transformation relationship between the voxel coordinate system and the point cloud coordinate system, and obtain the first Three transformation positions; and obtaining the third overlapping position of the third transformation position and the target voxel position, and merging the voxel features corresponding to the third overlapping position in the initial voxel features into the third transformation in the intermediate point cloud feature In the point cloud feature corresponding to the position, the target point cloud feature is obtained.
In some embodiments, the apparatus further includes:
The voxelization result obtaining module is used to voxelize the current scene point cloud to obtain a voxelization result.
The initial voxel feature obtaining module is used to perform voxel feature extraction on the voxelization result to obtain initial voxel features.
The fourth converted position obtaining module is used to obtain target voxel positions corresponding to the current scene point cloud, and to convert the target image position into a position in the voxel coordinate system according to the coordinate conversion relationship between the image coordinate system and the voxel coordinate system, to obtain a fourth converted position.
The target voxel feature obtaining module is used to obtain a fourth overlapping position between the fourth converted position and the voxel positions, and to fuse the image features corresponding to the fourth overlapping position in the initial image features into the voxel features corresponding to the fourth overlapping position in the initial voxel features, to obtain target voxel features.
In some embodiments, the position determination module 510 includes:
The associated scene image acquisition unit is used to acquire an associated scene image corresponding to the current scene image, and an associated scene point cloud corresponding to the current scene point cloud.
The associated image feature acquisition unit is used to acquire associated image features corresponding to the associated scene image, and associated point cloud features corresponding to the associated scene point cloud.
The target image temporal feature obtaining unit is used to perform feature fusion on the target image features and the associated image features in chronological order, to obtain target image temporal features.
The target point cloud temporal feature obtaining unit is used to perform feature fusion on the target point cloud features and the associated point cloud features in chronological order, to obtain target point cloud temporal features.
The position determination unit is used to determine the object position corresponding to the scene object based on the target image temporal features and the target point cloud temporal features.
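The embodiments leave the temporal fusion operator open; purely as one hedged illustration of "feature fusion in chronological order", an exponential moving average over frame-wise feature maps (a learned GRU/ConvLSTM or channel concatenation followed by a convolution would equally fit):

```python
import numpy as np

def fuse_temporal(feature_frames, decay=0.5):
    """Fuse a chronologically ordered list (oldest -> newest) of feature maps
    with an exponential moving average, so recent frames weigh more."""
    fused = np.asarray(feature_frames[0], dtype=float)
    for feat in feature_frames[1:]:
        fused = decay * fused + (1.0 - decay) * np.asarray(feat, dtype=float)
    return fused

# Hypothetical usage: associated (earlier) frame features, then the current one.
# temporal_feats = fuse_temporal([assoc_feats_t2, assoc_feats_t1, target_feats])
```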
In some embodiments, the position determination module 510 includes:
The target combined position obtaining unit is used to determine a combined position between the target image features and the target point cloud features, to obtain a target combined position.
The object position obtaining unit is used to take the target combined position as the object position corresponding to the scene object.
For specific limitations on the object recognition apparatus, reference may be made to the limitations on the object recognition method above, which are not repeated here. Each module in the above object recognition apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in, or independent of, a processor in a computer device in the form of hardware, or stored in a memory in the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to each of the above modules.
In some embodiments, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device is used to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the non-volatile storage medium. The database of the computer device is used to store data such as the current scene image, the current scene point cloud, point cloud features, image features, and voxel features. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instructions, when executed by the processor, implement an object recognition method.
Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of a partial structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
A computer device includes a memory and one or more processors. The memory stores computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the above object recognition method.
One or more computer storage media store computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the above object recognition method.
The computer storage medium is a readable storage medium, which may be non-volatile or volatile.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, storage, database, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the invention patent. It should be noted that those of ordinary skill in the art may make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present patent application shall be subject to the appended claims.

Claims (10)

  1. An object recognition method, comprising:
    acquiring a current scene image and a current scene point cloud corresponding to a target moving object;
    performing image feature extraction on the current scene image to obtain initial image features, and performing point cloud feature extraction on the current scene point cloud to obtain initial point cloud features;
    acquiring a target image position corresponding to the current scene image, and performing fusion processing on the initial image features based on point cloud features that correspond to the target image position among the initial point cloud features, to obtain target image features;
    acquiring a target point cloud position corresponding to the current scene point cloud, and performing fusion processing on the initial point cloud features based on image features that correspond to the target point cloud position among the initial image features, to obtain target point cloud features;
    determining an object position corresponding to a scene object based on the target image features and the target point cloud features; and
    controlling the target moving object to move based on the position corresponding to the scene object.
  2. The method according to claim 1, wherein acquiring the target image position corresponding to the current scene image, and performing fusion processing on the initial image features based on the point cloud features that correspond to the target image position among the initial point cloud features, to obtain the target image features, comprises:
    converting the target point cloud position into a position in the image coordinate system according to a coordinate conversion relationship between the point cloud coordinate system and the image coordinate system, to obtain a first converted position; and
    acquiring a first overlapping position between the first converted position and the target image position, and fusing the point cloud features corresponding to the first overlapping position in the initial point cloud features into the image features corresponding to the first overlapping position in the initial image features, to obtain the target image features.
  3. The method according to claim 1, wherein acquiring the target point cloud position corresponding to the current scene point cloud, and performing fusion processing on the initial point cloud features based on the image features that correspond to the target point cloud position among the initial image features, to obtain the target point cloud features, comprises:
    converting the target image position into a position in the point cloud coordinate system according to a coordinate conversion relationship between the image coordinate system and the point cloud coordinate system, to obtain a second converted position; and
    acquiring a second overlapping position between the second converted position and the target point cloud position, and fusing the image features corresponding to the second overlapping position in the initial image features into the point cloud features corresponding to the second overlapping position in the initial point cloud features, to obtain the target point cloud features.
  4. The method according to claim 3, wherein acquiring the second overlapping position between the second converted position and the target point cloud position, and fusing the image features corresponding to the second overlapping position in the initial image features into the point cloud features corresponding to the second overlapping position in the initial point cloud features, to obtain the target point cloud features, comprises:
    voxelizing the current scene point cloud to obtain a voxelization result;
    performing voxel feature extraction on the voxelization result to obtain initial voxel features;
    acquiring the second overlapping position between the second converted position and the target point cloud position, and fusing the image features corresponding to the second overlapping position in the initial image features into the point cloud features corresponding to the second overlapping position in the initial point cloud features, to obtain intermediate point cloud features;
    acquiring target voxel positions corresponding to the current scene point cloud, and converting the target voxel positions into positions in the point cloud coordinate system according to a coordinate conversion relationship between the voxel coordinate system and the point cloud coordinate system, to obtain a third converted position; and
    acquiring a third overlapping position between the third converted position and the target voxel positions, and fusing the voxel features corresponding to the third overlapping position in the initial voxel features into the point cloud features corresponding to the third converted position in the intermediate point cloud features, to obtain the target point cloud features.
  5. The method according to claim 1, wherein the method further comprises:
    voxelizing the current scene point cloud to obtain a voxelization result;
    performing voxel feature extraction on the voxelization result to obtain initial voxel features;
    acquiring target voxel positions corresponding to the current scene point cloud, and converting the target image position into a position in the voxel coordinate system according to a coordinate conversion relationship between the image coordinate system and the voxel coordinate system, to obtain a fourth converted position; and
    acquiring a fourth overlapping position between the fourth converted position and the voxel positions, and fusing the image features corresponding to the fourth overlapping position in the initial image features into the voxel features corresponding to the fourth overlapping position in the initial voxel features, to obtain target voxel features.
  6. The method according to claim 1, wherein determining the object position corresponding to the scene object based on the target image features and the target point cloud features comprises:
    acquiring an associated scene image corresponding to the current scene image, and an associated scene point cloud corresponding to the current scene point cloud;
    acquiring associated image features corresponding to the associated scene image, and associated point cloud features corresponding to the associated scene point cloud;
    performing feature fusion on the target image features and the associated image features in chronological order, to obtain target image temporal features;
    performing feature fusion on the target point cloud features and the associated point cloud features in chronological order, to obtain target point cloud temporal features; and
    determining the object position corresponding to the scene object based on the target image temporal features and the target point cloud temporal features.
  7. The method according to claim 1, wherein determining the object position corresponding to the scene object based on the target image features and the target point cloud features comprises:
    determining a combined position between the target image features and the target point cloud features, to obtain a target combined position; and
    taking the target combined position as the object position corresponding to the scene object.
  8. An object recognition apparatus, comprising:
    a current scene image acquisition module, used to acquire a current scene image and a current scene point cloud corresponding to a target moving object;
    an initial point cloud feature obtaining module, used to perform image feature extraction on the current scene image to obtain initial image features, and to perform point cloud feature extraction on the current scene point cloud to obtain initial point cloud features;
    a target image feature obtaining module, used to acquire a target image position corresponding to the current scene image, and to perform fusion processing on the initial image features based on point cloud features that correspond to the target image position among the initial point cloud features, to obtain target image features;
    a target point cloud feature obtaining module, used to acquire a target point cloud position corresponding to the current scene point cloud, and to perform fusion processing on the initial point cloud features based on image features that correspond to the target point cloud position among the initial image features, to obtain target point cloud features;
    a position determination module, used to determine an object position corresponding to a scene object based on the target image features and the target point cloud features; and
    a motion control module, used to control the target moving object to move based on the position corresponding to the scene object.
  9. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions that, when executed by the one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1 to 7.
  10. One or more computer storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1 to 7.
PCT/CN2020/128125 2020-11-11 2020-11-11 Object identification method and apparatus, computer device, and storage medium WO2022099510A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080092994.8A CN115004259B (en) 2020-11-11 2020-11-11 Object recognition method, device, computer equipment and storage medium
PCT/CN2020/128125 WO2022099510A1 (en) 2020-11-11 2020-11-11 Object identification method and apparatus, computer device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/128125 WO2022099510A1 (en) 2020-11-11 2020-11-11 Object identification method and apparatus, computer device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022099510A1 true WO2022099510A1 (en) 2022-05-19

Family

ID=81601893

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/128125 WO2022099510A1 (en) 2020-11-11 2020-11-11 Object identification method and apparatus, computer device, and storage medium

Country Status (2)

Country Link
CN (1) CN115004259B (en)
WO (1) WO2022099510A1 (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116740669B (en) * 2023-08-16 2023-11-14 之江实验室 Multi-view image detection method, device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110045729A (en) * 2019-03-12 2019-07-23 广州小马智行科技有限公司 A kind of Vehicular automatic driving method and device
US10634793B1 (en) * 2018-12-24 2020-04-28 Automotive Research & Testing Center Lidar detection device of detecting close-distance obstacle and method thereof
CN111191600A (en) * 2019-12-30 2020-05-22 深圳元戎启行科技有限公司 Obstacle detection method, obstacle detection device, computer device, and storage medium
CN111563923A (en) * 2020-07-15 2020-08-21 浙江大华技术股份有限公司 Method for obtaining dense depth map and related device
CN111797734A (en) * 2020-06-22 2020-10-20 广州视源电子科技股份有限公司 Vehicle point cloud data processing method, device, equipment and storage medium


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116246287A (en) * 2023-03-15 2023-06-09 北京百度网讯科技有限公司 Target object recognition method, training device and storage medium
CN116246287B (en) * 2023-03-15 2024-03-22 北京百度网讯科技有限公司 Target object recognition method, training device and storage medium
CN116958766A (en) * 2023-07-04 2023-10-27 阿里巴巴(中国)有限公司 Image processing method

Also Published As

Publication number Publication date
CN115004259A (en) 2022-09-02
CN115004259B (en) 2023-08-15

Similar Documents

Publication Publication Date Title
WO2022099510A1 (en) Object identification method and apparatus, computer device, and storage medium
CN111160302B (en) Obstacle information identification method and device based on automatic driving environment
CN110163904B (en) Object labeling method, movement control method, device, equipment and storage medium
Cheng et al. Noise-aware unsupervised deep lidar-stereo fusion
CN111223135B (en) System and method for enhancing range estimation by monocular cameras using radar and motion data
US11113526B2 (en) Training methods for deep networks
US20210183083A1 (en) Self-supervised depth estimation method and system
CN111191600A (en) Obstacle detection method, obstacle detection device, computer device, and storage medium
US20210097266A1 (en) Disentangling human dynamics for pedestrian locomotion forecasting with noisy supervision
KR20190087258A (en) Object pose estimating method and apparatus
JP7135665B2 (en) VEHICLE CONTROL SYSTEM, VEHICLE CONTROL METHOD AND COMPUTER PROGRAM
KR20210025942A (en) Method for stereo matching usiing end-to-end convolutional neural network
EP3992908A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
US11436839B2 (en) Systems and methods of detecting moving obstacles
US11321859B2 (en) Pixel-wise residual pose estimation for monocular depth estimation
US11443151B2 (en) Driving assistant system, electronic device, and operation method thereof
CN116469079A (en) Automatic driving BEV task learning method and related device
US11625905B2 (en) System and method for tracking occluded objects
CN116681739A (en) Target motion trail generation method and device and electronic equipment
US20230109473A1 (en) Vehicle, electronic apparatus, and control method thereof
US20230400863A1 (en) Information processing device, information processing system, method, and program
CN115346184A (en) Lane information detection method, terminal and computer storage medium
WO2022127451A1 (en) Method and apparatus for determining spatial state of elevator, and storage medium
CN115703234A (en) Robot control method, robot control device, robot, and storage medium
WO2019188392A1 (en) Information processing device, information processing method, program, and moving body

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20961069

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.10.2023)