CN116740498A - Model pre-training method, model training method, object processing method and device


Info

Publication number
CN116740498A
CN116740498A (application CN202310701200.9A)
Authority
CN
China
Prior art keywords
image
point cloud
features
sample
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310701200.9A
Other languages
Chinese (zh)
Other versions
CN116740498B (en)
Inventor
王学宽 (Wang Xuekuan)
顾闻 (Gu Wen)
张伟 (Zhang Wei)
谭啸 (Tan Xiao)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202310701200.9A
Publication of CN116740498A
Application granted
Publication of CN116740498B
Legal status: Active


Classifications

    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V 10/48: Extraction of image or video features by mapping characteristic values of the pattern into a parameter space, e.g. Hough transformation
    • G06V 10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G06V 10/806: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a model pre-training method, a model training method, an object processing method and an apparatus, relating to the field of artificial intelligence, in particular to computer vision, augmented reality, virtual reality and deep learning, and applicable to scenarios such as automatic driving and intelligent transportation. A specific implementation scheme is as follows: inputting a first image sample into an image feature extraction network to obtain image features; inputting a first point cloud sample into a point cloud feature extraction network to obtain point cloud image features; determining a plurality of target points from the first image sample; mapping a plurality of first image point features corresponding to the plurality of target points in the image features to a bird's eye view space to obtain a plurality of second image point features; mapping a plurality of first point cloud features corresponding to the plurality of target points in the point cloud image features to the bird's eye view space to obtain a plurality of second point cloud features; and contrastively training the image feature extraction network and the point cloud feature extraction network by using the plurality of second image point features and the plurality of second point cloud features.

Description

Model pre-training method, model training method, object processing method and device
Technical Field
The disclosure relates to the field of artificial intelligence, in particular to computer vision, augmented reality, virtual reality and deep learning, is applicable to scenarios such as automatic driving and intelligent transportation, and specifically relates to a model pre-training method, a model training method, an object processing method and apparatus, an electronic device and a storage medium.
Background
Self-supervised learning mines supervision signals from large-scale unlabeled data by means of auxiliary (pretext) tasks and trains a network with the supervision constructed in this way, so that representations valuable for downstream tasks can be learned. Compared with supervised training, the representations learned by self-supervised training generalize better and achieve better results when transferred to downstream tasks.
Disclosure of Invention
The disclosure provides a model pre-training method, a model training method, an object processing method and apparatus, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a model pre-training method, including: inputting the first image sample into an image feature extraction network to obtain image features; inputting a first point cloud sample into a point cloud feature extraction network to obtain point cloud image features, wherein the first point cloud sample and the first image sample are acquired at the same time aiming at the same object; determining a plurality of target points from the first image sample; mapping a plurality of first image point features corresponding to the plurality of target points in the image features to a bird's eye view space to obtain a plurality of second image point features; mapping a plurality of first point cloud features corresponding to the plurality of target points in the point cloud image features to the aerial view space to obtain a plurality of second point cloud features; and performing contrast training on the image feature extraction network and the point cloud feature extraction network by using the plurality of second image point features and the plurality of second point cloud features.
According to another aspect of the present disclosure, there is provided a model training method including: inputting a third image sample into an image processing model to obtain a first output feature, wherein the image processing model comprises a pre-trained image feature extraction network and an image processing network; and obtaining a first loss based on the label of the third image sample and the first output feature, and training the image processing model based on the first loss.
According to another aspect of the present disclosure, there is provided a model training method including: inputting a third point cloud sample into a point cloud data processing model to obtain a second output characteristic, wherein the point cloud data processing model comprises a pre-trained point cloud characteristic extraction network and a point cloud data processing network; and obtaining a second loss based on the label of the third point cloud sample and the second output characteristic, and training the point cloud data processing model based on the second loss.
According to another aspect of the present disclosure, there is provided a model training method including: inputting a fourth image sample included in the training sample into a first backbone network of the object processing model to obtain a third output characteristic, wherein the first backbone network is a pre-trained image characteristic extraction network; inputting a fourth point cloud sample included in the training sample into a second backbone network of the object processing model to obtain a fourth output characteristic, wherein the second backbone network is a pre-trained point cloud characteristic extraction network; performing feature fusion on the third output feature and the fourth output feature to obtain a fusion feature; inputting the fusion characteristics into an object processing network of the object processing model to obtain fifth output characteristics; and obtaining a third loss based on the label of the training sample and the fifth output feature, and training the object processing model based on the third loss.
According to another aspect of the present disclosure, there is provided an object processing method including: and inputting the image to be processed comprising the target object into an image processing model to obtain an image processing result.
According to another aspect of the present disclosure, there is provided an object processing method including: and inputting the point cloud to be processed comprising the target object into a point cloud data processing model to obtain a point cloud data processing result.
According to another aspect of the present disclosure, there is provided an object processing method including: and inputting the image to be processed and the point cloud to be processed comprising the target object into an object processing model to obtain an object processing result.
According to another aspect of the present disclosure, there is provided a model pre-training apparatus, including: the first input module is used for inputting the first image sample into the image feature extraction network to obtain image features; the second input module is used for inputting a first point cloud sample into the point cloud feature extraction network to obtain point cloud image features, wherein the first point cloud sample and the first image sample are acquired at the same time aiming at the same object; a first determining module, configured to determine a plurality of target points from the first image sample; the first mapping module is used for mapping a plurality of first image point features corresponding to the plurality of target points in the image features to a bird's eye view space to obtain a plurality of second image point features; the second mapping module is used for mapping a plurality of first point cloud features corresponding to the plurality of target points in the point cloud image features to the aerial view space to obtain a plurality of second point cloud features; and the first training module is used for performing contrast training on the image feature extraction network and the point cloud feature extraction network by utilizing the plurality of second image point features and the plurality of second point cloud features.
According to another aspect of the present disclosure, there is provided a model training apparatus including: the third input module is used for inputting a third image sample into the image processing model to obtain a first output characteristic, wherein the image processing model comprises a pre-trained image characteristic extraction network and an image processing network; and a second training module configured to obtain a first loss based on the label of the third image sample and the first output feature, and train the image processing model based on the first loss.
According to another aspect of the present disclosure, there is provided a model training apparatus including: the fourth input module is used for inputting the third point cloud sample into the point cloud data processing model to obtain a second output characteristic, wherein the point cloud data processing model comprises a pre-trained point cloud characteristic extraction network and a point cloud data processing network; and a third training module, configured to obtain a second loss based on the label of the third point cloud sample and the second output feature, and train the point cloud data processing model based on the second loss.
According to another aspect of the present disclosure, there is provided a model training apparatus including: a fifth input module, configured to input a fourth image sample included in the training sample into a first backbone network of the object processing model to obtain a third output feature, where the first backbone network is a pre-trained image feature extraction network; a sixth input module, configured to input a fourth point cloud sample included in the training sample into a second backbone network of the object processing model to obtain a fourth output feature, where the second backbone network is a pre-trained point cloud feature extraction network; the feature fusion module is used for carrying out feature fusion on the third output feature and the fourth output feature to obtain fusion features; a seventh input module, configured to input the fusion feature into an object processing network of the object processing model, to obtain a fifth output feature; and a fourth training module configured to obtain a third loss based on the label of the training sample and the fifth output feature, and train the object processing model based on the third loss.
According to another aspect of the present disclosure, there is provided an object processing apparatus including: the first processing module is used for inputting the image to be processed including the target object into the image processing model to obtain an image processing result.
According to another aspect of the present disclosure, there is provided an object processing apparatus including: and the second processing module is used for inputting the point cloud to be processed comprising the target object into the point cloud data processing model to obtain a point cloud data processing result.
According to another aspect of the present disclosure, there is provided an object processing apparatus including: and the third processing module is used for inputting the image to be processed and the point cloud to be processed comprising the target object into the object processing model to obtain an object processing result.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as described above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 schematically illustrates a flow chart of a model pre-training method according to an embodiment of the present disclosure.
Fig. 2 schematically illustrates a schematic diagram of a model pre-training method according to an embodiment of the present disclosure.
Fig. 3 schematically illustrates a schematic diagram of a model pre-training method according to another embodiment of the present disclosure.
Fig. 4A schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure.
Fig. 4B schematically illustrates a flowchart of an object processing method according to an embodiment of the present disclosure.
Fig. 5A schematically illustrates a flow chart of a model training method according to another embodiment of the present disclosure.
Fig. 5B schematically illustrates a flow chart of an object processing method according to another embodiment of the present disclosure.
Fig. 6A schematically illustrates a flow chart of a model training method according to yet another embodiment of the present disclosure.
Fig. 6B schematically illustrates a flowchart of an object processing method according to a further embodiment of the present disclosure.
Fig. 7 schematically illustrates a block diagram of a model pre-training apparatus according to an embodiment of the disclosure.
Fig. 8A schematically illustrates a block diagram of a model training apparatus according to an embodiment of the present disclosure.
Fig. 8B schematically illustrates a block diagram of an object processing apparatus according to an embodiment of the present disclosure.
Fig. 9A schematically illustrates a block diagram of a model training apparatus according to another embodiment of the present disclosure.
Fig. 9B schematically illustrates a block diagram of an object processing apparatus according to another embodiment of the present disclosure.
Fig. 10A schematically illustrates a block diagram of a model training apparatus according to yet another embodiment of the present disclosure.
Fig. 10B schematically illustrates a block diagram of an object processing apparatus according to still another embodiment of the present disclosure.
FIG. 11 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the continuous development of machine learning, computer vision and related technologies, deep learning methods are widely applied in fields such as automatic driving and intelligent transportation to solve downstream tasks including 3D object detection, 3D object segmentation, and lane line detection and segmentation. The success of deep learning on these downstream tasks is largely attributable to the powerful visual representation and learning capabilities of convolutional neural networks, which allow a network to be transferred between different tasks.
In the related art, a deep learning network is usually obtained with a training paradigm that combines pre-training and fine-tuning. That is, training is divided into two steps: the network is first pre-trained on a large-scale data set so that it learns powerful visual representations suitable for image understanding tasks, and the representation capability learned on the large-scale data set is then transferred to a downstream task whose data set is comparatively small. Compared with training directly on the downstream task, this pre-training plus fine-tuning paradigm achieves a better training effect while reducing the model training cost.
However, pre-training a network on a large data set generally relies on a large amount of manually labeled data, and as networks grow deeper and larger, manual labeling can no longer meet their training requirements. For this reason, the related art increasingly uses unsupervised learning methods to pre-train networks.
Self-supervised learning is a form of unsupervised learning. Its aim is to automatically generate pseudo labels for unlabeled data through designed self-supervised tasks, without using image annotations, and to pre-train a neural network with the pseudo labels and the corresponding self-supervised tasks. Compared with supervised pre-training, the image representations obtained through self-supervised pre-training generalize better and achieve better results when transferred to downstream tasks.
In the related art, point cloud and image multi-modal self-supervision is mainly realized by two families of methods: masked autoencoders that jointly process image and point cloud data, and contrastive learning schemes over image and point cloud data.
In the masked-autoencoder approach with joint image and point cloud modalities, masked image data and masked point cloud data are taken as inputs of a deep neural network and fed to masked autoencoders of the two modalities to obtain encoding results under the different modalities. The encoded features are then fed into a Transformer-based decoder to predict the masked parts of the original data, which may be the original image and point cloud data or the feature data produced by a feature encoder.
Contrastive learning schemes over image and point cloud data can operate at several different granularities. One scheme is based on instance-level feature representation: its inputs are usually the point cloud data of specific objects in indoor scenes, such as tables and chairs, together with the corresponding image data. The data of the two modalities are aligned and encoded by an image feature encoder and a point cloud feature encoder, respectively, producing two feature vectors of the same dimension. The paired feature vectors are then used as inputs of contrastive learning to compute a loss function, completing instance-level multi-modal self-supervised feature learning. The other scheme is based on point-level feature representation. Unlike the former, it focuses on features at the granularity of the feature map: after features are extracted by a deep convolutional neural network, the feature map downsampled by a factor of n is taken as the image branch input of contrastive learning, while the features corresponding to discrete points in the point cloud are obtained by a point cloud feature extractor. The mapping between feature points in the image and feature points in the point cloud is then constructed through the intrinsic and extrinsic parameter matrices of the sensor devices, yielding paired data at the point feature level, which is used as the input of contrastive learning to compute the loss function and complete point-level multi-modal self-supervised feature learning.
However, in scenarios such as automatic driving and intelligent transportation, the number of background points in image data and point cloud data is generally far greater than the number of points carrying effective information. Therefore, when contrastive learning is performed with all paired points of the image data and the point cloud data, as in the related art, the network tends to focus on the low-value background points and to ignore the small number of effective information points.
In view of this, embodiments of the present disclosure provide a model pre-training method, a model training method, an object processing method, an apparatus, an electronic device and a storage medium, which replace random or exhaustive matching points with image key regions as the matching points of the image and point cloud data under a bird's eye view, thereby constructing a more robust multi-modal contrastive learning scheme. This establishes an intrinsic connection between point cloud features and image features and realizes the migration and complementarity of point cloud and image multi-modal information, so as to improve the feature expression capability of the deep learning model used in a single modality or in multiple modalities. Specifically, the model pre-training method includes: inputting a first image sample into an image feature extraction network to obtain image features; inputting a first point cloud sample into a point cloud feature extraction network to obtain point cloud image features, where the first point cloud sample and the first image sample are acquired for the same object at the same time; determining a plurality of target points from the first image sample; mapping a plurality of first image point features corresponding to the plurality of target points in the image features to a bird's eye view space to obtain a plurality of second image point features; mapping a plurality of first point cloud features corresponding to the plurality of target points in the point cloud image features to the bird's eye view space to obtain a plurality of second point cloud features; and contrastively training the image feature extraction network and the point cloud feature extraction network by using the plurality of second image point features and the plurality of second point cloud features.
In the technical scheme of the disclosure, the related processes of collecting, storing, using, processing, transmitting, providing, disclosing, applying and the like of the personal information of the user all conform to the regulations of related laws and regulations, necessary security measures are adopted, and the public order harmony is not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Fig. 1 schematically illustrates a flow chart of a model pre-training method according to an embodiment of the present disclosure.
As shown in fig. 1, the method 100 includes operations S110 to S160.
In operation S110, a first image sample is input into an image feature extraction network to obtain image features.
In operation S120, the first point cloud sample is input into the point cloud feature extraction network to obtain a point cloud image feature.
In operation S130, a plurality of target points are determined from the first image sample.
In operation S140, a plurality of first image point features corresponding to a plurality of target points among the image features are mapped to the bird' S eye view space, resulting in a plurality of second image point features.
In operation S150, a plurality of first point cloud features corresponding to a plurality of target points in the point cloud image features are mapped to the aerial view space, and a plurality of second point cloud features are obtained.
In operation S160, the image feature extraction network and the point cloud feature extraction network are contrast trained using the plurality of second image point features and the plurality of second point cloud features.
According to an embodiment of the present disclosure, the first image sample and the first point cloud sample may be acquired for the same object at the same time, i.e., they are already aligned in the time dimension before being used to train the model. The first point cloud sample may be acquired by a radar device such as a lidar, and the first image sample by an image acquisition device such as an RGB camera or another type of camera. Alternatively, the first image sample and the first point cloud sample may be calibrated and clock-synchronized image data and point cloud data acquired in scenarios such as intelligent transportation and automatic driving.
According to embodiments of the present disclosure, a point cloud may refer to a set of three-dimensional coordinate points in a spatial coordinate system. The first point cloud sample may be a set of three-dimensional coordinate points of the surface of the acquired object.
According to an embodiment of the present disclosure, the network structure of the image feature extraction network may select, for example, a residual structure, a deep convolution structure, or the like, which is not limited herein. Alternatively, the network structure of the image feature extraction network may be selected according to a specific application scenario, for example, in a scenario of road surface object recognition, the network structure of the image feature extraction network may select a deep convolution structure.
According to the embodiment of the present disclosure, the point cloud feature extraction network may be a feature extraction network in units of voxels, such as a voxel network, or may be a feature extraction network in units of points, such as a point cloud detection network, a point pixel detection network, or the like, which is not limited herein.
According to an embodiment of the present disclosure, an image feature may be composed of a plurality of image point features. For example, if an image feature has the shape 16×16×256, it can be regarded as being composed of 256 image point features of size 1×256, one per spatial location. Similarly, a point cloud image feature may be composed of a plurality of point cloud features. Alternatively, the image features and the point cloud image features obtained after feature extraction may have the same dimensions, or the image point features included in the image features and the point cloud features included in the point cloud image features may have the same dimensions.
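As a small illustration of this point-feature view (the shapes follow the example above; the tensor values are arbitrary), a feature map can simply be flattened so that each spatial location yields one image point feature:

```python
import torch

# Illustrative only: a 16 x 16 x 256 image feature viewed as 256 image point
# features of size 1 x 256, one per spatial location.
image_feature = torch.randn(16, 16, 256)               # H x W x C
image_point_features = image_feature.reshape(-1, 256)  # 256 rows, each a 1 x 256 point feature
```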
According to an embodiment of the present disclosure, the plurality of target points are determined from the first image sample, for example from the plurality of pixel points included in the first image sample. A target point may be a pixel point in the first image sample that carries effective information, for example a pixel point belonging to the foreground, or a pixel point related to a key point such as a corner or an object center point.
According to the embodiment of the disclosure, after the plurality of target points are determined, they may be marked in the first image sample, and the points corresponding to the plurality of target points in the first point cloud sample may also be marked. After feature extraction, the plurality of first image point features corresponding to the plurality of target points can then be determined based on the markers of the image point features in the image features, and correspondingly, the plurality of first point cloud features corresponding to the plurality of target points can be determined based on the markers of the point cloud features in the point cloud image features. Alternatively, the plurality of first image point features and the plurality of first point cloud features may be determined based on the coordinate information of the plurality of target points in the image space or the point cloud space. Specifically, the coordinate information of the plurality of target points may be processed in a manner similar to that of the feature extraction network, and the plurality of first image point features and the plurality of first point cloud features may be determined based on the processed coordinate information.
According to embodiments of the present disclosure, the bird's eye view may be represented as a projection view in a vertical direction. The points in the bird's eye view may have a value of zero in the height dimension.
According to embodiments of the present disclosure, the first image point feature is typically a 2-dimensional image feature, and thus, prior to mapping the first image point feature to the bird's eye view space, depth information may be supplemented for the first image point feature to convert the first image point feature to a 3-dimensional image feature. The method for determining the depth information of the first image point feature may be implemented by using various depth estimation methods, which are not limited herein.
According to an embodiment of the present disclosure, the contrastive training may take matched second point cloud features and second image point features as positive sample pairs, take unmatched second point cloud features and second image point features as negative sample pairs, and train using the positive sample pairs and the negative sample pairs.
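The disclosure does not fix a particular loss; a minimal InfoNCE-style sketch, assuming N matched pairs of second image point features and second point cloud features of equal dimension (the function name and the temperature value are illustrative), could look as follows:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats: torch.Tensor,
                     pc_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: row i of img_feats and row i of pc_feats form a
    positive pair; all other cross-modal combinations act as negatives."""
    img = F.normalize(img_feats, dim=-1)   # (N, D) second image point features
    pc = F.normalize(pc_feats, dim=-1)     # (N, D) second point cloud features
    logits = img @ pc.t() / temperature    # (N, N) cross-modal similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric loss over the image-to-point-cloud and point-cloud-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```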
According to the embodiment of the disclosure, in the model pre-training process based on self-supervised contrastive learning, a plurality of target points can be selected, and the plurality of first image point features corresponding to the target points in the image features output by the image feature extraction network, as well as the plurality of first point cloud features in the point cloud image features output by the point cloud feature extraction network, are respectively mapped into the bird's eye view space to obtain a plurality of second image point features and a plurality of second point cloud features. Positive and negative sample pairs can then be determined from the second image point features and the second point cloud features, and the contrastive training of the image feature extraction network and the point cloud feature extraction network is completed with these pairs. Because the key-region extraction algorithm effectively mines key multi-modal point pair information, the quality of self-supervised pre-training can be effectively improved, more effective feature representations can be extracted, and better initial network parameters are provided for downstream tasks, thereby effectively improving the performance indicators of the downstream tasks.
The method illustrated in fig. 1 is further described below with reference to fig. 2 and 3 in conjunction with the exemplary embodiment.
According to an embodiment of the disclosure, the first image sample is acquired by an image acquisition device and the first point cloud sample is acquired by a radar device.
According to an embodiment of the present disclosure, the number of images included in the first image sample is not limited herein. For example, the first image sample may include images acquired by a plurality of image acquisition devices for the same subject at the same time, each of the plurality of image acquisition devices may have a different camera view angle. For example, 6 image acquisition devices may be provided for the same subject for acquisition of image samples, and the first image sample may include images acquired by the 6 image acquisition devices at the same time. Accordingly, the first point cloud sample may also include point clouds acquired by a plurality of radar devices for the same object at the same time, and the plurality of radar devices may have different perspectives. When the first image sample includes a plurality of images, feature extraction may be performed on each image, and then the extracted plurality of features may be mapped to the same bird's-eye view space, and the plurality of features may be fused in the bird's-eye view space.
According to the embodiment of the present disclosure, the image feature may be a feature output by any layer in the image feature extraction network, such as a feature output by each middle layer of the image feature extraction network, in addition to a feature output by an output layer of the image feature extraction network. In acquiring the image feature, the following operations may be performed:
Inputting the first image sample into the image feature extraction network to obtain the output features of each of a plurality of intermediate layers included in the image feature extraction network; the image features are then determined from the output features of the plurality of intermediate layers.
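One common way to collect the output features of several intermediate layers is to register forward hooks on the backbone; the sketch below assumes a torchvision ResNet-50 backbone and is purely illustrative of the idea, not the network actually used:

```python
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)
intermediate = {}

def save_output(name):
    def hook(module, inputs, output):
        intermediate[name] = output  # output feature of this intermediate layer
    return hook

for stage in ["layer1", "layer2", "layer3", "layer4"]:
    getattr(backbone, stage).register_forward_hook(save_output(stage))

first_image_sample = torch.randn(1, 3, 256, 256)
_ = backbone(first_image_sample)
# `intermediate` now holds one multi-scale feature map per hooked stage; the image
# feature can be chosen from them, e.g. the one whose spatial size matches the
# point cloud image feature.
```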
According to embodiments of the present disclosure, image features may be determined from the respective output features of the plurality of intermediate layers according to the size of the point cloud image features to achieve alignment of the image features with the point cloud image features in the spatial dimension.
According to embodiments of the present disclosure, a plurality of target points may be determined based on the first image sample. Specifically, determining the plurality of target points from the first image sample may include the operations of:
at least one keypoint is determined from the first image sample based on shape characteristic information of the first image sample. For each keypoint, a keypoint region is determined centered around the keypoint. And determining the points contained in the key area as target points.
According to an embodiment of the present disclosure, the shape feature information of the first image sample may include corner information of an object present in the first image sample, center point information of the object, key point information of the object, and the like. Accordingly, determining at least one key point from the first image sample based on the shape feature information may consist in determining, from the first image sample, pixel points representing corners or object center points using a corner detection method, an object center point detection method, or a key point detection method. Corner detection methods may include, for example, the Moravec method, the FAST (Features from Accelerated Segment Test) method, the Harris method, and the like. Object center point detection methods may include, for example, methods implemented with CenterNet, methods implemented with the DETR (DEtection TRansformer) model, and so on. Key point detection methods may include, for example, the SIFT (Scale-Invariant Feature Transform) method, the SuperPoint method, and the like.
According to the embodiments of the present disclosure, the shape of the key region is not limited herein; for example, the key region may be a circular area with radius r centered on the key point, or a rectangular area with length m and width n centered on the key point.
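As an illustration only (the detector and the region size are assumptions, and OpenCV's FAST detector stands in for whichever corner, center point or key point method is chosen), the target points could be collected like this:

```python
import cv2
import numpy as np

def target_points_from_corners(image_bgr: np.ndarray, half_w: int = 8, half_h: int = 8):
    """Detect corner key points (FAST) and return all pixel coordinates inside a
    rectangular key region centered on each key point."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    keypoints = cv2.FastFeatureDetector_create().detect(gray)
    h, w = gray.shape
    targets = set()
    for kp in keypoints:
        cx, cy = int(round(kp.pt[0])), int(round(kp.pt[1]))
        for y in range(max(0, cy - half_h), min(h, cy + half_h + 1)):
            for x in range(max(0, cx - half_w), min(w, cx + half_w + 1)):
                targets.add((x, y))  # every point in the key region becomes a target point
    return sorted(targets)
```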
According to the embodiment of the disclosure, after obtaining the plurality of target points, a plurality of first image point features corresponding to the plurality of target points may be determined from the image features, and a plurality of first point cloud features corresponding to the plurality of target points may be determined from the point cloud image features.
According to embodiments of the present disclosure, a plurality of target points may be processed to determine a plurality of first image point features corresponding to the plurality of target points from the image features based on similar operations as when the image feature extraction network processes the first image sample to obtain the image features. For example, the image feature extraction network may be a ResNet network and the middle layer of the image feature extraction network may be a downsampling layer. Specifically, the ResNet network can be divided into a plurality of network levels, each of which corresponds to downsampling the input first image sample by a different magnification. For example, the ResNet network may be divided into 4 network levels, which may perform 2-fold, 4-fold, 8-fold, and 16-fold downsampling operations on the input first image sample, respectively. Taking the first image sample of 256×256 as an example, 128×128×4 image features can be obtained after 2 times of downsampling operation, 64×64×16 image features can be obtained after 4 times of downsampling operation, 32×32×64 image features can be obtained after 8 times of downsampling operation, and 16×16×256 image features can be obtained after 16 times of downsampling operation.
According to an embodiment of the present disclosure, specifically, determining a plurality of first image point features corresponding to a plurality of target points from the image features may include the operations of:
first coordinate information of each of the plurality of target points in the image space is determined. And based on the downsampling multiplying power of the middle layer related to the image features in the image feature extraction network, performing downsampling processing on the first coordinate information of each of the plurality of target points to obtain the second coordinate information of each of the plurality of target points in the image space. A plurality of first image point features corresponding to the plurality of target points are determined from a plurality of image point features included in the image features based on second coordinate information of each of the plurality of target points in the image space.
According to an embodiment of the present disclosure, for example, the image feature may be a 16-times-downsampled feature, and the first coordinate information of a target point a in the image space may be represented as (x, y). After 16-times downsampling of the first coordinate information of the target point a, the resulting second coordinate information can be expressed as (x/16, y/16). In some embodiments, the coordinate information obtained after downsampling may further be rounded to obtain the second coordinate information. The rounding may be rounding up, rounding down, or distance-based rounding. In distance-based rounding, the four integer end points surrounding the downsampled coordinates are determined from the integer parts of the two coordinate values, the distance between the downsampled coordinates and each of the four end points is computed, and the coordinates of the nearest end point are taken as the second coordinate information.
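A minimal sketch of this coordinate mapping, assuming a 16x downsampling rate and distance-based (nearest) rounding:

```python
import torch

def to_second_coordinates(first_coords_xy: torch.Tensor, stride: int = 16) -> torch.Tensor:
    """Map first coordinate information (pixel x, y) to second coordinate
    information on a feature map downsampled by `stride`; torch.round picks the
    nearest of the surrounding integer end points."""
    return torch.round(first_coords_xy.float() / stride).long()

# Example: target point (x, y) = (37, 130) falls at feature cell (2, 8) on a
# 16x-downsampled image feature; that cell gives the first image point feature.
second_coords = to_second_coordinates(torch.tensor([[37, 130]]))
```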
According to an embodiment of the present disclosure, one first image point feature may have a plurality of corresponding target points, specifically, after downsampling coordinate information of the plurality of target points, second coordinate information obtained by each of the plurality of target points may be the same, and each first image point feature may uniquely correspond to the second coordinate information, that is, the plurality of target points may correspond to the same first image point feature.
According to the embodiments of the present disclosure, similar to the method of determining the plurality of first image point features corresponding to the plurality of target points from the image features, when determining the plurality of first point cloud features corresponding to the plurality of target points from the point cloud features, third coordinate information of the plurality of target points in the point cloud space may be determined based on the first coordinate information of the plurality of target points in the image space. The fourth coordinate information of each of the plurality of target points may be determined based on the third coordinate information of each of the plurality of target points in a similar manner as when the point cloud feature extraction network processes the first point cloud sample. Based on fourth coordinate information of each of the plurality of target points, a plurality of first point cloud features corresponding to the plurality of target points may be determined from a plurality of point cloud features included in the point cloud image feature.
According to an embodiment of the present disclosure, as an optional implementation, after the plurality of first image point features are determined, the second coordinate information of the plurality of target points in the image space may be mapped into the point cloud space using the mapping relationship between the two spaces, so as to determine the plurality of first point cloud features corresponding to the plurality of target points from the point cloud features. Specifically, the second coordinate information of each of the plurality of target points in the image space may be mapped to the point cloud space based on a first camera intrinsic-extrinsic parameter matrix related to the first image sample and a second camera intrinsic-extrinsic parameter matrix related to the first point cloud sample, so as to obtain third coordinate information of each of the plurality of target points in the point cloud space. The plurality of first point cloud features corresponding to the plurality of target points are then determined from the plurality of point cloud features included in the point cloud image features based on the third coordinate information of each of the plurality of target points in the point cloud space.
According to embodiments of the present disclosure, the first camera intrinsic-extrinsic parameter matrix may be associated with the image acquisition device that acquires the first image sample. It may consist of the camera intrinsics and the camera extrinsics. The camera intrinsics describe inherent properties of the image acquisition device itself, including parameters such as the focal length and the pixel spacing, and determine the shape and size of the two-dimensional image that the device captures from a three-dimensional scene. The camera extrinsics describe the position and orientation of the image acquisition device in the three-dimensional scene and may be represented by a rotation matrix and a translation vector. Similarly, the second camera intrinsic-extrinsic parameter matrix may be associated with the radar device that acquires the first point cloud sample.
According to the embodiment of the disclosure, the first camera intrinsic-extrinsic parameter matrix can be used to convert the second coordinate information of each of the plurality of target points in the image space into reference coordinate information in a reference space, and the second camera intrinsic-extrinsic parameter matrix can then be used to convert the reference coordinate information of the plurality of target points in the reference space into the third coordinate information in the point cloud space.
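A simplified homogeneous-coordinate sketch of this mapping (the depth per target point, the 3x3 intrinsic matrix and the 4x4 extrinsic matrices to a shared reference space are all assumed inputs; the real calibration format may differ):

```python
import numpy as np

def image_to_point_cloud(uv: np.ndarray, depth: np.ndarray,
                         K_cam: np.ndarray, T_cam_to_ref: np.ndarray,
                         T_lidar_to_ref: np.ndarray) -> np.ndarray:
    """Illustrative mapping of pixel coordinates to point cloud (lidar) coordinates.
    uv: (N, 2) pixel coordinates, depth: (N,) depth per target point,
    K_cam: 3x3 camera intrinsics, T_*_to_ref: 4x4 extrinsics to a shared reference space."""
    ones = np.ones((uv.shape[0], 1))
    pix = np.concatenate([uv, ones], axis=1)                 # homogeneous pixel coordinates
    cam_pts = (np.linalg.inv(K_cam) @ pix.T) * depth         # back-project to the camera frame
    cam_h = np.vstack([cam_pts, np.ones((1, uv.shape[0]))])  # 4 x N homogeneous points
    ref_pts = T_cam_to_ref @ cam_h                           # camera frame -> reference space
    lidar_pts = np.linalg.inv(T_lidar_to_ref) @ ref_pts      # reference space -> point cloud space
    return lidar_pts[:3].T                                   # (N, 3) third coordinate information
```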
According to the embodiment of the disclosure, the matching precision of the feature points of different modalities can be effectively improved by determining the plurality of first image point features and the plurality of first point cloud features corresponding to the plurality of target points by using the mapping of the coordinate information.
According to embodiments of the present disclosure, the first image point feature of the image space may be represented as a 2-dimensional image feature, the dimensions of which are the height dimension and the width dimension, respectively. The second image point feature in the aerial view obtained by mapping can be represented as a 2-dimensional image feature, and the dimensions thereof are a depth dimension and a width dimension respectively. In mapping the first image point feature to the aerial view space, obtaining the plurality of second image point features may include a dimension up-scaling process that adds a depth dimension to the first image point feature, and a dimension down-scaling process that compresses the dimension up-scaled image point feature in a height dimension. Specifically, mapping a plurality of first image point features corresponding to a plurality of target points in the image features to the bird's eye view space, obtaining a plurality of second image point features may include the following operations:
Determining respective depth distribution information of a plurality of target points by using a depth estimation model; and mapping the first image point features corresponding to the target points to a bird's eye view space based on the depth distribution information of the target points, so as to obtain the second image point features.
According to embodiments of the present disclosure, the depth estimation model may be any monocular depth estimation network model, such as DORN (Deep Ordinal Regression Network), which is not limited herein.
According to embodiments of the present disclosure, the depth estimation model may be a model that has been trained on a common data set. Alternatively, the depth coordinates in the point cloud sample may be used as a label of the training sample, and the image sample may be used as the training sample to train the first initial model to obtain the depth estimation model. The point cloud sample may be a second point cloud sample different from the first point cloud sample and the image sample may be a second image sample different from the first image sample. The second point cloud sample and the second image sample are acquired at the same time aiming at the same object. The first initial model may be trained with the second image sample as a training sample and depth information included in the second point cloud sample as a tag to obtain a depth estimation model. The training process of the depth estimation model is not described in detail herein.
According to an embodiment of the present disclosure, determining depth distribution information for each of a plurality of target points using a depth estimation model may include the operations of:
and inputting the first image sample into a depth estimation model to obtain depth distribution characteristics. Depth distribution information for each of a plurality of target points is determined from the depth distribution features.
According to embodiments of the present disclosure, a non-linear discretization method may be employed to divide a continuous depth range into a plurality of depth intervals and to treat each depth interval as one category. The interval sizes of the depth intervals may differ. Specifically, the depth intervals can be divided according to how densely or sparsely the points of the point cloud sample fall in different depth ranges. For example, if the length of the acquisition scene corresponding to the point cloud sample is 70 meters, the depth value of each point in the point cloud sample lies between 0 and 70 meters. The points in the point cloud sample tend to be sparser at the two ends of the acquisition scene and denser near its center, so when dividing the depth intervals, the intervals at the two ends can be made larger and the intervals in the middle smaller. For example, the depth range from 0 to 70 may be divided into the following depth intervals: (0, 20], (20, 30], (30, 35], (35, 38], (38, 41], (41, 47], (47, 55], (55, 70]. Each depth interval corresponds to one depth category, i.e., the depth category to which each point in the point cloud sample belongs can be determined from the divided depth intervals.
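A small sketch of this non-uniform discretization, using the interval edges from the example above (the handling of boundary values is an implementation choice):

```python
import numpy as np

# Assumed non-uniform depth interval edges in meters, denser where points are denser.
DEPTH_EDGES = np.array([0, 20, 30, 35, 38, 41, 47, 55, 70], dtype=np.float32)

def depth_to_category(depth: np.ndarray) -> np.ndarray:
    """Map continuous depth values to discrete depth categories, one per interval.
    searchsorted with side='left' puts a depth d in bin i when
    DEPTH_EDGES[i] < d <= DEPTH_EDGES[i + 1], matching the (a, b] intervals above."""
    return np.clip(np.searchsorted(DEPTH_EDGES, depth, side="left") - 1,
                   0, len(DEPTH_EDGES) - 2)
```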
According to an embodiment of the present disclosure, the depth distribution feature may include depth distribution information for each of a plurality of pixel points in the first image sample. The depth distribution information may be represented as a probability distribution of the pixel point over a plurality of depth categories, i.e. a probability distribution that the depth of the pixel point belongs to a plurality of depth categories.
According to embodiments of the present disclosure, depth distribution information for each of a plurality of target points may be determined from the depth distribution features. Specifically, depth distribution information of each of the plurality of target points may be determined from depth distribution information of each of the plurality of image point features included in the depth distribution feature based on first coordinate information of each of the plurality of target points in the image space.
According to an embodiment of the present disclosure, the depth distribution feature may overlap with the first image sample in the image space, i.e., each pixel point in the first image sample may have corresponding depth distribution information in the depth distribution feature. Therefore, after the pixel points corresponding to the plurality of target points are determined, the respective depth distribution information of the plurality of target points can be determined.
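As an illustrative sketch of this lookup, assuming a per-pixel probability map over depth categories; the tensor shapes and coordinate convention are assumptions, not taken from the disclosure:

```python
# Hedged sketch: gathering per-target-point depth distributions from a dense
# depth distribution feature by indexing it at the target-point pixel locations.
import torch

num_classes, H, W = 8, 64, 128
depth_dist = torch.softmax(torch.randn(num_classes, H, W), dim=0)  # per-pixel distribution

# First coordinate information of the target points in image space, as (u, v) = (col, row)
target_uv = torch.tensor([[10, 5], [100, 60], [64, 32]])  # (N, 2)

# Index the distribution map at the target-point pixel locations
target_depth_dist = depth_dist[:, target_uv[:, 1], target_uv[:, 0]].T  # (N, num_classes)
print(target_depth_dist.shape)  # torch.Size([3, 8])
```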
According to the embodiment of the disclosure, after the depth distribution information of the target points is calculated, it may be aggregated so as to project the first image point features in the image space to the second image point features in the bird's eye view space. Specifically, mapping the plurality of first image point features corresponding to the plurality of target points to the bird's eye view space based on the depth distribution information of each of the plurality of target points to obtain the plurality of second image point features may include the following operations:
And fusing the first image point features corresponding to the target points and the depth distribution information of the target points to obtain pseudo point cloud features. And mapping the multiple pseudo point cloud features in the image space to the aerial view space by using a first camera inner and outer parameter matrix related to the first image sample to obtain multiple third image point features. And fusing the third image point features with the same height value in the plurality of third image point features to obtain a plurality of second image point features.
According to the embodiment of the disclosure, an LSS (Lift, Splat, Shoot) algorithm may be used to convert the depth distribution information into depth values and combine the depth value of a target point with the first image point feature of that target point in the image space to obtain a pseudo point cloud feature. This realizes the conversion of the target point from the image space to the pseudo point cloud space, completing a lifting process that adds a depth dimension to the first image point feature.
According to embodiments of the present disclosure, after obtaining the pseudo-point cloud features, the pseudo-point cloud features may be projected into the aerial view space. Specifically, the pseudo point cloud feature may be subjected to rotation, translation, and other processes by using a first camera internal-external parameter matrix related to the first image sample, so as to eliminate errors existing during spatial transformation, thereby obtaining a third image point feature in the aerial view space.
According to the embodiment of the disclosure, a plurality of third image point features may be located at the same height of the aerial view space, so that the third image point features at the same height of the aerial view space can be fused to obtain the second image point feature at that height.
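The lift-project-pool chain described above can be sketched as follows; the depth-bin centers, the camera matrices, the grid layout and the use of summation for fusion are all illustrative assumptions rather than the disclosed implementation:

```python
# Minimal sketch (not the patent's exact implementation) of lifting first image
# point features with depth distributions and pooling them into a BEV grid.
import torch

N, C, D = 5, 16, 8                        # target points, channels, depth bins
feat = torch.randn(N, C)                  # first image point features
depth_dist = torch.softmax(torch.randn(N, D), dim=1)  # depth distribution per target point

# Lift: the outer product adds a depth dimension to each image point feature
pseudo_pc_feat = depth_dist.unsqueeze(-1) * feat.unsqueeze(1)   # (N, D, C)

# Back-project each (pixel, depth-bin) pair to 3D using assumed intrinsics/extrinsics
bin_centers = torch.linspace(2.0, 68.0, D)                       # assumed depth-bin centers (m)
uv = torch.randint(0, 100, (N, 2)).float()                       # target-point pixel coordinates
K_inv = torch.inverse(torch.tensor([[500., 0., 50.], [0., 500., 50.], [0., 0., 1.]]))
cam2world = torch.eye(4)                                         # assumed extrinsic matrix

uv1 = torch.cat([uv, torch.ones(N, 1)], dim=1)                   # homogeneous pixel coordinates
rays = uv1 @ K_inv.T                                             # (N, 3) camera rays
pts = rays.unsqueeze(1) * bin_centers.view(1, D, 1)              # (N, D, 3) camera-space points
pts = torch.cat([pts, torch.ones(N, D, 1)], dim=2) @ cam2world.T # to the BEV frame
pts = pts[..., :3]

# Splat: accumulate (here, sum) all lifted features that fall into the same BEV
# cell, which collapses the height dimension
grid, cell = 10, 8.0                                              # BEV grid size and cell size (m)
ix = ((pts[..., 0] + 40.0) / cell).long().clamp(0, grid - 1)
iy = ((pts[..., 2]) / cell).long().clamp(0, grid - 1)
bev = torch.zeros(grid, grid, C)
bev.index_put_((iy.reshape(-1), ix.reshape(-1)), pseudo_pc_feat.reshape(-1, C), accumulate=True)
print(bev.shape)  # torch.Size([10, 10, 16])
```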
According to the embodiment of the disclosure, since the first point cloud feature in the point cloud space may be represented as a 3-dimensional feature point, that is, a feature point including a height feature, a width feature, and a depth feature, projecting the first point cloud feature in the point cloud space into the bird's eye view space may be a dimension reduction process of compressing in a height dimension.
According to an embodiment of the present disclosure, mapping a plurality of first point cloud features corresponding to a plurality of target points in the point cloud image features to a bird's eye view space, obtaining a plurality of second point cloud features may include the following operations:
and mapping the plurality of first point cloud features in the point cloud space to the aerial view space by using a second camera inner and outer parameter matrix related to the first point cloud sample to obtain a plurality of third point cloud features. And fusing the third point cloud features with the same height value in the plurality of third point cloud features to obtain a plurality of second point cloud features.
According to the embodiment of the disclosure, similar to the processing procedure of the pseudo point cloud features, rotation, translation and other processing can be performed on the first point cloud features by using the second camera inner and outer parameter matrix related to the first point cloud sample, so as to eliminate errors existing in the spatial transformation, thereby obtaining the third point cloud features in the aerial view space.
According to the embodiment of the disclosure, similar to the processing procedure of the third image point feature, a plurality of third point cloud features may be present at the same height of the aerial view space, and thus, the plurality of third point cloud features at the same height of the aerial view space may be fused to obtain the second point cloud feature at the height.
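A compact sketch of fusing point cloud features that fall into the same aerial-view cell; the mean pooling, grid layout and coordinate handling are assumptions for illustration only:

```python
# Hedged sketch: fusing third point cloud features that land in the same BEV cell
# (i.e., collapsing the height dimension by averaging).
import torch

M, C, grid = 6, 16, 10
xyz_bev = torch.rand(M, 3) * 80.0          # point coordinates already in the BEV frame (m)
pc_feat = torch.randn(M, C)                # third point cloud features

ix = (xyz_bev[:, 0] / 8.0).long().clamp(0, grid - 1)
iy = (xyz_bev[:, 1] / 8.0).long().clamp(0, grid - 1)

bev = torch.zeros(grid, grid, C)
count = torch.zeros(grid, grid, 1)
bev.index_put_((iy, ix), pc_feat, accumulate=True)
count.index_put_((iy, ix), torch.ones(M, 1), accumulate=True)
bev = bev / count.clamp(min=1.0)           # mean over all points falling in a cell
print(bev.shape)                           # torch.Size([10, 10, 16])
```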
According to the embodiment of the disclosure, after the projection of the first image point feature and the first point cloud feature in the aerial view space is completed, the feature alignment of the point cloud and the image multi-mode data in the aerial view space is completed. The aligned features can then be paired for use in self-supervised contrast learning of the model.
According to an embodiment of the present disclosure, using the plurality of second image point features and the plurality of second point cloud features, performing contrast training on the image feature extraction network and the point cloud feature extraction network may include the operations of:
A plurality of pairs of samples are generated based on the plurality of second image point features and the plurality of second point cloud features. The plurality of pairs of samples are divided into a plurality of positive pairs of samples and a plurality of negative pairs of samples based on target points corresponding to each of the second image point feature and the second point cloud feature included in the pairs of samples. The loss value is calculated based on the plurality of positive sample pairs and the plurality of negative sample pairs. And respectively adjusting the model parameters of the image feature extraction network and the model parameters of the point cloud feature extraction network by using the loss values.
According to embodiments of the present disclosure, a sample pair may include one second image point feature and one second point cloud feature. One second image point feature can be selected randomly from the plurality of second image point features, and one second point cloud feature can be selected randomly from the plurality of second point cloud features, so that a sample pair can be formed.
According to embodiments of the present disclosure, for each sample pair, whether the sample pair is positive or negative may be determined based on whether the second image point feature and the second point cloud feature of the sample pair correspond to the same target point. Specifically, for each sample pair, in the case where the target point corresponding to the second image point feature included in the sample pair coincides with the target point corresponding to the second point cloud feature included in the sample pair, the sample pair is determined to be a positive sample pair. In the case where the target point corresponding to the second image point feature included in the sample pair does not coincide with the target point corresponding to the second point cloud feature included in the sample pair, the sample pair is determined to be a negative sample pair.
According to an embodiment of the present disclosure, for example, for a target point i, there is a second image point feature f_i ∈ F and a second point cloud feature p_i ∈ P corresponding to the target point i in the bird's eye view space, where F may represent the set of second image point features and P may represent the set of second point cloud features. Accordingly, for a target point j, which may be a target point other than the target point i, there is a second image point feature f_j ∈ F and a second point cloud feature p_j ∈ P corresponding to the target point j in the bird's eye view space. Based on the target point i and the target point j, 4 sample pairs can be generated, namely (f_i, p_i), (f_i, p_j), (f_j, p_i) and (f_j, p_j). The 4 sample pairs may be divided into positive sample pairs (f_i, p_i) and (f_j, p_j), and negative sample pairs (f_i, p_j) and (f_j, p_i).
According to embodiments of the present disclosure, infoNCE may be employed as a loss function to accomplish spatial consistency constraints of point cloud and image features by maximizing the second image point feature and the second point cloud feature in a positive sample pair, as shown in equation (1):
In formula (1), L may represent the calculated loss value, exp(x) may represent an exponential operation with the natural constant e as the base and x as the exponent, and d(y, z) may be used to describe the similarity between y and z, which may be measured using the Euclidean distance, the cosine distance, etc.
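An illustrative InfoNCE-style computation over the aligned features, with positive pairs on matching target points and the remaining pairings acting as negatives; the cosine similarity and the temperature value are assumptions rather than the exact formulation of equation (1):

```python
# Illustrative sketch of an InfoNCE-style loss over aligned BEV features.
import torch
import torch.nn.functional as F

N, C = 4, 16
img_feats = torch.randn(N, C, requires_grad=True)   # second image point features f_i
pc_feats = torch.randn(N, C, requires_grad=True)    # second point cloud features p_i
tau = 0.07                                           # assumed temperature

# Pairwise similarity between every image feature and every point cloud feature
sim = F.cosine_similarity(img_feats.unsqueeze(1), pc_feats.unsqueeze(0), dim=-1)  # (N, N)
labels = torch.arange(N)                             # positives lie on the diagonal
loss = F.cross_entropy(sim / tau, labels)            # InfoNCE over each row
loss.backward()                                      # gradients for both feature extractors
print(loss.item())
```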
According to the embodiment of the disclosure, under the bird's eye view perspective, image key regions are used as matching points in the image and point cloud data instead of random or exhaustive matching points, so that a more robust multi-modal contrastive learning scheme is built. This establishes an internal relation between the point cloud features and the image features, realizes migration and complementation between the point cloud and image multi-modal information, and improves the feature expression capability of deep learning models used in single-modal or multi-modal settings. A model pre-trained in this way can obtain better performance after being fine-tuned with a small amount of data in a downstream task scenario.
Fig. 2 schematically illustrates a schematic diagram of a model pre-training method according to an embodiment of the present disclosure.
As shown in fig. 2, during the contrast learning pre-training process of the image feature extraction network 201 and the point cloud feature extraction network 202, the training sample set may include a plurality of training sample pairs, each of which may include a first image sample 203 and a first point cloud sample 204. The first image sample 203 and the first point cloud sample 204 in the training sample pair may be aligned in the time dimension.
According to embodiments of the present disclosure, the first image sample 203 may be input into an image feature extraction network 201 to obtain image features 205. The first image sample 203 may be processed using a target point extraction method to determine a plurality of image target points 206 from a plurality of pixel points included in the first image sample 203. The image features 205 may then be filtered based on the plurality of image target points 206 to obtain a plurality of first image point features 207. The first image sample 203 may be processed using a depth estimation network 208, and depth distribution features 209 may be derived that represent depth information for individual pixels in the first image sample 203. The depth distribution feature 209 is filtered based on the plurality of image target points 206, and depth distribution information 210 of each of the plurality of image target points 206 may be obtained. For each image target point 206, the depth distribution information 210 is fused with the first image point feature 207 using the LSS algorithm, resulting in a pseudo-point cloud feature 211 of that image target point 206 in image space. The plurality of pseudo point cloud features 211 are mapped to the bird's eye view space, and a plurality of second image point features 212 can be obtained.
According to embodiments of the present disclosure, the first point cloud sample 204 may be input into the point cloud feature extraction network 202 to obtain point cloud image features 213. The plurality of image target points 206 determined in the image space may be mapped into the point cloud space based on the camera inside-outside parameter matrix, and points in the point cloud space having the same height and width dimensions may be regarded as points to which the image target points 206 are mapped to determine a plurality of point cloud target points 214 present in the first point cloud sample 204. The point cloud image features 213 may be filtered using a plurality of point cloud target points 214 to obtain a plurality of first point cloud features 215. The plurality of first point cloud features 215 are mapped into the aerial view space, and a plurality of second point cloud features 216 may be obtained.
According to embodiments of the present disclosure, the plurality of second image point features 212 and the plurality of second point cloud features 216 may be paired in combination into a plurality of positive sample pairs and a plurality of negative sample pairs. The loss value 217 may be calculated based on the plurality of positive sample pairs and the plurality of negative sample pairs using InfoNCE as the loss function. The loss value 217 may be used to adjust the model parameters of the image feature extraction network 201 and the point cloud feature extraction network 202.
According to an embodiment of the present disclosure, as an alternative implementation manner, a target point may be determined from a first point cloud sample, and based on the target point, the filtering of the image point features and the point cloud features may be implemented.
Fig. 3 schematically illustrates a schematic diagram of a model pre-training method according to another embodiment of the present disclosure.
As shown in fig. 3, a first point cloud sample 304 may be input to a point cloud feature extraction network 302 to obtain point cloud image features 313. The first point cloud sample 304 may be processed using a point extraction method to obtain a plurality of point cloud target points 314.
According to an embodiment of the present disclosure, when the target point extraction method is applied to the first point cloud sample 304, key points may be extracted from the first point cloud sample 304 by using a key point extraction method. The key point extraction method may include, for example, the ISS (Intrinsic Shape Signatures) method, the SIFT method, and the like, and is not limited herein. After a key point is determined, a 3-dimensional region centered on the key point may be determined, and the points included within the 3-dimensional region are the desired point cloud target points 314. The 3-dimensional region may be a sphere region, a cube region, etc., and is not limited herein.
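A brief sketch of selecting point cloud target points around already-extracted key points; the radius and the random data are illustrative assumptions, and the key point detector itself (e.g. ISS) is not shown:

```python
# Hedged sketch: take every point inside a sphere of an assumed radius around
# each key point as a point cloud target point.
import numpy as np

points = np.random.rand(1000, 3) * 70.0      # first point cloud sample (x, y, z)
keypoints = points[np.random.choice(len(points), 5, replace=False)]  # stand-in key points
radius = 2.0                                  # assumed size of the 3-D key region (m)

dists = np.linalg.norm(points[:, None, :] - keypoints[None, :, :], axis=-1)  # (1000, 5)
target_mask = (dists <= radius).any(axis=1)
point_cloud_target_points = points[target_mask]
print(point_cloud_target_points.shape)
```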
According to embodiments of the present disclosure, the point cloud image features 313 may be filtered based on the plurality of point cloud target points 314 to obtain a plurality of first point cloud features 315. The plurality of first point cloud features 315 are mapped into the aerial view space, and a plurality of second point cloud features 316 may be obtained.
According to embodiments of the present disclosure, the first image sample 303 may be input into an image feature extraction network 301 to obtain image features 305. The plurality of point cloud target points 314 determined in the point cloud space may be mapped into the image space based on the camera inside and outside parameter matrix, and the mapped points having the same depth may be compressed to determine the plurality of image target points 306 present in the first image sample 303. Image features 305 may then be filtered based on the plurality of image target points 306 to obtain a plurality of first image point features 307. The first image sample 303 may be processed using a depth estimation network 308, and depth distribution features 309 may be derived that represent depth information for individual pixels in the first image sample 303. The depth distribution feature 309 is filtered based on the plurality of image target points 306, and depth distribution information 310 of each of the plurality of image target points 306 may be obtained. For each image target point 306, the depth distribution information 310 is fused with the first image point feature 307 using the LSS algorithm, resulting in a pseudo-point cloud feature 311 of that image target point 306 in image space. And mapping the plurality of pseudo point cloud features 311 to the aerial view space, a plurality of second image point features 312 can be obtained.
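A sketch of projecting point cloud target points into the image space and collapsing points that land on the same pixel; the camera intrinsic and extrinsic matrices and the point coordinates are assumed for illustration:

```python
# Hedged sketch: project point cloud target points to pixel coordinates, then
# keep one image target point per pixel (points differing only in depth collapse).
import numpy as np

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])             # assumed camera intrinsic matrix
T_lidar_to_cam = np.eye(4)                   # assumed camera extrinsic matrix

pts = np.random.rand(50, 3) * np.array([20.0, 10.0, 60.0]) + np.array([-10.0, -5.0, 5.0])
pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
cam = (pts_h @ T_lidar_to_cam.T)[:, :3]      # points in the camera frame
uvw = cam @ K.T
uv = (uvw[:, :2] / uvw[:, 2:3]).astype(int)  # pixel coordinates of the projections

image_target_points = np.unique(uv, axis=0)  # compress points sharing the same pixel
print(image_target_points.shape)
```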
According to embodiments of the present disclosure, the plurality of second image point features 312 and the plurality of second point cloud features 316 may be paired in combination into a plurality of positive sample pairs and a plurality of negative sample pairs. The loss value 317 may be calculated based on the plurality of positive sample pairs and the plurality of negative sample pairs using InfoNCE as the loss function. The loss value 317 may be used to adjust the model parameters of the image feature extraction network 301 and the point cloud feature extraction network 302.
According to the embodiment of the disclosure, after the pre-training of the image feature extraction network and the point cloud feature extraction network is completed, the image feature extraction network and/or the point cloud feature extraction network can be used as a backbone network to construct a model to be trained according to specific requirements of downstream tasks. Then, the model to be trained can be retrained by using the corresponding training sample, so that a model suitable for processing the downstream task is obtained.
According to embodiments of the present disclosure, the downstream tasks may include image data processing tasks such as image recognition, image classification, object detection tasks, and the like. Accordingly, the model required for this downstream task may use the image feature extraction network as a backbone network.
Fig. 4A schematically illustrates a flow chart of a model training method according to an embodiment of the present disclosure.
As shown in fig. 4A, the method 400A may include operations S410-S420.
In operation S410, the third image sample is input to the image processing model to obtain a first output feature.
In operation S420, a first penalty is obtained based on the labels and the first output features of the third image samples, and an image processing model is trained based on the first penalty.
According to embodiments of the present disclosure, an image processing model may include a pre-trained image feature extraction network and an image processing network. The image processing network may be a network associated with a particular downstream task, e.g., the downstream task is an image classification task, then the image processing network may be an image classification network. The pre-trained image feature extraction network may be trained using the model pre-training method described above, and is not described in detail herein.
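A minimal sketch of such an image processing model, assuming a generic pre-trained backbone; the module names, feature dimensions and classification head are placeholders, not the disclosed implementation:

```python
# Hedged sketch: pre-trained image feature extraction network + task-specific head.
import torch
import torch.nn as nn

class ImageProcessingModel(nn.Module):
    def __init__(self, backbone: nn.Module, num_classes: int):
        super().__init__()
        self.backbone = backbone                # pre-trained image feature extraction network
        self.head = nn.Linear(64, num_classes)  # image processing network (here: a classifier)

    def forward(self, x):
        return self.head(self.backbone(x))      # first output feature

# Stand-in backbone producing a 64-dimensional feature per image
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.AdaptiveAvgPool2d(1), nn.Flatten())
model = ImageProcessingModel(backbone, num_classes=10)

images = torch.randn(8, 3, 224, 224)                         # third image samples
labels = torch.randint(0, 10, (8,))                          # labels of the third image samples
loss = nn.functional.cross_entropy(model(images), labels)    # first loss
loss.backward()                                              # train the image processing model
```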
Fig. 4B schematically illustrates a flowchart of an object processing method according to an embodiment of the present disclosure.
As shown in fig. 4B, the method 400B may include operation S430.
In operation S430, an image to be processed including a target object is input into an image processing model, resulting in an image processing result.
According to embodiments of the present disclosure, the image processing model may be trained using a model training method as described in method 400A.
According to an embodiment of the present disclosure, for example, if the downstream task is an image classification task, the target object may be an object to be classified in the image to be processed, and the obtained image processing result may be represented as a classification prediction result of the object to be classified. For another example, if the downstream task is a target detection task, the target object may be an object to be detected in the image to be processed, and the obtained image processing result may include a position of a detection frame for the object to be detected, a reliability of the detection frame, and a category of the object to be detected.
According to embodiments of the present disclosure, the downstream tasks may include point cloud data processing tasks such as example segmentation, semantic segmentation, object detection, and the like. Accordingly, the model required for this downstream task may use the point cloud feature extraction network as the backbone network.
Fig. 5A schematically illustrates a flow chart of a model training method according to another embodiment of the present disclosure.
As shown in fig. 5A, the method 500A may include operations S510-S520.
In operation S510, the third point cloud sample is input to the point cloud data processing model to obtain a second output characteristic.
In operation S520, a second loss is obtained based on the label and the second output characteristic of the third point cloud sample, and a point cloud data processing model is trained based on the second loss.
According to embodiments of the present disclosure, a point cloud data processing model may include a pre-trained point cloud feature extraction network and a point cloud data processing network. The point cloud data processing network may be a network related to a specific downstream task, for example, the downstream task is a semantic segmentation task, and then the point cloud data processing network may be a point cloud semantic segmentation network. The pre-trained point cloud feature extraction network can be obtained by training the model pre-training method, and the description is omitted here.
Fig. 5B schematically illustrates a flow chart of an object processing method according to another embodiment of the present disclosure.
As shown in fig. 5B, the method 500B may include operation S530.
In operation S530, the point cloud to be processed including the target object is input into the point cloud data processing model, and the point cloud data processing result is obtained.
According to an embodiment of the present disclosure, the point cloud data processing model may be trained using a model training method as described in method 500A.
According to the embodiment of the disclosure, for example, if the downstream task is a semantic segmentation task, the target object may be point cloud data in the point cloud to be processed, and the obtained point cloud data processing result may be represented as a semantic segmentation result of the point cloud to be processed. For another example, if the downstream task is a target detection task, the target object may be a point cloud object or a voxel object to be detected in the point cloud to be processed, and the obtained point cloud data processing result may include the position of a detection frame for the point cloud object or voxel object to be detected, the reliability of the detection frame, and the class of the point cloud object or voxel object to be detected.
According to the embodiment of the disclosure, the downstream tasks can also be various tasks such as three-dimensional target detection, three-dimensional target segmentation and the like, and the downstream tasks can use multi-mode data to train a model. Accordingly, the model required for this downstream task may use the image feature extraction network and the point cloud feature extraction network as backbone networks.
Fig. 6A schematically illustrates a flow chart of a model training method according to yet another embodiment of the present disclosure.
As shown in fig. 6A, the method 600A may include operations S610-S650.
In operation S610, a fourth image sample included in the training sample is input into the first backbone network of the object processing model, resulting in a third output feature.
In operation S620, a fourth point cloud sample included in the training sample is input to the second backbone network of the object processing model, and a fourth output feature is obtained.
In operation S630, feature fusion is performed on the third output feature and the fourth output feature, and a fusion feature is obtained.
In operation S640, the fusion feature is input to the object processing network of the object processing model, resulting in a fifth output feature.
In operation S650, a third penalty is obtained based on the labels of the training samples and the fifth output features, and the object processing model is trained based on the third penalty.
According to embodiments of the present disclosure, an object processing model may include a first backbone network, a second backbone network, and an object processing network. The first backbone network may be a pre-trained image feature extraction network, and the second backbone network may be a pre-trained point cloud feature extraction network, and the pre-trained image feature extraction network and the pre-trained point cloud feature extraction network may be obtained by training using the model pre-training method as described above, which is not described herein. The object handling network may be a network adapted to handle multimodal fusion features and may be associated with specific downstream tasks. For example, the downstream task may be a three-dimensional object detection task, and the object processing network may be a three-dimensional object detection network.
According to the embodiment of the present disclosure, the object processing network may also be a network with multiple inputs and one output, and feature fusion of the third output feature and the fourth output feature may also be performed in an intermediate layer of the object processing network, which is not described herein.
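A hedged sketch of the object processing model described above, using simple concatenation as the feature fusion step; all module shapes and the fusion operator are illustrative assumptions:

```python
# Hedged sketch: two pre-trained backbones, feature fusion, and a task head.
import torch
import torch.nn as nn

class PointBackbone(nn.Module):
    """Stand-in point cloud feature extraction network (second backbone)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Linear(3, 32)
    def forward(self, pts):                     # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values  # simple global max pooling -> (B, 32)

class ObjectProcessingModel(nn.Module):
    def __init__(self, image_backbone, point_backbone, head):
        super().__init__()
        self.image_backbone = image_backbone    # first backbone: image feature extraction network
        self.point_backbone = point_backbone    # second backbone: point cloud feature extraction network
        self.head = head                        # object processing network

    def forward(self, image, point_cloud):
        third_out = self.image_backbone(image)          # third output feature
        fourth_out = self.point_backbone(point_cloud)   # fourth output feature
        fused = torch.cat([third_out, fourth_out], dim=-1)  # fusion feature
        return self.head(fused)                         # fifth output feature

model = ObjectProcessingModel(
    image_backbone=nn.Sequential(nn.Conv2d(3, 32, 3, padding=1),
                                 nn.AdaptiveAvgPool2d(1), nn.Flatten()),
    point_backbone=PointBackbone(),
    head=nn.Linear(64, 7),                      # assumed 7-way downstream head
)
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 1024, 3))
print(out.shape)  # torch.Size([2, 7])
```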
Fig. 6B schematically illustrates a flowchart of an object processing method according to a further embodiment of the present disclosure.
As shown in fig. 6B, the method 600B may include operation S660.
In operation S660, the image to be processed including the target object and the point cloud to be processed are input into the object processing model, and the object processing result is obtained.
According to embodiments of the present disclosure, an object processing model may be trained using a model training method as described in method 600A.
Fig. 7 schematically illustrates a block diagram of a model pre-training apparatus according to an embodiment of the disclosure.
As shown in fig. 7, the model pre-training apparatus 700 includes a first input module 710, a second input module 720, a first determination module 730, a first mapping module 740, a second mapping module 750, and a first training module 760.
The first input module 710 is configured to input the first image sample into the image feature extraction network to obtain an image feature.
The second input module 720 is configured to input a first point cloud sample into the point cloud feature extraction network to obtain a point cloud image feature, where the first point cloud sample and the first image sample are acquired at the same time for the same object.
A first determining module 730 is configured to determine a plurality of target points from the first image sample.
The first mapping module 740 is configured to map a plurality of first image point features corresponding to a plurality of target points in the image features to a bird's eye view space, so as to obtain a plurality of second image point features.
The second mapping module 750 is configured to map a plurality of first point cloud features corresponding to a plurality of target points in the point cloud image features to a bird's eye view space, so as to obtain a plurality of second point cloud features.
The first training module 760 is configured to perform contrast training on the image feature extraction network and the point cloud feature extraction network by using the plurality of second image point features and the plurality of second point cloud features.
According to an embodiment of the present disclosure, the first determining module 730 includes a first determining unit, a second determining unit, and a third determining unit.
And a first determining unit configured to determine at least one key point from the first image sample based on the shape feature information of the first image sample.
And a second determination unit configured to determine, for each of the key points, a key region centering on the key point.
And a third determination unit configured to determine a point included in the critical area as a target point.
According to an embodiment of the present disclosure, the first mapping module 740 includes a first mapping unit and a second mapping unit.
And the first mapping unit is used for determining the depth distribution information of each of the plurality of target points by using the depth estimation model.
And the second mapping unit is used for mapping the first image point features corresponding to the target points to the aerial view space based on the depth distribution information of the target points, so as to obtain the second image point features.
According to an embodiment of the present disclosure, the first mapping unit comprises a first mapping subunit and a second mapping subunit.
And the first mapping subunit is used for inputting the first image sample into the depth estimation model to obtain the depth distribution characteristic.
And a second mapping subunit, configured to determine depth distribution information of each of the plurality of target points from the depth distribution features.
According to an embodiment of the present disclosure, the model pre-training apparatus 700 further comprises a fifth training module.
And the fifth training module is used for training the first initial model by taking the second image sample as a training sample and taking depth information included in the second point cloud sample as a label so as to obtain a depth estimation model, wherein the second point cloud sample and the second image sample are acquired at the same time aiming at the same object.
According to an embodiment of the present disclosure, the second mapping unit includes a third mapping subunit, a fourth mapping subunit, and a fifth mapping subunit.
And the third mapping subunit is used for fusing the first image point features corresponding to the target points and the depth distribution information of the target points to obtain the pseudo point cloud features.
And the fourth mapping subunit is used for mapping the multiple pseudo point cloud features in the image space to the aerial view space by using the first camera internal and external parameter matrix related to the first image sample to obtain multiple third image point features.
And the fifth mapping subunit is used for fusing the third image point features with the same height value in the plurality of third image point features to obtain a plurality of second image point features.
According to an embodiment of the present disclosure, the second mapping module 750 includes a third mapping unit and a fourth mapping unit.
And the third mapping unit is used for mapping a plurality of first point cloud features in the point cloud space to the aerial view space by using a second camera internal and external parameter matrix related to the first point cloud sample to obtain a plurality of third point cloud features.
And the fourth mapping unit is used for fusing the third point cloud features with the same height value in the plurality of third point cloud features to obtain a plurality of second point cloud features.
According to an embodiment of the present disclosure, the first training module 760 includes a first training unit, a second training unit, a third training unit, and a fourth training unit.
The first training unit is used for generating a plurality of sample pairs based on a plurality of second image point features and a plurality of second point cloud features, wherein the sample pairs comprise one second image point feature and one second point cloud feature.
And a second training unit configured to divide the plurality of samples into a plurality of positive sample pairs and a plurality of negative sample pairs based on target points corresponding to a second image point feature and a second point cloud feature included in the sample pairs, respectively.
And the third training unit is used for calculating a loss value based on the positive sample pairs and the negative sample pairs.
And the fourth training unit is used for respectively adjusting the model parameters of the image feature extraction network and the model parameters of the point cloud feature extraction network by using the loss values.
According to an embodiment of the present disclosure, the second training unit comprises a first training subunit and a second training subunit.
And the first training subunit is used for determining the sample pair as a positive sample pair when the target point corresponding to the second image point characteristic included in the sample pair is consistent with the target point corresponding to the second point cloud characteristic included in the sample pair.
And a second training subunit configured to determine that the sample pair is a negative sample pair if the target point corresponding to the second image point feature included in the sample pair and the target point corresponding to the second point cloud feature included in the sample pair are inconsistent.
According to an embodiment of the present disclosure, the first input module 710 includes a first input unit and a second input unit.
The first input unit is used for inputting the first image sample into the image feature extraction network to obtain the output features of each of a plurality of intermediate layers included in the image feature extraction network.
And a second input unit for determining image features from the output features of each of the plurality of intermediate layers.
According to an embodiment of the present disclosure, the middle layer of the image feature extraction network is the downsampling network layer.
According to an embodiment of the present disclosure, the model pre-training apparatus 700 further comprises a second determination module, a fourth processing module, and a third determination module.
And the second determining module is used for determining first coordinate information of each of the plurality of target points in the image space.
And the fourth processing module is used for carrying out downsampling processing on the first coordinate information of each of the plurality of target points based on the downsampling multiplying power of the middle layer related to the image characteristics in the image characteristic extraction network to obtain the second coordinate information of each of the plurality of target points in the image space.
And a third determining module for determining a plurality of first image point features corresponding to the plurality of target points from a plurality of image point features included in the image features based on second coordinate information of each of the plurality of target points in the image space.
According to an embodiment of the present disclosure, the model pre-training apparatus 700 further comprises a third mapping module and a fourth determination module.
And the third mapping module is used for mapping the second coordinate information of each of the plurality of target points in the image space to the point cloud space based on the first camera inner-outer parameter matrix related to the first image sample and the second camera inner-outer parameter matrix related to the first point cloud sample, so as to obtain the third coordinate information of each of the plurality of target points in the point cloud space.
And a fourth determining module, configured to determine, from a plurality of point cloud features included in the point cloud image feature, a plurality of first point cloud features corresponding to the plurality of target points based on third coordinate information of each of the plurality of target points in the point cloud space.
According to an embodiment of the present disclosure, the second mapping subunit includes a mapping component.
And a mapping component for determining depth distribution information of each of the plurality of target points from depth distribution information of each of the plurality of image point features included in the depth distribution feature based on first coordinate information of each of the plurality of target points in the image space.
According to an embodiment of the present disclosure, the first image sample comprises images acquired by a plurality of image acquisition devices for the same subject at the same time, each of the plurality of image acquisition devices having a different camera view angle.
Fig. 8A schematically illustrates a block diagram of a model training apparatus according to an embodiment of the present disclosure.
As shown in fig. 8A, model training apparatus 800A includes a third input module 810 and a second training module 820.
A third input module 810 is configured to input a third image sample into an image processing model to obtain a first output feature, where the image processing model includes a pre-trained image feature extraction network and an image processing network.
The second training module 820 is configured to obtain a first loss based on the label and the first output feature of the third image sample, and train the image processing model based on the first loss.
According to an embodiment of the present disclosure, the pre-trained image feature extraction network comprises training using a model pre-training method as described above.
Fig. 8B schematically illustrates a block diagram of an object processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 8B, the object processing apparatus 800B includes a first processing module 830.
The first processing module 830 is configured to input an image to be processed including the target object into the image processing model, to obtain an image processing result.
According to an embodiment of the present disclosure, the image processing model comprises training using a model training method as described above.
Fig. 9A schematically illustrates a block diagram of a model training apparatus according to another embodiment of the present disclosure.
As shown in fig. 9A, model training apparatus 900A includes a fourth input module 910 and a third training module 920.
A fourth input module 910, configured to input the third point cloud sample into a point cloud data processing model to obtain a second output feature, where the point cloud data processing model includes a pre-trained point cloud feature extraction network and a point cloud data processing network.
The third training module 920 is configured to obtain a second loss based on the label of the third point cloud sample and the second output feature, and train the point cloud data processing model based on the second loss.
According to an embodiment of the present disclosure, the pre-trained point cloud feature extraction network comprises training using a model pre-training method as described above.
Fig. 9B schematically illustrates a block diagram of an object processing apparatus according to another embodiment of the present disclosure.
As shown in fig. 9B, the object processing apparatus 900B includes a second processing module 930.
The second processing module 930 is configured to input the point cloud to be processed including the target object into the point cloud data processing model, and obtain a point cloud data processing result.
According to an embodiment of the present disclosure, the point cloud data processing model includes training using the model training method as described above.
Fig. 10A schematically illustrates a block diagram of a model training apparatus according to yet another embodiment of the present disclosure.
As shown in fig. 10A, the model training apparatus 1000A includes a fifth input module 1010, a sixth input module 1020, a feature fusion module 1030, a seventh input module 1040, and a fourth training module 1050.
A fifth input module 1010, configured to input a fourth image sample included in the training sample into a first backbone network of the object processing model, to obtain a third output feature, where the first backbone network is a pre-trained image feature extraction network.
And a sixth input module 1020, configured to input a fourth point cloud sample included in the training sample into a second backbone network of the object processing model to obtain a fourth output feature, where the second backbone network is a pre-trained point cloud feature extraction network.
And the feature fusion module 1030 is configured to perform feature fusion on the third output feature and the fourth output feature to obtain a fusion feature.
A seventh input module 1040 is configured to input the fusion feature into the object processing network of the object processing model, to obtain a fifth output feature.
A fourth training module 1050 for deriving a third penalty based on the labels of the training samples and the fifth output characteristic, and training the object handling model based on the third penalty.
According to an embodiment of the present disclosure, the pre-trained image feature extraction network and the pre-trained point cloud feature extraction network comprise training using a model pre-training method as described above.
Fig. 10B schematically illustrates a block diagram of an object processing apparatus according to still another embodiment of the present disclosure.
As shown in fig. 10B, the object processing apparatus 1000B includes a third processing module 1060.
And a third processing module 1060, configured to input the to-be-processed image including the target object and the to-be-processed point cloud into the object processing model, and obtain an object processing result.
According to an embodiment of the present disclosure, the object handling model comprises being trained using the model training method as described above.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described above.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
FIG. 11 illustrates a schematic block diagram of an example electronic device that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the apparatus 1100 includes a computing unit 1101 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the device 1100 can also be stored. The computing unit 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
Various components in device 1100 are connected to an input/output (I/O) interface 1105, including: an input unit 1106 such as a keyboard, a mouse, etc.; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108, such as a magnetic disk, optical disk, etc.; and a communication unit 1109 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 1101 performs the respective methods and processes described above, such as a model pre-training method, a model training method, or an object processing method. For example, in some embodiments, the model pre-training method, the model training method, or the object processing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1108. In some embodiments, some or all of the computer programs may be loaded and/or installed onto device 1100 via ROM 1102 and/or communication unit 1109. When the computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the model pre-training method, the model training method, or the object processing method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform a model pre-training method, a model training method, or an object processing method in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (30)

1. A model pre-training method, comprising:
inputting the first image sample into an image feature extraction network to obtain image features;
inputting a first point cloud sample into a point cloud feature extraction network to obtain point cloud image features, wherein the first point cloud sample and the first image sample are acquired at the same time aiming at the same object;
determining a plurality of target points from the first image sample;
mapping a plurality of first image point features corresponding to the target points in the image features to a bird's eye view space to obtain a plurality of second image point features;
mapping a plurality of first point cloud features corresponding to the plurality of target points in the point cloud image features to the aerial view space to obtain a plurality of second point cloud features; and
and comparing and training the image feature extraction network and the point cloud feature extraction network by utilizing the plurality of second image point features and the plurality of second point cloud features.
2. The method of claim 1, wherein the determining a plurality of target points from the first image sample comprises:
determining at least one key point from the first image sample based on shape feature information of the first image sample;
for each key point, determining a key area by taking the key point as a center; and
and determining the point contained in the key area as the target point.
3. The method of claim 1, wherein the mapping the plurality of first image point features corresponding to the plurality of target points to the aerial view space to obtain the plurality of second image point features includes:
determining respective depth distribution information of the plurality of target points by using a depth estimation model; and
and mapping a plurality of first image point features corresponding to the target points to the aerial view space based on the depth distribution information of the target points, so as to obtain the second image point features.
4. The method of claim 3, wherein the determining depth distribution information for each of the plurality of target points using a depth estimation model comprises:
inputting the first image sample into the depth estimation model to obtain depth distribution characteristics; and
depth distribution information of each of the plurality of target points is determined from the depth distribution features.
5. The method of claim 3 or 4, further comprising:
and training the first initial model by taking the second image sample as a training sample and taking depth information included in the second point cloud sample as a label to obtain the depth estimation model, wherein the second point cloud sample and the second image sample are acquired at the same time aiming at the same object.
6. The method according to claim 3, wherein the mapping the plurality of first image point features corresponding to the plurality of target points to the bird's eye view space based on the depth distribution information of each of the plurality of target points to obtain the plurality of second image point features includes:
fusing a plurality of first image point features corresponding to the plurality of target points and respective depth distribution information of the plurality of target points to obtain a plurality of pseudo point cloud features;
mapping the plurality of pseudo point cloud features in the image space to the aerial view space by using a first camera internal-external parameter matrix related to the first image sample to obtain a plurality of third image point features; and
and fusing the third image point features with the same height value in the plurality of third image point features to obtain the plurality of second image point features.
7. The method of claim 1, wherein the mapping the plurality of first point cloud features corresponding to the plurality of target points in the point cloud image features to the aerial view space to obtain the plurality of second point cloud features includes:
mapping the plurality of first point cloud features in a point cloud space to the aerial view space by using a second camera inner and outer parameter matrix related to the first point cloud sample to obtain a plurality of third point cloud features; and
and fusing the third point cloud features with the same height value in the plurality of third point cloud features to obtain the plurality of second point cloud features.
8. The method of claim 1, wherein the contrastive training of the image feature extraction network and the point cloud feature extraction network by using the plurality of second image point features and the plurality of second point cloud features comprises:
generating a plurality of sample pairs based on the plurality of second image point features and the plurality of second point cloud features, wherein each sample pair includes one second image point feature and one second point cloud feature;
dividing the plurality of sample pairs into a plurality of positive sample pairs and a plurality of negative sample pairs based on the target points respectively corresponding to the second image point feature and the second point cloud feature included in each sample pair;
calculating a loss value based on the plurality of positive sample pairs and the plurality of negative sample pairs; and
adjusting model parameters of the image feature extraction network and model parameters of the point cloud feature extraction network respectively by using the loss value.
9. The method of claim 8, wherein the dividing the plurality of sample pairs into a plurality of positive sample pairs and a plurality of negative sample pairs based on the target points respectively corresponding to the second image point feature and the second point cloud feature included in each sample pair comprises:
for each sample pair, determining the sample pair as a positive sample pair in the case that the target point corresponding to the second image point feature included in the sample pair is consistent with the target point corresponding to the second point cloud feature included in the sample pair; and
determining the sample pair as a negative sample pair in the case that the target point corresponding to the second image point feature included in the sample pair is inconsistent with the target point corresponding to the second point cloud feature included in the sample pair.
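Claims 8 and 9 make a pair positive when its image feature and point cloud feature come from the same target point, and negative otherwise. A standard InfoNCE-style objective fits that description; the snippet below is a minimal sketch in which row i of both tensors is assumed to belong to the same target point, and the temperature, normalisation, and symmetric two-direction form are assumptions rather than requirements of the claims.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats: torch.Tensor, pc_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """
    img_feats, pc_feats: (N, C) second image point features and second point cloud features,
    aligned so that row i of both tensors corresponds to the same target point.
    Diagonal pairs are positives; all cross-point pairs act as negatives.
    """
    img = F.normalize(img_feats, dim=1)
    pc = F.normalize(pc_feats, dim=1)
    logits = img @ pc.t() / temperature               # (N, N) similarity of every sample pair
    targets = torch.arange(len(img), device=img.device)
    # Symmetric loss: image -> point cloud and point cloud -> image directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

The resulting loss value would then be back-propagated to adjust the parameters of both feature extraction networks, as the final element of claim 8 describes.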
10. The method of claim 1, wherein the inputting the first image sample into the image feature extraction network to obtain image features comprises:
inputting the first image sample into the image feature extraction network to obtain respective output features of a plurality of intermediate layers included in the image feature extraction network; and
determining the image features from the respective output features of the plurality of intermediate layers.
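Claim 10 assembles the image features from the outputs of several intermediate layers of the backbone. In PyTorch this is commonly done with forward hooks; the snippet below is illustrative only, with torchvision's ResNet-50 standing in for the unspecified image feature extraction network and an arbitrary set of layers.

```python
import torch
import torchvision

# A stand-in backbone; the patent does not name a specific network or layer set.
backbone = torchvision.models.resnet50(weights=None)
intermediate_outputs = {}

def make_hook(name):
    def hook(module, inputs, output):
        intermediate_outputs[name] = output          # cache this intermediate layer's output
    return hook

for layer_name in ["layer2", "layer3", "layer4"]:    # downsampling intermediate layers
    getattr(backbone, layer_name).register_forward_hook(make_hook(layer_name))

images = torch.randn(2, 3, 256, 704)                 # dummy stand-in for the first image sample
_ = backbone(images)
# intermediate_outputs now holds the per-layer output features from which the image
# features of claim 10 could be assembled (for example by an FPN-style fusion).
```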
11. The method of any of claims 1-10, wherein an intermediate layer of the image feature extraction network is a downsampling network layer;
the method further comprises:
determining first coordinate information of each of the plurality of target points in an image space;
performing downsampling processing on the first coordinate information of each of the plurality of target points based on a downsampling ratio of an intermediate layer, in the image feature extraction network, related to the image features, to obtain second coordinate information of each of the plurality of target points in the image space; and
determining a plurality of first image point features corresponding to the plurality of target points from a plurality of image point features included in the image features, based on the second coordinate information of each of the plurality of target points in the image space.
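Claim 11 rescales each target point's full-resolution coordinates by the intermediate layer's downsampling ratio so that the point can index exactly one feature vector in the downsampled feature map. A minimal sketch of that indexing, with an assumed ratio of 8 and a hypothetical helper name, follows.

```python
import torch

def gather_first_image_point_features(feature_map: torch.Tensor,
                                       pixel_coords: torch.Tensor,
                                       downsample_ratio: int = 8) -> torch.Tensor:
    """
    feature_map : (C, H, W) image features from a downsampling intermediate layer.
    pixel_coords: (N, 2) first coordinate information (u, v) in the full-resolution image space.
    Divides the coordinates by the downsampling ratio (second coordinate information) and
    gathers one first image point feature per target point.
    """
    coords = (pixel_coords.float() / downsample_ratio).long()
    u = coords[:, 0].clamp(0, feature_map.shape[2] - 1)
    v = coords[:, 1].clamp(0, feature_map.shape[1] - 1)
    return feature_map[:, v, u].t()                   # (N, C)
```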
12. The method of claim 11, further comprising:
mapping the second coordinate information of each of the plurality of target points in the image space to a point cloud space based on a first camera intrinsic-extrinsic parameter matrix related to the first image sample and a second camera intrinsic-extrinsic parameter matrix related to the first point cloud sample, to obtain third coordinate information of each of the plurality of target points in the point cloud space; and
determining a plurality of first point cloud features corresponding to the plurality of target points from a plurality of point cloud features included in the point cloud image features, based on the third coordinate information of each of the plurality of target points in the point cloud space.
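Claim 12 converts the image-space coordinates of the target points into point-cloud-space coordinates using the two camera intrinsic-extrinsic parameter matrices, but does not say how the missing depth of a pixel is resolved. One plausible realisation, sketched below, projects the point cloud sample into the image and associates each target pixel with its nearest projected lidar point; the function name, depth threshold, and matrix conventions are assumptions.

```python
import torch

def image_coords_to_point_cloud_coords(pixel_coords: torch.Tensor,   # (N, 2) target points in image space
                                        lidar_xyz: torch.Tensor,     # (M, 3) first point cloud sample
                                        intrinsics: torch.Tensor,    # (3, 3) camera intrinsic matrix
                                        lidar_to_cam: torch.Tensor   # (4, 4) extrinsic transform
                                        ) -> torch.Tensor:
    """Returns (N, 3) third coordinate information of the target points in the point cloud space."""
    ones = torch.ones(lidar_xyz.shape[0], 1)
    cam = (torch.cat([lidar_xyz, ones], dim=1) @ lidar_to_cam.T)[:, :3]   # lidar -> camera frame
    in_front = cam[:, 2] > 0.1                                            # keep points in front of the camera
    cam, xyz = cam[in_front], lidar_xyz[in_front]
    uv = cam @ intrinsics.T
    uv = uv[:, :2] / uv[:, 2:3]                                           # (M', 2) projected pixel coordinates
    dists = torch.cdist(pixel_coords.float(), uv)                         # (N, M') pixel distances
    nearest = dists.argmin(dim=1)                                         # nearest projected lidar point
    return xyz[nearest]                                                   # (N, 3) coordinates in point cloud space
```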
13. The method of claim 11, wherein the determining depth distribution information of each of the plurality of target points from the depth distribution features comprises:
determining the depth distribution information of each of the plurality of target points from depth distribution information of each of a plurality of image point features included in the depth distribution features, based on the first coordinate information of each of the plurality of target points in the image space.
14. The method of claim 1, wherein the first image sample is acquired by an image acquisition device and the first point cloud sample is acquired by a radar device;
the first image sample includes images acquired for the same object at the same time by a plurality of image acquisition devices, the plurality of image acquisition devices each having a different camera view angle.
15. A model training method, comprising:
inputting a third image sample into an image processing model to obtain a first output feature, wherein the image processing model comprises a pre-trained image feature extraction network and an image processing network; and
obtaining a first loss based on the label of the third image sample and the first output feature, and training the image processing model based on the first loss;
wherein the pre-trained image feature extraction network is trained using the model pre-training method according to any one of claims 1 to 14.
16. A model training method, comprising:
inputting a third point cloud sample into a point cloud data processing model to obtain a second output feature, wherein the point cloud data processing model comprises a pre-trained point cloud feature extraction network and a point cloud data processing network; and
obtaining a second loss based on the label of the third point cloud sample and the second output feature, and training the point cloud data processing model based on the second loss;
wherein the pre-trained point cloud feature extraction network is trained using the model pre-training method according to any one of claims 1 to 14.
17. A model training method, comprising:
inputting a fourth image sample included in a training sample into a first backbone network of an object processing model to obtain a third output feature, wherein the first backbone network is a pre-trained image feature extraction network;
inputting a fourth point cloud sample included in the training sample into a second backbone network of the object processing model to obtain a fourth output feature, wherein the second backbone network is a pre-trained point cloud feature extraction network;
performing feature fusion on the third output feature and the fourth output feature to obtain fusion features;
inputting the fusion features into an object processing network of the object processing model to obtain a fifth output feature; and
obtaining a third loss based on the label of the training sample and the fifth output feature, and training the object processing model based on the third loss;
wherein the pre-trained image feature extraction network and the pre-trained point cloud feature extraction network are trained using the model pre-training method according to any one of claims 1 to 14.
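Claim 17 assembles the object processing model from the two pre-trained backbones, a feature fusion step, and an object processing network. The skeleton below is an illustrative sketch only: the concatenation-plus-linear fusion, the linear head, and all dimensions are assumptions, and the third loss of the claim would simply be computed between the model output and the label of the training sample.

```python
import torch
import torch.nn as nn

class ObjectProcessingModel(nn.Module):
    """Two-backbone fusion model in the spirit of claim 17 (all concrete modules are placeholders)."""

    def __init__(self, image_backbone: nn.Module, point_backbone: nn.Module,
                 feat_dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.image_backbone = image_backbone   # first backbone: pre-trained image feature extractor
        self.point_backbone = point_backbone   # second backbone: pre-trained point cloud feature extractor
        self.fuse = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())
        self.head = nn.Linear(feat_dim, num_classes)       # stand-in object processing network

    def forward(self, image_sample: torch.Tensor, point_sample: torch.Tensor) -> torch.Tensor:
        # Both backbones are assumed here to return (B, feat_dim) global features.
        third_out = self.image_backbone(image_sample)      # third output feature
        fourth_out = self.point_backbone(point_sample)     # fourth output feature
        fused = self.fuse(torch.cat([third_out, fourth_out], dim=-1))   # fusion feature
        return self.head(fused)                            # fifth output feature
```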
18. An object processing method, comprising:
inputting an image to be processed comprising a target object into an image processing model to obtain an image processing result;
wherein the image processing model is trained using the model training method of claim 15.
19. An object processing method, comprising:
inputting point clouds to be processed comprising target objects into a point cloud data processing model to obtain a point cloud data processing result;
wherein the point cloud data processing model is trained using the model training method of claim 16.
20. An object processing method, comprising:
inputting an image to be processed and a point cloud to be processed, which comprise a target object, into an object processing model to obtain an object processing result;
wherein the object processing model is trained using the model training method of claim 17.
21. A model pre-training apparatus, comprising:
a first input module, configured to input a first image sample into an image feature extraction network to obtain image features;
a second input module, configured to input a first point cloud sample into a point cloud feature extraction network to obtain point cloud image features, wherein the first point cloud sample and the first image sample are acquired at the same time for the same object;
a first determining module, configured to determine a plurality of target points from the first image sample;
a first mapping module, configured to map a plurality of first image point features corresponding to the plurality of target points in the image features to a bird's eye view space to obtain a plurality of second image point features;
a second mapping module, configured to map a plurality of first point cloud features corresponding to the plurality of target points in the point cloud image features to the bird's eye view space to obtain a plurality of second point cloud features; and
a first training module, configured to perform contrastive training on the image feature extraction network and the point cloud feature extraction network by using the plurality of second image point features and the plurality of second point cloud features.
22. A model training apparatus comprising:
a third input module, configured to input a third image sample into an image processing model to obtain a first output feature, wherein the image processing model comprises a pre-trained image feature extraction network and an image processing network; and
a second training module, configured to obtain a first loss based on the label of the third image sample and the first output feature, and train the image processing model based on the first loss;
wherein the pre-trained image feature extraction network is trained using the model pre-training method according to any one of claims 1 to 14.
23. A model training apparatus comprising:
a fourth input module, configured to input a third point cloud sample into a point cloud data processing model to obtain a second output feature, wherein the point cloud data processing model comprises a pre-trained point cloud feature extraction network and a point cloud data processing network; and
a third training module, configured to obtain a second loss based on the label of the third point cloud sample and the second output feature, and train the point cloud data processing model based on the second loss;
wherein the pre-trained point cloud feature extraction network is trained using the model pre-training method according to any one of claims 1 to 14.
24. A model training apparatus comprising:
a fifth input module, configured to input a fourth image sample included in a training sample into a first backbone network of an object processing model to obtain a third output feature, wherein the first backbone network is a pre-trained image feature extraction network;
a sixth input module, configured to input a fourth point cloud sample included in the training sample into a second backbone network of the object processing model to obtain a fourth output feature, wherein the second backbone network is a pre-trained point cloud feature extraction network;
a feature fusion module, configured to perform feature fusion on the third output feature and the fourth output feature to obtain a fusion feature;
a seventh input module, configured to input the fusion feature into an object processing network of the object processing model to obtain a fifth output feature; and
a fourth training module, configured to obtain a third loss based on the label of the training sample and the fifth output feature, and train the object processing model based on the third loss;
wherein the pre-trained image feature extraction network and the pre-trained point cloud feature extraction network are trained using the model pre-training method according to any one of claims 1 to 14.
25. An object processing apparatus comprising:
a first processing module, configured to input an image to be processed comprising a target object into an image processing model to obtain an image processing result;
wherein the image processing model is trained using the model training method of claim 15.
26. An object processing apparatus comprising:
a second processing module, configured to input a point cloud to be processed comprising a target object into a point cloud data processing model to obtain a point cloud data processing result;
wherein the point cloud data processing model is trained using the model training method of claim 16.
27. An object processing apparatus comprising:
a third processing module, configured to input an image to be processed and a point cloud to be processed, which comprise a target object, into an object processing model to obtain an object processing result;
wherein the object processing model is trained using the model training method of claim 17.
28. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-20.
29. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-20.
30. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-20.
CN202310701200.9A 2023-06-13 2023-06-13 Model pre-training method, model training method, object processing method and device Active CN116740498B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310701200.9A CN116740498B (en) 2023-06-13 2023-06-13 Model pre-training method, model training method, object processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310701200.9A CN116740498B (en) 2023-06-13 2023-06-13 Model pre-training method, model training method, object processing method and device

Publications (2)

Publication Number Publication Date
CN116740498A 2023-09-12
CN116740498B 2024-06-21

Family

ID=87910968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310701200.9A Active CN116740498B (en) 2023-06-13 2023-06-13 Model pre-training method, model training method, object processing method and device

Country Status (1)

Country Link
CN (1) CN116740498B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220344049A1 (en) * 2019-09-23 2022-10-27 Presagen Pty Ltd Decentralized artificial intelligence (ai)/machine learning training system
WO2021213012A1 (en) * 2020-04-22 2021-10-28 华为技术有限公司 Weight measurement method, human body feature parameter measurement method, and device
CN111539973A (en) * 2020-04-28 2020-08-14 北京百度网讯科技有限公司 Method and device for detecting pose of vehicle
CN111739005A (en) * 2020-06-22 2020-10-02 北京百度网讯科技有限公司 Image detection method, image detection device, electronic equipment and storage medium
CN112989927A (en) * 2021-02-03 2021-06-18 杭州电子科技大学 Scene graph generation method based on self-supervision pre-training
CN113674421A (en) * 2021-08-25 2021-11-19 北京百度网讯科技有限公司 3D target detection method, model training method, related device and electronic equipment
CN113902897A (en) * 2021-09-29 2022-01-07 北京百度网讯科技有限公司 Training of target detection model, target detection method, device, equipment and medium
CN115205633A (en) * 2022-07-27 2022-10-18 北京大学 Automatic driving multi-mode self-supervision pre-training method based on aerial view comparison learning
CN116091574A (en) * 2023-01-09 2023-05-09 西安交通大学 3D target detection method and system based on plane constraint and position constraint
CN115860102A (en) * 2023-02-10 2023-03-28 北京百度网讯科技有限公司 Pre-training method, device, equipment and medium for automatic driving perception model
CN115880536A (en) * 2023-02-15 2023-03-31 北京百度网讯科技有限公司 Data processing method, training method, target object detection method and device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ALEXANDRE BOULCH et al.: "ALSO: Automotive Lidar Self-supervision by Occupancy estimation", ARXIV, 4 April 2023 (2023-04-04) *
SONG Yifan et al.: "Improved 3D object detection method based on redundant point filtering", Journal of Computer Applications (计算机应用), no. 09, 31 December 2020 (2020-12-31) *

Also Published As

Publication number Publication date
CN116740498B (en) 2024-06-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant