CN114998684A — Training method and positioning adjustment method of geographic and visual cross-modal pre-training model

Publication number: CN114998684A (application CN202210637548.1A); granted as CN114998684B
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 黄际洲, 刘希岩, 夏德国, 王海峰
Current assignee: Beijing Baidu Netcom Science and Technology Co., Ltd. (also the original assignee and applicant)
Legal status: Active (granted)


Classifications

    • G06V 10/774 — Image or video recognition using machine learning: generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 16/29 — Information retrieval of structured data: geographical information databases
    • G06V 10/44 — Extraction of image or video features: local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/764 — Image or video recognition using pattern recognition or machine learning: classification, e.g. of video objects


Abstract

The present disclosure provides a training method and a positioning adjustment method for a geographic and visual cross-modal pre-training model, and relates to the field of artificial intelligence, in particular to natural language processing, computer vision, and related fields. The specific implementation scheme is as follows: a pre-training data set is constructed based on map data, and model training is performed on a model to be trained according to the pre-training data set and a pre-training target to obtain a first pre-training model for establishing a relationship between geography and vision. With the method and device, the accuracy of the model can be improved.

Description

Training method and positioning adjustment method of geographic and visual cross-modal pre-training model
Cross Reference to Related Applications
The present disclosure claims priority to Chinese patent application No. 202210557375.2, filed on May 20, 2022, which is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to the field of artificial intelligence technology, and more particularly to the fields of natural language processing, computer vision, and the like.
Background
As a core module of an autonomous driving system, a high-precision map can provide rich prior guidance for real-time navigation, path planning, behavior decision-making, and the like. To some extent, the quality of the high-precision map has become a key factor in the development of the autonomous driving industry, and how to obtain a more accurate high-precision map is a problem to be solved.
Disclosure of Invention
The disclosure provides a training method, a positioning adjustment method, a device, electronic equipment and a storage medium of a geographic and visual cross-modal pre-training model.
According to an aspect of the present disclosure, there is provided a training method of a geographic and visual cross-modal pre-training model, including:
constructing a pre-training data set based on the map data;
and performing model training on the model to be trained according to the pre-training data set and the pre-training target to obtain a first pre-training model for establishing the relation between geography and vision.
According to another aspect of the present disclosure, there is provided a positioning adjustment method, including:
obtaining target data according to the obtained crowdsourcing data and a first pre-training model for establishing a geographical and visual relation;
and adjusting the positioning accuracy of the crowdsourcing data according to the target data to obtain data matched with the positioning accuracy of the map data.
According to another aspect of the present disclosure, there is provided a training apparatus for a geographic and visual cross-modal pre-training model, including:
the building module is used for building a pre-training data set based on the map data;
and the training module is used for carrying out model training on the model to be trained according to the pre-training data set and the pre-training target to obtain a first pre-training model for establishing the relation between geography and vision.
According to another aspect of the present disclosure, there is provided a positioning adjustment apparatus including:
the first processing module is used for obtaining target data according to the obtained crowdsourcing data and a first pre-training model used for establishing a geographic and visual relation;
and the positioning adjustment module is used for adjusting the positioning accuracy of the crowdsourcing data according to the target data to obtain data matched with the positioning accuracy of the map data.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method provided by any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium storing computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by a processor, implement the method provided by any one of the embodiments of the present disclosure.
By adopting the method and the device, the pre-training data set can be established based on the map data, and the model training can be carried out on the model to be trained according to the pre-training data set and the pre-training target, so that the first pre-training model for establishing the geographic and visual relation is obtained, and the precision of the model is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a distributed cluster processing scenario according to an embodiment of the present disclosure;
FIG. 2 is a flow diagram of a method of training a geo-and visual cross-modal pre-training model according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of a positioning adjustment method according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a high-precision geographic location prediction based on a geographic and visual cross-modal pre-training model during a model usage phase according to an embodiment of the present disclosure;
FIG. 5 is a diagram of an application scenario during a model usage phase according to an embodiment of the present disclosure;
FIG. 6 is a schematic illustration of the training/use of a geo-and visual cross-modal pre-training model in an application example in accordance with an embodiment of the present disclosure;
FIG. 7 is a diagram illustrating a composition structure of a geographic and visual cross-modal pre-training model in an application example according to an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a positioning adjustment apparatus according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a training apparatus for a geographic and visual cross-modal pre-training model according to an embodiment of the present disclosure;
FIG. 10 is a block diagram of an electronic device for implementing the positioning adjustment method/training method of the geographic and visual cross-modal pre-training model of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. The term "at least one" herein means any combination of at least two of any one or more of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C. The terms "first" and "second" used herein refer to and distinguish one from another in the similar art, without necessarily implying a sequence or order, or implying only two, such as first and second, to indicate that there are two types/two, first and second, and first and second may also be one or more.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
The quality of high-precision maps (i.e., high precision, high richness, and high freshness) is a key factor in the development of the autonomous driving industry. Limited by the data production cost and technical barriers of high-precision maps, their production and updating face serious challenges. On the one hand, high-precision map data can be collected and updated by high-precision survey vehicles; however, due to cost, the update frequency of high-precision survey vehicles is low, making efficient and rapid data updating difficult. On the other hand, high-precision map data can be collected and updated through crowdsourcing; the low-cost, high-frequency crowdsourcing mode enables efficient and rapid data updating, but its positioning accuracy falls far short of the accuracy requirements of a high-precision map.
According to an embodiment of the present disclosure, fig. 1 is a schematic diagram of an application scenario in which autonomous vehicles communicate with a cloud. As shown in fig. 1, the scenario includes a backend server 100, a plurality of vehicles (e.g., vehicles 107-109), and a "cloud" 106 for communication between the backend server and the vehicles. A distributed cluster system may be adopted on the backend server side. As an example, the distributed cluster system may receive crowdsourced data reported by the vehicles, obtain target data according to the crowdsourced data and a first pre-training model (i.e., a geographic and visual cross-modal pre-training model) for establishing a relationship between geography and vision, and update map data according to the target data to obtain updated map data. The first pre-training model may be deployed on the vehicle-mounted terminal side of the vehicles, or on the backend server side. If the first pre-training model is deployed on the vehicle-mounted terminal side, the map updating task is executed by the vehicle-mounted terminal; if it is deployed on the backend server side, as shown in fig. 1, the distributed cluster system includes a plurality of nodes (e.g., server cluster 101, server 102, server cluster 103, server 104, and server 105), and one or more map updating tasks may be executed among the nodes. Optionally, the nodes in the distributed cluster system may execute part or all of the processing flow of a map updating task. Optionally, after each round of data processing is completed, data exchange (e.g., data synchronization) may be performed among the nodes.
According to an embodiment of the present disclosure, a training method for a geographic and visual cross-modal pre-training model is provided. Fig. 2 is a schematic flowchart of the training method according to an embodiment of the present disclosure. The method may be applied to a training apparatus for the geographic and visual cross-modal pre-training model; for example, the apparatus may be deployed in a terminal, a server, or other processing device in a single-machine, multi-machine, or cluster system, and may implement processing such as model training. The terminal may be user equipment (UE), a mobile device, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method may also be implemented by a processor invoking computer-readable instructions stored in a memory. As shown in fig. 2, the method is applied to any node, electronic device (mobile phone, desktop computer, etc.), or vehicle-mounted terminal in the cluster system shown in fig. 1, and includes:
s201, constructing a pre-training data set based on the map data.
S202, model training is carried out on the model to be trained according to the pre-training data set and the pre-training target, and a first pre-training model used for building the relation between geography and vision is obtained.
In an example of S201-S202, the map data may be high-precision map data; since the positioning accuracy of high-precision map data is higher, a pre-training data set constructed based on high-precision map data is more conducive to improving the accuracy of model training. The pre-training target may be a model performance target expected to be achieved by model training. For example, a loss operation is performed on the classification labels output in the model training stage and the target labels satisfying the pre-training target to obtain a loss function, and the first pre-training model (i.e., the geographic and visual cross-modal pre-training model) is obtained after model training is performed according to the loss function.
By adopting the method and the device, the pre-training data set can be established according to the map data, so that the model training is carried out on the model to be trained according to the pre-training data set and the pre-training target, and the first pre-training model for establishing the geographic and visual relation is obtained.
It should be noted that, in the training stage of the first pre-training model, model training may be performed using only map data with high positioning accuracy (such as high-precision map data), while in the usage stage of the model, only crowdsourced data with a fast update frequency may be used. Thus, the target data obtained from the crowdsourced data and the first pre-training model for establishing a relationship between geography and vision (the target data represents the data obtained after the positioning accuracy of the crowdsourced data is adjusted to match that of the map data) not only exploits the fast update frequency of the crowdsourced data, but also has high positioning accuracy. The map data is updated according to the target data, and the accuracy of the map data is improved by the update.
According to an embodiment of the present disclosure, a positioning adjustment method is provided. Fig. 3 is a schematic flowchart of the positioning adjustment method according to an embodiment of the present disclosure. The method may be applied to a positioning adjustment apparatus; for example, the apparatus may be deployed in a terminal, a server, or other processing device in a single-machine, multi-machine, or cluster system, and may implement processing such as positioning adjustment. The terminal may be user equipment (UE), a mobile device, a personal digital assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method may also be implemented by a processor invoking computer-readable instructions stored in a memory. As shown in fig. 3, the method is applied to any node, electronic device (mobile phone, desktop computer, etc.), or vehicle-mounted terminal in the cluster system shown in fig. 1, and includes:
s301, obtaining target data according to the obtained crowdsourcing data and a first pre-training model for establishing a geographic and visual relation.
S302, adjusting the positioning accuracy of the crowdsourcing data according to the target data to obtain data matched with the positioning accuracy of the map data.
In an example of S301-S302, the crowdsourced data may be data collected by vehicle owners, and may include first image information and its first geographic location information. Potential relationships between geography and vision can be mined based on the first pre-training model (i.e., the geographic and visual cross-modal pre-training model) to establish a mapping between image information and geographic location information in a map coordinate system (e.g., a high-precision map coordinate system). The crowdsourced data is input into the first pre-training model, and target data can be predicted (the target data may be the image blocks in the first image information and the geolocation encodings corresponding to those image blocks). The positioning accuracy of the crowdsourced data can then be adjusted according to the target data; since the target data has higher positioning accuracy, adjusting the positioning accuracy of the crowdsourced data with the target data yields data matched with the positioning accuracy of the map data, i.e., the crowdsourced data is matched to the positioning accuracy of the high-precision map.
With the method and device, target data can be obtained according to the acquired crowdsourced data and the first pre-training model for establishing a relationship between geography and vision, and the target data represents the data obtained after the positioning accuracy of the crowdsourced data is adjusted to match that of the map data, thereby improving the positioning accuracy of the crowdsourced data.
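To make the S301-S302 flow concrete, below is a minimal sketch under stated assumptions: the model, the geocode encoder/decoder, and all names (CrowdRecord, positioning_adjust) are hypothetical placeholders for the components described here, not the patent's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class CrowdRecord:
    image: object          # first image information (e.g., an H x W x 3 array)
    lat: float             # first geographic location information
    lng: float

def positioning_adjust(record: CrowdRecord,
                       model: Callable,
                       encode_geo: Callable[[float, float], str],
                       decode_geo: Callable[[str], Tuple[float, float]]) -> List[Tuple[float, float]]:
    """S301: predict target data (per-block geocodes) from the crowdsourced
    record; S302: decode them as the precision-adjusted positions."""
    coarse_code = encode_geo(record.lat, record.lng)   # low-precision geolocation encoding
    block_codes = model(record.image, coarse_code)     # first pre-training model prediction
    return [decode_geo(code) for code in block_codes]  # positions matched to map accuracy
```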
Based on the embodiment of the positioning adjustment method shown in fig. 3, the following is described:
in one embodiment, the map data may be updated according to the target data to obtain updated map data, so that the accuracy of the map data is improved based on the updated map data.
With this embodiment, the crowdsourced data is input into the first pre-training model, and target data can be predicted (the target data may be the image blocks in the first image information and their corresponding geolocation encodings). Since the positioning accuracy of the target data is higher, adjusting the positioning accuracy of the crowdsourced data through prediction by the first pre-training model matches the crowdsourced data to the positioning accuracy of the map data. Finally, the map data can be updated according to the target data, which satisfies both the requirement of a high data update frequency and the requirement of positioning accuracy.
In one embodiment, obtaining target data according to the obtained crowdsourced data and a first pre-training model for establishing a relationship between geography and vision includes: extracting first image information and first geographic location information corresponding to the first image information from the crowdsourced data; inputting the first image information and the first geographic location information into the first pre-training model, and establishing a mapping relationship between the first image information and the first geographic location information in a map data coordinate system; predicting the geographic location information corresponding to each image block in the first image information according to the mapping relationship established by the first pre-training model, and outputting the predicted second geographic location information and the second image information comprising the image blocks; and taking the second geographic location information and the second image information as the target data.
In some examples, the first image information may be subjected to image block division to obtain respective image blocks in the first image information.
In some examples, as shown in fig. 4, in the model usage stage the input of the first pre-training model is crowdsourced data with a fast update frequency (or high sampling frequency). Taking the first image information in the crowdsourced data as an example, the image geolocation encoding information corresponding to the first image information is low-precision geolocation encoding information; through end-to-end processing by the first pre-training model, a geolocation prediction for the crowdsourced data can be obtained directly, specifically the geolocation prediction corresponding to each image block in the first image information. Here, "end-to-end" means that in the model training stage the original input is used directly as the input data for model training. By contrast, in a non-end-to-end approach, features are manually extracted before the data is fed into the model for training, and manually extracted features may be inaccurate; an end-to-end approach involves no manual feature extraction, so the training effect is better and the model performance (e.g., model accuracy) is higher. Therefore, in the model usage stage, the end-to-end processing of the first pre-training model fully mines the potential relationship between geography and vision in the first image information and directly establishes a mapping, in real-world map coordinates, between the first image information extracted from the crowdsourced data and each entity object it contains (e.g., the entity objects represented by the image blocks). The advantages of the first pre-training model are thus exploited to achieve prediction of the image geolocation encodings over the whole image and obtain a more accurate model output, i.e., a more accurate prediction of the geolocation encodings corresponding to the image blocks in the first image information.
In some examples, as shown in fig. 5, the first pre-training model may be deployed in a server on the network side or in a vehicle-mounted terminal. For example, the first pre-training model 5021 is deployed in the server 502 on the network side 503. The server 502 may host a management platform; the first pre-training model 5021 and the high-precision map 5022 may both be deployed in the management platform. The management platform may exchange data with the database 501 and may also obtain crowdsourced data reported by a plurality of vehicle-mounted terminals (e.g., vehicle-mounted terminals 5041-5043). Image information extracted from the crowdsourced data and its geolocation encoding information are input into the pre-trained first pre-training model 5021, and the image geolocation encoding information corresponding to each image block in the image information can be obtained directly, so that the high-precision map 5022 can update its high-precision map data based on the image geolocation encoding information corresponding to each image block, obtaining updated high-precision map data.
With this embodiment, crowdsourced data with low positioning accuracy, i.e., the first image information and its corresponding first geographic location information, is input into the first pre-training model. Since a mapping relationship between the first image information and the first geographic location information can be established in the map data coordinate system, the geographic location information corresponding to each image block in the first image information can be predicted according to that mapping, and target data with high positioning accuracy can be output, namely the second image information (e.g., the image blocks in the first image information) and the second geographic location information (e.g., the geographic location information corresponding to those image blocks). Target data with high positioning accuracy can thus be predicted directly by the first pre-training model, satisfying both the requirement of a high data update frequency and the requirement of positioning accuracy.
In one embodiment, the first geographical position information may be encoded to obtain first image position encoding information.
In some examples, the first image information and the first image position coding information may be input into the first pre-training model, a mapping relationship between the first image information and the first image position coding information is established in a map data coordinate system, and geographic position information corresponding to each image block in the first image information is predicted according to the mapping relationship established by the first pre-training model.
With this embodiment, first image position encoding information that is more accurate than the first geographic location information can be obtained through encoding, so that the first image information and the first image position encoding information are input into the first pre-training model for prediction, and more accurate second geographic location information can be output, such as the image block position encoding information corresponding to each image block of the first image information in the map data coordinate system. The target data then includes: the image block position encoding information corresponding to each image block in the first image information, and the second image information (e.g., the image blocks themselves). Target data with high positioning accuracy can thus be predicted by the first pre-training model after the first geographic location information is encoded, satisfying both the requirement of a high data update frequency and the requirement of positioning accuracy.
An embodiment of a training method based on the geographic and visual cross-modal pre-training model shown in fig. 2 is described as follows:
In one embodiment, constructing a pre-training data set based on map data includes: in the case where the map data is historical map data, screening, from the historical map data, third image information satisfying a first condition and third geographic location information corresponding to the third image information; preprocessing the third image information and the third geographic location information to obtain a preprocessing result characterizing image features and geolocation encoding features; and constructing the pre-training data set according to the preprocessing result.
In some examples, the first condition includes: the historical map data is obtained from a coverage area whose historical collection count exceeds N (N being a positive integer greater than 2) and carries depth map information.
In some examples, preprocessing the third image information and the third geographic location information to obtain a preprocessing result characterizing image features and geolocation encoding features may include: performing image preprocessing on the third image information to obtain fourth image information with the same resolution as the crowdsourced data; performing encoding preprocessing on the third geographic location information to obtain fourth image position encoding information; performing division preprocessing on the fourth image information to obtain the image blocks in the fourth image information; performing encoding preprocessing on each image block in the fourth image information to obtain the position encoding information of each image block in the fourth image information; and taking the fourth image information, the fourth image position encoding information, and the position encoding information of each image block in the fourth image information as the preprocessing result.
With this embodiment, in the model training stage, fourth image information with the same resolution as the crowdsourced data can be obtained through image preprocessing. For example, if the image information extracted from crowdsourced data is called a crowdsourced image, the fourth image information with the same resolution as the crowdsourced image is the image information in the high-precision map data, called a high-precision image; through resolution matching, the resolution of the high-precision image is adjusted to be consistent with that of the crowdsourced image. Also in the model training stage, the geolocation encoding information obtained through encoding preprocessing is more accurate than raw geographic location information; in the model usage stage, for a crowdsourced image input to the model, the geographic location of each image block in the crowdsourced image in the high-precision map coordinate system can be predicted accurately, improving positioning accuracy.
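As an illustration of the preprocessing just described, here is a minimal sketch, assuming per-pixel latitude/longitude has already been recovered from the depth map; encode_geo (a crude grid quantizer standing in for the S2-style encoding), the uniform grid split, and every parameter value are assumptions for illustration, not the patent's implementation.

```python
import numpy as np
import cv2  # OpenCV, assumed available for resizing

def encode_geo(lat: float, lng: float, cell_deg: float = 4e-5) -> str:
    # Crude stand-in for S2 encoding: quantize lat/lng to a grid cell whose
    # side (~4 m) plays the role of the minimum granularity d mentioned later.
    return f"{round(lat / cell_deg)}_{round(lng / cell_deg)}"

def preprocess_hd_sample(hd_image: np.ndarray,
                         pixel_lat_lng: np.ndarray,
                         grid: tuple = (4, 4),
                         target_hw: tuple = (512, 512)) -> dict:
    """Build one preprocessing-result sample: the resolution-matched image,
    its image-level geocode, and K = grid[0] * grid[1] block geocodes.
    pixel_lat_lng is an (H, W, 2) array of per-pixel lat/lng recovered from
    the depth map; all names and values here are illustrative assumptions."""
    fourth_image = cv2.resize(hd_image, (target_hw[1], target_hw[0]))
    coords = cv2.resize(pixel_lat_lng.astype(np.float32), (target_hw[1], target_hw[0]))

    lat_c, lng_c = coords.reshape(-1, 2).mean(axis=0)
    image_geocode = encode_geo(lat_c, lng_c)  # fourth image position encoding

    bh, bw = target_hw[0] // grid[0], target_hw[1] // grid[1]
    block_geocodes = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            blk = coords[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw].reshape(-1, 2).mean(axis=0)
            block_geocodes.append(encode_geo(blk[0], blk[1]))
    return {"image": fourth_image, "geocode": image_geocode, "blocks": block_geocodes}
```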
In one embodiment, performing model training on a model to be trained according to the pre-training data set and a pre-training target to obtain a first pre-training model for establishing a relationship between geography and vision includes: inputting a preprocessing result obtained from the pre-training data set into the model to be trained, and performing fusion processing of image features and geolocation encoding features on the preprocessing result to obtain fused data; performing feature extraction on the fused data to obtain target features; classifying the target features to obtain classification labels; performing a loss operation on the classification labels and the target labels satisfying the pre-training target to obtain a loss function; and performing model training on the model to be trained according to the loss function to obtain the trained first pre-training model.
In some examples, the model to be trained may include three modules: a multi-source information fusion module, a feature extraction module, and a classification module. For the pre-training target, the model input is an image and the geolocation encoding corresponding to the image, and the output is the geolocation encodings, in the high-precision map coordinate system, associated with the image blocks divided from the image. The preprocessing result obtained from the pre-training data set is input into the model to be trained; the multi-source information fusion module performs fusion processing of the image features and geolocation encoding features on the preprocessing result to obtain fused data; the feature extraction module performs feature extraction on the fused data to obtain target features; and the classification module classifies the target features to obtain classification labels. A loss operation is performed on the classification labels and the target labels satisfying the pre-training target to obtain a loss function, and model training is performed on the model to be trained according to the loss function to obtain the first pre-training model.
With this embodiment, the preprocessing result undergoes fusion of the image features and geolocation encoding features before feature extraction, so more accurate target features can be obtained; performing the loss operation on the classification labels obtained from those target features and the target labels satisfying the pre-training target yields a more accurate loss function; and training the model to be trained with this loss function yields a first pre-training model with more accurate positioning. In the model usage stage, for a crowdsourced image input to the model, the geographic location of each image block of the crowdsourced image in the high-precision map coordinate system can be predicted accurately based on the first pre-training model, improving positioning accuracy.
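The following is a minimal PyTorch sketch of one such training step under stated assumptions: the class name, dimensions, and the small Transformer backbone are illustrative (the text below mentions ResNet for feature extraction and an MLP classifier; concatenation is the simplest of the fusion options it describes).

```python
import torch
import torch.nn as nn

class GeoVisualPretrainModel(nn.Module):
    """Illustrative three-module model: multi-source fusion (concatenation),
    a feature-extraction backbone, and a per-block classifier."""
    def __init__(self, geo_vocab: int = 1000, k_blocks: int = 16, dim: int = 256):
        super().__init__()
        self.img_embed = nn.Sequential(nn.Conv2d(3, dim, 16, 16), nn.Flatten(2))
        self.geo_embed = nn.Embedding(geo_vocab, dim)
        self.backbone = nn.TransformerEncoder(  # stands in for a ResNet/Transformer backbone
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.classifier = nn.Linear(dim, geo_vocab)  # one geocode class per block
        self.k_blocks = k_blocks

    def forward(self, images: torch.Tensor, geo_tokens: torch.Tensor) -> torch.Tensor:
        img = self.img_embed(images).transpose(1, 2)       # (B, P, dim) patch features
        geo = self.geo_embed(geo_tokens)                   # (B, T, dim) geocode features
        fused = torch.cat([img, geo], dim=1)               # multi-source information fusion
        feats = self.backbone(fused)                       # feature extraction
        return self.classifier(feats[:, :self.k_blocks])   # (B, K, geo_vocab) block logits

# One training step toward the pre-training target.
model = GeoVisualPretrainModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
images = torch.randn(2, 3, 256, 256)                       # preprocessed high-precision images
geo_tokens = torch.randint(0, 1000, (2, 8))                # image-level geocode tokens
block_labels = torch.randint(0, 1000, (2, 16))             # target labels per image block
logits = model(images, geo_tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), block_labels.reshape(-1))
loss.backward()
opt.step()
```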
Regarding the positioning accuracy of map data: 1) statistics-based methods attempt to improve positioning accuracy by mining massive crowdsourced data to eliminate positioning deviation from a statistical viewpoint. However, on the one hand, statistics-based methods depend heavily on the quantity and quality of the data, so their robustness and generalization can hardly meet the current industrial requirements of the various scenarios involved in autonomous driving; on the other hand, limited by the inherent properties of statistical algorithms, such methods lack interpretability and are constrained at the level of technical iteration. 2) Pipeline-based methods attempt to alleviate the poor positioning accuracy of crowdsourced data by combining the processing flows of sub-modules such as trajectory correction, image registration, and depth estimation. However, because a pipeline-based method consists of independent sub-module flows, the overall accuracy obtained by combining them is easily affected by the individual performance of each sub-module, degrading accuracy; moreover, the many separate sub-modules make the overall framework redundant, future update and maintenance costs high, and complex scenes hard to handle. For example, when the shooting angles and quality of crowdsourced data and high-precision data differ greatly, algorithms such as trajectory correction or image registration are severely affected, resulting in poor positioning accuracy.
With the following application example, model training can be performed based on high-precision map data with a slow update frequency (or low sampling frequency), and crowdsourced data with a fast update frequency (or high sampling frequency) is input directly in the model usage stage, achieving end-to-end geolocation prediction for crowdsourced data and improving the accuracy and interpretability of geolocation prediction.
In this application example, the first pre-training model is obtained through pre-training on the basis of high-precision map data; the first pre-training model is a vision-based model. As shown in fig. 6, the first pre-training model takes the image information itself and the geolocation encoding information corresponding to the image information (optionally, other auxiliary information such as image segmentation information and target recognition information) as input, and predicts, in an end-to-end manner, the geolocation encoding information corresponding to each image block divided from the image information. Briefly, the high-precision map data may be preprocessed; the preprocessing covers the geolocation encoding information corresponding to the image information, and the division of the image blocks and the position encoding of each image block are performed based on the depth map and an encoding scheme (e.g., S2 encoding); a pre-training data set for model training is constructed from these data. After preprocessing, in the model training stage, the image information and its corresponding geolocation encoding information can be used as input to build a first pre-training model based on a deep neural network (DNN); the first pre-training model takes vision-based geographic information encoding (GeoCoding) as the pre-training target and is pre-trained by predicting each position of the geographic code. In the model usage stage (or model inference stage), only the images in the crowdsourced data, i.e., the crowdsourced images, and their coarse geolocation encoding information need to be used as input to the first pre-training model; the geolocation information corresponding to each image block in the crowdsourced images can then be predicted directly, improving positioning accuracy.
Specifically, the construction of the pre-training data set includes the following steps:
(1) Screening is performed on historical high-precision map data, i.e., historical high-precision images. The first condition for screening may include: the collection count over the same coverage area, and whether depth map information is included. Regions whose historical collection count exceeds 2 are selected, while ensuring coverage as wide as possible. As for depth map information, since it is effective information for computing the ground truth, only historical high-precision images with depth map information may be selected, to improve the performance of model training.
(2) Based on the data screened in (1), preprocessing can also be performed; the preprocessing mainly consists of three parts, (i)-(iii): image processing, image position information encoding, and image block division and encoding.
(i) Image processing mainly aims to reduce the difference between the training data (high-precision map data, i.e., high-precision images) and the inference data (crowdsourced data, i.e., crowdsourced images), thereby alleviating the reduction in model generalization caused by the data-domain gap. For example, the resolutions of the high-precision images and the crowdsourced images can be unified; in addition, since crowdsourced images are of relatively poor quality, the high-precision images can be preprocessed with a series of operations such as fitting illumination, adding noise, and changing contrast, bringing them close to the image quality of the crowdsourced images.
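A small sketch of such degradation preprocessing, assuming torchvision is available; the specific transforms and parameter values are illustrative assumptions, not the patent's.

```python
import torch
from torchvision import transforms

# Bring high-precision images closer to crowdsourced image quality.
degrade = transforms.Compose([
    transforms.Resize((512, 512)),                         # unify resolution
    transforms.ColorJitter(brightness=0.4, contrast=0.4),  # fit illumination / change contrast
    transforms.ToTensor(),
    transforms.Lambda(lambda x: (x + 0.02 * torch.randn_like(x)).clamp(0.0, 1.0)),  # add noise
])
# Usage: degraded = degrade(pil_image)  # pil_image: a high-precision PIL image
```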
(ii) For image position encoding, the S2 encoding method can be adopted. The real-world high-precision map coordinate system is divided into a number of tiles, each represented by a token; correspondingly, the high-precision image is cut into a sequence of non-overlapping patches, one per tile. Tokens of different lengths correspond to tiles of different granularity. Since the token representations of the same tile at granularity levels 2n-1 and 2n differ only in the last character, every position in the token can be predicted directly: the characters at level 2n-1, the characters at level 2n, and the penultimate character shared by levels 2n-1 and 2n.
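For concreteness, a sketch using the open-source s2sphere library (an assumption; the text says only "S2 encoding") showing that tokens of the same location at adjacent granularity levels share all but the trailing character:

```python
import s2sphere  # pip install s2sphere (assumed; any S2 implementation works)

latlng = s2sphere.LatLng.from_degrees(39.9042, 116.4074)  # illustrative coordinates
leaf = s2sphere.CellId.from_lat_lng(latlng)               # level-30 leaf cell
for level in (21, 22):                                    # levels 2n-1 and 2n with n = 11
    print(level, leaf.parent(level).to_token())
# The two tokens share a common prefix and differ only at the end, which is
# why each character position of the code can be treated as a prediction target.
```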
(iii) For image block division and S2 encoding, the latitude and longitude of each pixel in the high-precision image can be computed based on the depth map information matched to the image; then, using a distance d as the minimum granularity, the visible area in the high-precision image is determined and divided into K image blocks (K being a positive integer greater than 2). Each image block is encoded with the S2 encoding method of (ii); for example, d = 4 meters corresponds to level 22 of the S2 encoding. The pre-training data set is finally obtained, and includes: the image set I = {I_1, I_2, …, I_N}, the image geolocation encoding set G = {g_1, g_2, …, g_N}, and the image block geolocation encoding set B = {B_1, B_2, …, B_i, …, B_N}, where N is the number of samples in the data set, B_i (a K × S matrix) is the image block geolocation encoding of the i-th image, K is the number of blocks each image is divided into, and S is the encoding dimension of each image block. Regarding the S2 encoding scheme: the positioning accuracy of high-precision images is high, and crowdsourced images generally differ from high-precision images by a positioning deviation of tens of meters (e.g., 90 meters); through S2 encoding, the positioning ranges of the high-precision and crowdsourced images can be configured in advance so that latitudes and longitudes within tens of meters (e.g., 90 meters) of each other fall in the same code, thereby avoiding the positioning deviation between the crowdsourced and high-precision images.
Specifically, for model training, the following contents are included:
(1) As shown in fig. 7, the first pre-training model mainly includes three components: a multi-source information fusion module, a feature extraction module, and a classification module.
It should be noted that, to facilitate model update and iteration, the model may be designed with a compatible structure, so as to benefit from the rapid development of basic modules such as convolutional neural networks (CNN) or Transformer networks. Optionally, the feature extraction module is implemented with a residual neural network (ResNet), and the classification module with a multilayer perceptron (MLP). The multi-source information fusion module admits multiple implementations; the simplest fusion is to concatenate the different pieces of information along the feature-channel dimension. Optionally, fusion can instead be performed with a bilinear module: for a high-precision image input to the multi-source information fusion module and its corresponding geolocation encoding, a bilinear module is constructed to fuse these two factors (i.e., the image feature and the geographic information encoding feature). The bilinear module is a two-factor model with the mathematical property of separability, so the model can adaptively learn the image features and the geographic information encoding features, and when one factor is held fixed, the output is linear in the other factor. In this way, the image features and the geolocation encoding features can be seamlessly separated and combined. The fusion realized by the bilinear module is given by formula (1):

F = F_cv · W · F_geo    (1)

In formula (1), W is a learnable tensor of dimensions C × K × Q, where C, K, and Q are preset values; C is not fixed and can be configured as needed, K may be 100, and Q may be 300. F_cv (a C-dimensional vector) is the image feature, F_geo (a Q-dimensional vector) is the geolocation encoding feature corresponding to the image feature, and F is the K-dimensional fused feature obtained by fusing F_cv and F_geo.
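Formula (1) can be read as a bilinear form per output dimension; a minimal sketch follows (the dimension values K and Q are the examples given above, C is chosen arbitrarily):

```python
import torch

C, K, Q = 64, 100, 300                          # C configurable; K = 100, Q = 300 per the text
W = torch.randn(C, K, Q, requires_grad=True)    # learnable three-way tensor of formula (1)
F_cv = torch.randn(C)                           # image feature
F_geo = torch.randn(Q)                          # geolocation encoding feature
F = torch.einsum('c,ckq,q->k', F_cv, W, F_geo)  # K-dimensional fused feature F
assert F.shape == (K,)
```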
After fusion processing by the multi-source information fusion module, the feature extraction module performs feature extraction on the fused feature F output by the fusion module, and a classifier performs classification, giving the final loss function in formula (2):

L = − Σ_{n=1}^{N} Σ_{k=1}^{K} y_{n,k} · log(ŷ_{n,k})    (2)

In formula (2), L is the loss function; y and ŷ denote the ground-truth label and the model-predicted label, respectively, of the image block position encodings; N denotes the number of images, with n running over 1, …, N in the summation; and K denotes the number of blocks each image is divided into, with k running over 1, …, K in the summation.
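A sketch of the loss in formula (2) written out explicitly (shapes and class count are illustrative assumptions):

```python
import torch

N, K, S = 8, 16, 1024                  # images, blocks per image, encoding classes (assumed)
logits = torch.randn(N, K, S)          # classifier scores per image block
labels = torch.randint(0, S, (N, K))   # ground-truth block position-encoding labels

log_probs = torch.log_softmax(logits, dim=-1)
nll = -log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # -log(y_hat) at the true label
loss = nll.sum()                       # the double sum over n and k in formula (2);
                                       # dividing by N to average is a common variant
```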
(2) Pre-training target: vision-based GeoCoding is selected as the pre-training target. For this target, the input is the images and their corresponding geolocation encodings, and the output is the multi-level character representations (i.e., the image block geolocation encodings) of the coordinates associated with the blocks divided in the images; the model is pre-trained so that it learns the association between images and their positions in the real world.
Specifically, model usage includes the following:

(1) Based on the first pre-training model, in the model usage stage, the crowdsourced images Î = {Î_1, …, Î_N} and their coarse geolocation encoding information Ĝ = {ĝ_1, …, ĝ_N} are obtained.
(2) The encodings and the corresponding images are fed into the first pre-training model, which directly outputs the geolocation encoding information corresponding to each image block in the crowdsourced images, per formula (3):

B̂ = ERNIE-GeoV(Î, Ĝ)    (3)

In formula (3), the first pre-training model is denoted ERNIE-GeoV(·); its inputs Î and Ĝ denote the crowdsourced images and their geolocation encoding information, respectively; N denotes the number of images; and the output B̂ is the geolocation encoding information corresponding to each image block in the crowdsourced images. Thus, a crowdsourced image is input into the first pre-training model and, through the model's end-to-end processing, the high-precision latitude and longitude corresponding to each image block in the crowdsourced image can be obtained directly.
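A usage sketch of formula (3), reusing the GeoVisualPretrainModel class from the training sketch earlier (a stand-in for ERNIE-GeoV; in practice the trained weights would be loaded):

```python
import torch

model = GeoVisualPretrainModel(geo_vocab=1000, k_blocks=16)  # trained weights assumed loaded
model.eval()
crowd_images = torch.randn(4, 3, 256, 256)     # crowdsourced images (the I-hat of formula (3))
coarse_geo = torch.randint(0, 1000, (4, 8))    # coarse geolocation encoding tokens (G-hat)
with torch.no_grad():
    logits = model(crowd_images, coarse_geo)   # end-to-end forward pass
    block_codes = logits.argmax(dim=-1)        # (4, 16): predicted geocode class per image block
```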
With this application example, for the problem of poor positioning accuracy of the image information in crowdsourced data, the first pre-training model is trained in advance; it directly learns the mapping relationship between geography and vision to realize end-to-end, more accurate geographic information encoding prediction. Target data is predicted, and high-precision map data can be updated according to the target data, making it possible to update high-precision map data based on the image information in crowdsourced data. Cross-modal end-to-end training is realized based on the fusion of geolocation encoding information and image information, together with the classification processing in model training, which allows the prediction task to be re-defined as a geographic-information-encoding classification task over image blocks. Through image block division and S2 encoding of the image information, the positioning-accuracy deviation between high-precision map data and crowdsourced data can be modeled seamlessly and uniformly, and the final accuracy is expressed at the encoding-granularity level, facilitating subsequent measurement and analysis.
According to an embodiment of the present disclosure, a positioning adjustment apparatus is provided. Fig. 8 is a schematic structural diagram of the positioning adjustment apparatus according to an embodiment of the present disclosure. As shown in fig. 8, the positioning adjustment apparatus includes: a first processing module 801, configured to obtain target data according to the obtained crowdsourced data and a first pre-training model for establishing a relationship between geography and vision; and a positioning adjustment module 802, configured to adjust the positioning accuracy of the crowdsourced data according to the target data to obtain data matched with the positioning accuracy of the map data.
In one embodiment, the apparatus further comprises: an updating module, configured to update the map data according to the target data to obtain updated map data.
In one embodiment, the first processing module is configured to extract first image information and first geographical location information corresponding to the first image information from the crowdsourcing data; inputting the first image information and the first geographic position information into the first pre-training model, and establishing a mapping relation between the first image information and the first geographic position information in a map data coordinate system; predicting the geographic position information corresponding to each image block in the first image information according to the mapping relation established by the first pre-training model, and outputting the predicted second geographic position information and second image information comprising each image block; and taking the second geographical position information and the second image information as the target data.
In one embodiment, the apparatus further comprises: an encoding module, configured to encode the first geographic location information to obtain first image position encoding information.
In one embodiment, the apparatus further comprises: a dividing module, configured to perform image block division on the first image information to obtain each image block in the first image information.
In one embodiment, the first processing module is configured to input the first image information and the first image position coding information into the first pre-training model.
In one embodiment, the second geographic location information is: and coding the position of the image block corresponding to each image block in the map data coordinate system.
According to an embodiment of the present disclosure, a model training apparatus is provided. Fig. 9 is a schematic structural diagram of a training apparatus for the geographic and visual cross-modal pre-training model according to an embodiment of the present disclosure. As shown in fig. 9, the training apparatus includes: a construction module 901, configured to construct a pre-training data set based on map data; and a training module 902, configured to perform model training on a model to be trained according to the pre-training data set and a pre-training target to obtain a first pre-training model for establishing a relationship between geography and vision.
In one embodiment, the construction module is configured to, when the map data is historical map data, screen third image information satisfying a first condition and third geographical location information corresponding to the third image information from the historical map data; preprocessing the third image information and the third geographic position information to obtain a preprocessing result for representing image characteristics and geographic position coding characteristics; and constructing the pre-training data set according to the preprocessing result.
In one embodiment, the first condition comprises: carrying depth map information in the historical map data obtained from the coverage area with the historical acquisition times exceeding N times; and N is a positive integer greater than 2.
In one embodiment, the building module is configured to perform image preprocessing on the third image information to obtain fourth image information with a resolution equal to that of the crowdsourcing data; carrying out coding pretreatment on the third geographic position information to obtain fourth image position coding information; carrying out division preprocessing on the fourth image information to obtain each image block in the fourth image information; carrying out coding preprocessing on each image block in the fourth image information to obtain position coding information of each image block in the fourth image information; and using the fourth image information, the fourth image position coding information and each image block position coding information in the fourth image information as the preprocessing result.
In one embodiment, the training module is configured to: input the preprocessing result obtained from the pre-training data set into the model to be trained, and fuse the image features and the geographical position coding features in the preprocessing result to obtain fusion data; perform feature extraction on the fusion data to obtain target features; classify the target features to obtain classification labels; perform a loss operation on the classification labels and target labels meeting the pre-training target to obtain a loss function; and perform the model training on the model to be trained according to the loss function to obtain the first pre-training model.
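Pulled together, the five training steps admit a deliberately simplified realization in which fusion is an element-wise addition of visual and geographic embeddings and the pre-training target is a per-block classification; the layer sizes, the transformer encoder, and the cross-entropy loss are assumptions of this sketch rather than the disclosure's prescription.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GeoVisualPretrainModel(nn.Module):
    """Fuses image-block features with geographical position codes."""
    def __init__(self, block_dim: int, n_geo_cells: int,
                 d_model: int = 256, n_classes: int = 1000):
        super().__init__()
        self.visual_proj = nn.Linear(block_dim, d_model)     # image features
        self.geo_embed = nn.Embedding(n_geo_cells, d_model)  # position coding features
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.classifier = nn.Linear(d_model, n_classes)

    def forward(self, blocks: torch.Tensor, geo_ids: torch.Tensor) -> torch.Tensor:
        # Fusion processing: add geographical embeddings to projected blocks.
        fused = self.visual_proj(blocks) + self.geo_embed(geo_ids)
        target_features = self.encoder(fused)    # feature extraction
        return self.classifier(target_features)  # classification (logits)

def train_step(model: GeoVisualPretrainModel, blocks: torch.Tensor,
               geo_ids: torch.Tensor, target_labels: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    logits = model(blocks, geo_ids)  # (batch, num_blocks, n_classes)
    # Loss operation between the classification labels and the target
    # labels that satisfy the pre-training target.
    loss = F.cross_entropy(logits.flatten(0, 1), target_labels.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

In this sketch, blocks has shape (batch, num_blocks, block_dim), geo_ids holds per-block grid-cell indices below n_geo_cells, and target_labels holds the per-block codes serving as the target labels.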
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in FIG. 10, the electronic device 1000 includes a computing unit 1001, which can perform various appropriate actions and processes according to a computer program stored in a Read-Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. The RAM 1003 may also store various programs and data necessary for the operation of the electronic device 1000. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to one another by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in the electronic device 1000 are connected to the I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the electronic device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 1001 performs the various methods and processes described above, such as the positioning adjustment method and the training method of the geographic and visual cross-modal pre-training model. For example, in some embodiments, these methods may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the positioning adjustment method or the training method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured by any other suitable means (e.g., by means of firmware) to perform the positioning adjustment method and the training method of the geographic and visual cross-modal pre-training model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combining a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions of the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (27)

1. A training method of a geographic and visual cross-modal pre-training model, comprising:
constructing a pre-training data set based on map data;
and performing model training on the model to be trained according to the pre-training data set and the pre-training target to obtain a first pre-training model for establishing the relation between geography and vision.
2. The method of claim 1, wherein the constructing a pre-training data set based on map data comprises:
in a case where the map data is historical map data, screening, from the historical map data, third image information meeting a first condition and third geographical position information corresponding to the third image information;
preprocessing the third image information and the third geographic position information to obtain a preprocessing result for representing image characteristics and geographic position coding characteristics;
and constructing the pre-training data set according to the preprocessing result.
3. The method of claim 2, wherein the first condition comprises: the historical map data carries depth map information and is obtained from a coverage area whose historical acquisition count exceeds N, where N is a positive integer greater than 2.
4. The method of claim 2, wherein the preprocessing the third image information and the third geographical location information to obtain a preprocessing result for characterizing image features and geographical location code features comprises:
performing image preprocessing on the third image information to obtain fourth image information with the same resolution as the crowdsourcing data;
performing coding preprocessing on the third geographical position information to obtain fourth image position coding information;
performing division preprocessing on the fourth image information to obtain each image block in the fourth image information;
carrying out coding preprocessing on each image block in the fourth image information to obtain position coding information of each image block in the fourth image information;
and using the fourth image information, the fourth image position coding information and each image block position coding information in the fourth image information as the preprocessing result.
5. The method according to any one of claims 2-4, wherein the model training of the model to be trained according to the pre-training dataset and a pre-training target to obtain a first pre-training model for establishing a geographic and visual relationship comprises:
inputting the preprocessing result obtained from the pre-training data set into the model to be trained, and fusing the image features and the geographical position coding features in the preprocessing result to obtain fusion data;
performing feature extraction on the fusion data to obtain target features;
classifying the target features to obtain classification labels;
performing a loss operation on the classification labels and target labels meeting the pre-training target to obtain a loss function;
and performing the model training on the model to be trained according to the loss function to obtain the first pre-training model.
6. A method of positioning adjustment, comprising:
obtaining target data according to the obtained crowdsourcing data and a first pre-training model for establishing a geographical and visual relation;
and adjusting the positioning precision of the crowdsourcing data according to the target data to obtain data matched with the positioning precision of the map data.
7. The method of claim 6, wherein obtaining target data from the obtained crowdsourcing data and a first pre-trained model for establishing geographic and visual relationships comprises:
extracting first image information and first geographical position information corresponding to the first image information from the crowdsourcing data;
inputting the first image information and the first geographic position information into the first pre-training model, and establishing a mapping relation between the first image information and the first geographic position information in a map data coordinate system;
predicting the geographic position information corresponding to each image block in the first image information according to the mapping relation established by the first pre-training model, and outputting the predicted second geographic position information and second image information comprising each image block;
and taking the second geographical position information and the second image information as the target data.
8. The method of claim 7, further comprising:
and coding the first geographical position information to obtain first image position coding information.
9. The method of claim 8, further comprising:
and carrying out image block division on the first image information to obtain each image block in the first image information.
10. The method of claim 8, wherein said inputting the first image information and the first geographic location information into the first pre-trained model comprises:
and inputting the first image information and the first image position coding information into the first pre-training model.
11. The method according to any one of claims 7-10, wherein the second geographical position information is: image block position coding information of each image block in the map data coordinate system.
12. The method according to any one of claims 6-11, further comprising:
and updating the map data according to the target data to obtain updated map data.
13. A training apparatus for geographic and visual cross-modal pre-training models, comprising:
a construction module configured to construct a pre-training data set based on map data;
and a training module configured to perform model training on a model to be trained according to the pre-training data set and a pre-training target to obtain a first pre-training model for establishing the relationship between geography and vision.
14. The apparatus of claim 13, wherein the construction module is configured to:
in a case where the map data is historical map data, screen, from the historical map data, third image information meeting a first condition and third geographical position information corresponding to the third image information;
preprocess the third image information and the third geographical position information to obtain a preprocessing result for characterizing image features and geographical position coding features;
and construct the pre-training data set according to the preprocessing result.
15. The apparatus of claim 14, wherein the first condition comprises: the historical map data carries depth map information and is obtained from a coverage area whose historical acquisition count exceeds N, where N is a positive integer greater than 2.
16. The apparatus of claim 14, wherein the construction module is configured to:
perform image preprocessing on the third image information to obtain fourth image information with the same resolution as the crowdsourcing data;
perform coding preprocessing on the third geographical position information to obtain fourth image position coding information;
perform division preprocessing on the fourth image information to obtain each image block in the fourth image information;
perform coding preprocessing on each image block in the fourth image information to obtain position coding information of each image block in the fourth image information;
and use the fourth image information, the fourth image position coding information, and the position coding information of each image block in the fourth image information as the preprocessing result.
17. The apparatus of any of claims 14-16, wherein the training module is configured to:
input the preprocessing result obtained from the pre-training data set into the model to be trained, and fuse the image features and the geographical position coding features in the preprocessing result to obtain fusion data;
perform feature extraction on the fusion data to obtain target features;
classify the target features to obtain classification labels;
perform a loss operation on the classification labels and target labels meeting the pre-training target to obtain a loss function;
and perform the model training on the model to be trained according to the loss function to obtain the first pre-training model.
18. A positioning adjustment device, comprising:
a first processing module configured to obtain target data according to acquired crowdsourcing data and a first pre-training model for establishing a geographical and visual relationship;
and a positioning adjustment module configured to adjust the positioning accuracy of the crowdsourcing data according to the target data to obtain data matched with the positioning accuracy of the map data.
19. The apparatus of claim 18, wherein the first processing module is configured to:
extract first image information and first geographical position information corresponding to the first image information from the crowdsourcing data;
input the first image information and the first geographical position information into the first pre-training model, and establish a mapping relation between the first image information and the first geographical position information in a map data coordinate system;
predict the geographical position information corresponding to each image block in the first image information according to the mapping relation established by the first pre-training model, and output the predicted second geographical position information and second image information comprising each image block;
and take the second geographical position information and the second image information as the target data.
20. The apparatus of claim 19, further comprising: an encoding module configured to:
code the first geographical position information to obtain first image position coding information.
21. The apparatus of claim 20, further comprising: a dividing module configured to:
divide the first image information into image blocks to obtain each image block in the first image information.
22. The apparatus of claim 20, wherein the first processing module is configured to:
input the first image information and the first image position coding information into the first pre-training model.
23. The apparatus according to any of claims 19-22, wherein the second geographical position information is: image block position coding information of each image block in the map data coordinate system.
24. The apparatus of any of claims 18-23, further comprising:
an updating module configured to update the map data according to the target data to obtain updated map data.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-12.
CN202210637548.1A 2022-05-20 2022-06-07 Training method and positioning adjustment method for geographic and visual cross-mode pre-training model Active CN114998684B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210557375 2022-05-20
CN2022105573752 2022-05-20

Publications (2)

Publication Number Publication Date
CN114998684A (en) 2022-09-02
CN114998684B (en) 2023-06-23

Family

ID=83033829

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210637548.1A Active CN114998684B (en) 2022-05-20 2022-06-07 Training method and positioning adjustment method for geographic and visual cross-mode pre-training model

Country Status (1)

Country Link
CN (1) CN114998684B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107437100A (en) * 2017-08-08 2017-12-05 Chongqing University of Posts and Telecommunications An image position prediction method based on cross-modal correlation learning
CN110956214A (en) * 2019-12-03 2020-04-03 北京车和家信息技术有限公司 Training method and device for automatic driving visual positioning model
CN113672756A (en) * 2020-05-14 2021-11-19 华为技术有限公司 Visual positioning method and electronic equipment
EP3945285A1 (en) * 2020-07-31 2022-02-02 Aurora Flight Sciences Corporation, a subsidiary of The Boeing Company Evaluation of a ground region for landing a robot
CN112380317A (en) * 2021-01-18 2021-02-19 腾讯科技(深圳)有限公司 High-precision map updating method and device, electronic equipment and storage medium
CN113342912A (en) * 2021-05-24 2021-09-03 北京百度网讯科技有限公司 Geographical location area coding method, and method and device for establishing coding model
CN113392253A (en) * 2021-06-28 2021-09-14 北京百度网讯科技有限公司 Visual question-answering model training and visual question-answering method, device, equipment and medium
CN113435529A (en) * 2021-07-06 2021-09-24 北京百度网讯科技有限公司 Model pre-training method, model training method and image processing method
CN113947147A (en) * 2021-10-18 2022-01-18 北京百度网讯科技有限公司 Training method and positioning method of target map model and related devices
CN113932801A (en) * 2021-11-24 2022-01-14 王程 Crowdsourcing-based real-time matching updating method for auxiliary driving map
CN114357105A (en) * 2022-03-10 2022-04-15 北京百度网讯科技有限公司 Pre-training method and model fine-tuning method of geographic pre-training model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIZHOU HUANG et al.: "ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps", pages 1-11 *
YUQI HUO et al.: "WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training", pages 1-9 *

Also Published As

Publication number Publication date
CN114998684B (en) 2023-06-23

Similar Documents

Publication Publication Date Title
CN114357105B (en) Pre-training method and model fine-tuning method of geographic pre-training model
CN113947147B (en) Training method, positioning method and related device of target map model
CN113920307A (en) Model training method, device, equipment, storage medium and image detection method
CN113361578A (en) Training method and device of image processing model, electronic equipment and storage medium
CN113724388B (en) High-precision map generation method, device, equipment and storage medium
CN113870334A (en) Depth detection method, device, equipment and storage medium
EP4123595A2 (en) Method and apparatus of rectifying text image, training method and apparatus, electronic device, and medium
CN112581533A (en) Positioning method, positioning device, electronic equipment and storage medium
CN114715145B (en) Trajectory prediction method, device and equipment and automatic driving vehicle
CN113901998A (en) Model training method, device, equipment, storage medium and detection method
CN113902696A (en) Image processing method, image processing apparatus, electronic device, and medium
CN114283343B (en) Map updating method, training method and device based on remote sensing satellite image
CN115797565A (en) Three-dimensional reconstruction model training method, three-dimensional reconstruction device and electronic equipment
CN114596431A (en) Information determination method and device and electronic equipment
CN113688920A (en) Model training and target detection method and device, electronic equipment and road side equipment
CN113569911A (en) Vehicle identification method and device, electronic equipment and storage medium
CN113379719A (en) Road defect detection method, road defect detection device, electronic equipment and storage medium
CN114998684B (en) Training method and positioning adjustment method for geographic and visual cross-mode pre-training model
CN114925076A (en) Map data updating method and device, electronic equipment and storage medium
CN114443679A (en) Map data updating method, device, equipment and storage medium
CN114707638A (en) Model training method, model training device, object recognition method, object recognition device, object recognition medium and product
CN113706705A (en) Image processing method, device and equipment for high-precision map and storage medium
CN113936109A (en) Processing method, device and equipment of high-precision map point cloud data and storage medium
CN114926655B (en) Training method and position determining method of geographic and visual cross-mode pre-training model
CN117522614B (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant