WO2023065731A1 - Method for training target map model, positioning method, and related apparatuses - Google Patents

Method for training target map model, positioning method, and related apparatuses

Info

Publication number
WO2023065731A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
expression
image
map
character
Prior art date
Application number
PCT/CN2022/104939
Other languages
French (fr)
Chinese (zh)
Inventor
黄际洲
王海峰
卓安
孙一博
Original Assignee
北京百度网讯科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京百度网讯科技有限公司 (Beijing Baidu Netcom Science and Technology Co., Ltd.)
Publication of WO2023065731A1


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F 16/29 - Geographical information databases

Definitions

  • The present disclosure relates to the field of data processing technology, specifically to artificial intelligence technologies such as deep learning, natural language understanding, and intelligent search, and in particular to a method for training a target map model and a positioning method, as well as corresponding apparatuses, electronic devices, computer-readable storage media, and computer program products.
  • Pre-trained models have driven great progress in natural language processing and in products across multiple industries.
  • By learning from large-scale data, a pre-trained model can better model representations of characters, words, sentences, and so on.
  • Based on a pre-trained model, fine-tuning with labeled samples of a specific task can usually achieve very good results.
  • The map field is special: information processing in this field often needs to be associated with the real world. For example, in a map retrieval engine, when a user inputs a query word, the location of each candidate and its distance from the user's current location are very important ranking features.
  • At present, text data in the map field is mainly structured data, and the information it contains is relatively condensed and limited, usually only names, aliases, addresses, and categories.
  • Meanwhile, the map-field information most strongly correlated with the real world often cannot be expressed intuitively through text.
  • Embodiments of the present disclosure provide a method for training a target map model, a positioning method, and corresponding apparatuses, electronic devices, computer-readable storage media, and computer program products.
  • In a first aspect, an embodiment of the present disclosure proposes a method for training a target map model, including: obtaining a text expression, a coordinate vector expression, and a signboard image expression of each map location point; training a first sub-model according to a first training sample composed of the text expression and the coordinate vector expression corresponding to the same map location point; training a second sub-model according to a second training sample composed of the text expression and the signboard image expression corresponding to the same map location point; and fusing the first sub-model and the second sub-model to obtain the target map model.
  • In a second aspect, an embodiment of the present disclosure proposes an apparatus for training a target map model, including: a parameter acquisition unit configured to acquire the text expression, coordinate vector expression, and signboard image expression of each map location point; a first sub-model training unit configured to train a first sub-model according to a first training sample composed of the text expression and the coordinate vector expression corresponding to the same map location point; a second sub-model training unit configured to train a second sub-model according to a second training sample composed of the text expression and the signboard image expression corresponding to the same map location point; and a sub-model fusion unit configured to fuse the first sub-model and the second sub-model to obtain the target map model.
  • In a third aspect, an embodiment of the present disclosure proposes a positioning method, including: acquiring a positioning image obtained by photographing the signboard of a target building, and the current position; determining an actual coordinate vector expression according to the current position, and determining an actual signboard image expression according to the positioning image; calling the target map model to determine a shooting-position text expression corresponding to the actual coordinate vector expression; calling the target map model to determine a candidate text expression sequence corresponding to the actual signboard image expression; adjusting the presentation priority of each candidate text expression in the sequence based on the distance between the shooting-position text expression and that candidate text expression; and locating the actual position of the target building based on the candidate text expression sequence after the presentation-priority adjustment; wherein the target map model is obtained according to the training method described in any implementation of the first aspect.
  • In a fourth aspect, an embodiment of the present disclosure proposes a positioning apparatus, including: a positioning image and current position acquisition unit configured to acquire the positioning image obtained by photographing the signboard of the target building, and the current position; an actual coordinate vector expression and actual signboard image expression determination unit configured to determine the actual coordinate vector expression according to the current position and the actual signboard image expression according to the positioning image; a shooting-position text expression determination unit configured to call the target map model to determine the shooting-position text expression corresponding to the actual coordinate vector expression; a candidate text expression sequence determination unit configured to call the target map model to determine the candidate text expression sequence corresponding to the actual signboard image expression; a presentation priority adjustment unit configured to adjust the presentation priority of each candidate text expression in the sequence based on the distance between the shooting-position text expression and that candidate text expression; and an actual position determination unit configured to locate the actual position of the target building based on the candidate text expression sequence after the presentation-priority adjustment; wherein the target map model is obtained according to the apparatus for training the target map model described in the second aspect.
  • An embodiment of the present disclosure further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to implement the method for training a target map model described in any implementation of the first aspect or the positioning method described in any implementation of the third aspect.
  • An embodiment of the present disclosure further provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to implement the method for training a target map model described in any implementation of the first aspect or the positioning method described in any implementation of the third aspect.
  • An embodiment of the present disclosure further provides a computer program product including a computer program; when the computer program is executed by a processor, it implements the method for training a target map model described in any implementation of the first aspect or the positioning method described in any implementation of the third aspect.
  • The training method and positioning method for the target map model provided by the embodiments of the present disclosure are based not only on the text expression of map location points during training, but additionally introduce the coordinate vector expression and the signboard image expression. Pre-training across these multiple dimensions makes full use of the spatio-temporal big data of the map field, so that the information contained in the pre-trained model is more closely related to the real world; in practical applications, the model can then better combine the user's current position with the captured positioning image to obtain more accurate positioning results.
  • FIG. 1 is an exemplary system architecture to which the present disclosure may be applied;
  • FIG. 2 is a flowchart of a method for training a target map model provided by an embodiment of the present disclosure
  • FIG. 3 is a flow chart of a method for obtaining a coordinate vector representation of a map location point provided by an embodiment of the present disclosure
  • FIG. 4 is a flow chart of a method for acquiring a signboard image representation of a map location point provided by an embodiment of the present disclosure
  • FIG. 5 is a flowchart of a positioning method provided by an embodiment of the present disclosure.
  • FIG. 6 is a schematic diagram of a fusion of a picture sequence and a text sequence provided by an embodiment of the present disclosure
  • FIG. 7 is a structural block diagram of a training device for a target map model provided by an embodiment of the present disclosure.
  • FIG. 8 is a structural block diagram of a positioning device provided by an embodiment of the present disclosure.
  • FIG. 9 is a schematic structural diagram of an electronic device suitable for performing a target map model training method and/or a positioning method provided by an embodiment of the present disclosure.
  • FIG. 1 shows an exemplary system architecture 100 to which embodiments of the method for training a target map model, the positioning method, and the corresponding apparatuses, electronic devices, and computer-readable storage media of the present disclosure can be applied.
  • a system architecture 100 may include terminal devices 101 , 102 , 103 , a network 104 and a server 105 .
  • the network 104 is used as a medium for providing communication links between the terminal devices 101 , 102 , 103 and the server 105 .
  • Network 104 may include various connection types, such as wires, wireless communication links, or fiber optic cables, among others.
  • Users can use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages and the like.
  • Various applications for information communication with the server 105 can be installed on the terminal devices 101, 102, 103, such as map retrieval model training applications, map retrieval applications, positioning applications, and the like.
  • the terminal devices 101, 102, 103 and the server 105 may be hardware or software.
  • When the terminal devices 101, 102, 103 are hardware, they can be various electronic devices with display screens, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like; when the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above and can be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not specifically limited here.
  • When the server 105 is hardware, it can be implemented as a distributed server cluster composed of multiple servers, or as a single server; when the server 105 is software, it can be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not specifically limited here.
  • The server 105 can provide various services through various built-in applications. Taking a positioning application that provides positioning services for users as an example, the server 105 can achieve the following effects when running the positioning application: first, receive, through the network 104, the positioning images and the current positions sent by the terminal devices 101, 102, 103; then, determine the actual coordinate vector expression according to the current position, and determine the actual signboard image expression according to the positioning image; then, call the target map model to determine the shooting-position text expression corresponding to the actual coordinate vector expression; next, call the target map model to determine the candidate text expression sequence corresponding to the actual signboard image expression; next, adjust the presentation priority of each candidate text expression in the sequence based on the distance between the shooting-position text expression and that candidate text expression; finally, return the candidate text expression sequence after the presentation-priority adjustment to the terminal devices 101, 102, 103 through the network 104, so that the user can locate the actual position of the target building according to the presented results.
  • The target map model can be trained by the map retrieval model training application built into the server 105 according to the following steps: obtain the text expression, coordinate vector expression, and signboard image expression of each map location point; train a first sub-model according to a first training sample composed of the text expression and the coordinate vector expression corresponding to the same map location point; train a second sub-model according to a second training sample composed of the text expression and the signboard image expression corresponding to the same map location point; and fuse the first sub-model and the second sub-model to obtain the target map model.
  • Since model training requires strong computing power and considerable computing resources, the training method for the target map model provided by the subsequent embodiments of the present disclosure is generally executed by the server 105, and correspondingly, the apparatus for training the target map model is generally also arranged in the server 105.
  • The terminal devices 101, 102, 103 can, however, also complete, through the target map model training applications installed on them, the above computations otherwise performed by the server 105, and then output the same results as the server 105. Correspondingly, the apparatus for training the target map model can also be arranged in the terminal devices 101, 102, 103. In this case, the exemplary system architecture 100 may also exclude the server 105 and the network 104.
  • the server used to train the target map model may be different from the server used to call the trained target map model.
  • In particular, the target map model trained by the server 105 can also be distilled into a lightweight target map model suitable for placement in the terminal devices 101, 102, 103; that is, depending on the recognition accuracy actually required, one can flexibly choose between using the lightweight target map model in the terminal devices 101, 102, 103 and using the more complex target map model in the server 105.
  • The numbers of terminal devices, networks, and servers in FIG. 1 are only illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
  • FIG. 2 is a flow chart of a method for training a target map model provided by an embodiment of the present disclosure, wherein the process 200 includes the following steps:
  • Step 201 Obtain the text expression, coordinate vector expression, and signboard image expression of each map location point;
  • This step aims to obtain the text expression, coordinate vector expression, and signboard image expression of each map location point by the execution subject of the training method of the target map model (for example, the server 105 shown in FIG. 1 ).
  • The text expression is the map location point described in text form, for example "XX Building" (or square, hospital, restaurant, and so on); the coordinate vector expression is the vectorized expression of the geographic coordinates of the map location point (for example, the building at a given street number) in the real world.
  • The signboard image expression is the image description of the signboard of the object (usually a building) at the map location point; that is, the signboard style customized for the building is used to reflect the building's location, embodying the strong relationship between the two.
  • The coordinate vector expression can be obtained by converting coordinates in various forms into a vector using various vectorized encoding methods: for example, the boundary coordinate sequence of the map location point corresponding to a building can be converted into a vector expression, and on this basis the boundary coordinate sequence can be processed further, such as by introducing a geocoding algorithm to convert the boundary coordinate sequence into another form of representation, or by simply using the four corner coordinates of the rectangle framing the building as the boundary coordinates and obtaining a vectorized expression of those boundary coordinates through, for example, a hash algorithm. This is not specifically limited here, and an appropriate processing method can be selected according to the actual application scenario.
  • The signboard image expression is an expression that captures the image characteristics of the signboard based on the image taken of it, involving, for example, the shooting resolution, character clarity, character image extraction, and related processing methods.
  • Step 202 According to the first training sample composed of text expression and coordinate vector expression corresponding to the same map location point, train to obtain the first sub-model;
  • On the basis of step 201, this step aims to construct, by the above-mentioned execution subject, a first training sample that takes the text expression corresponding to a map location point as the sample input and the coordinate vector expression corresponding to the same map location point as the sample output, and to train the first sub-model with such samples, so that the trained first sub-model establishes the correspondence between the text expression and the coordinate vector expression of the same map location point. This makes it convenient to subsequently match the corresponding output information for given input information according to this correspondence; for example, the coordinate vector of the current position can be input to match the text description of the current location.
  • Step 203 According to the second training sample composed of text expression and signboard image expression corresponding to the same map location point, train to obtain the second sub-model;
  • Similarly, this step aims to construct a second training sample that takes the text expression corresponding to a map location point as the sample input and the signboard image expression corresponding to the same map location point as the sample output, and to train the second sub-model with such samples, so that the trained second sub-model establishes the correspondence between the text expression and the signboard image expression of the same map location point. This makes it convenient to subsequently match the corresponding output information for given input information; for example, an image taken of a building's signboard can be input to match the text description of the corresponding building.
  • Step 204 Fusion of the first sub-model and the second sub-model to obtain the target map model.
  • On the basis of steps 202 and 203, this step aims to fuse, by the above-mentioned execution subject, the first sub-model and the second sub-model, so that the text expressions in the two sub-models serve as connection points, thereby establishing the correspondence among the text expression, the coordinate vector expression, and the signboard image expression of the same map location point (similar to a correspondence among A, B, and C). The fused target map model can thus accurately determine the remaining item from any one or two of the three.
  • It should be noted that although the first sub-model and the second sub-model are trained independently, fusion does not mean that the fused initial map model no longer needs training; a few more rounds of training are often needed so that the overall parameters of the fused map model become optimal as a whole. For example, first fuse the first sub-model and the second sub-model to obtain an initial map model; then adjust the parameters of the initial map model until a preset iteration exit condition is met, and output the initial map model that meets the iteration exit condition as the target map model.
  • In most cases, the iteration exit condition set for the fused map model differs from the iteration exit conditions set for the first sub-model and the second sub-model, unless in some scenarios the exit condition is a generic one, such as whether the accuracy difference between adjacent iterations meets the requirement.
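  • A minimal sketch of the fuse-then-fine-tune flow described above, assuming PyTorch; the fused model, data loader, and exit threshold are placeholders rather than values from the disclosure, and the exit condition shown is the "difference between adjacent iterations" style condition mentioned above.

```python
import torch

def finetune_fused_model(fused_model, loader, max_rounds=10, tol=1e-3):
    """Adjust the fused (initial) map model until the preset iteration exit condition is met."""
    optimizer = torch.optim.AdamW(fused_model.parameters(), lr=1e-5)
    prev_loss = float("inf")
    for _ in range(max_rounds):
        total_loss = 0.0
        for batch in loader:
            optimizer.zero_grad()
            loss = fused_model(**batch)   # assume the model returns its training loss
            loss.backward()
            optimizer.step()
            total_loss += loss.item()
        if abs(prev_loss - total_loss) < tol:   # exit condition met
            break
        prev_loss = total_loss
    return fused_model   # output as the target map model
```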
  • The first sub-model, the second sub-model, and the fused map model in this embodiment can be implemented with various model frameworks, for example based on the BERT (Bidirectional Encoder Representations from Transformers) model widely used in natural language processing; other models with similar effects can also be used and are not listed here one by one.
  • The training method for the target map model provided by the embodiments of the present disclosure is based not only on the text expression of map location points, but additionally introduces the coordinate vector expression and the signboard image expression during training, so that pre-training across multiple dimensions makes full use of the spatio-temporal big data of the map field; the information contained in the pre-trained model is thus more closely related to the real world, and in practical applications the model can better combine the user's current position with the captured positioning image to obtain more accurate positioning results.
  • FIG. 3 is a flowchart of a method for obtaining the coordinate vector expression of a map location point provided by an embodiment of the present disclosure; it provides a specific implementation for the coordinate vector expression in step 201 of the process 200 shown in FIG. 2. The other steps in the process 200 are not adjusted, and a new complete embodiment is obtained by replacing the corresponding step with the specific implementation provided in this embodiment.
  • the process 300 includes the following steps:
  • Step 301 Acquire the boundary coordinate sequences of each map location point respectively;
  • For a map location point, the coordinate sequence of the outer contours of all buildings belonging to that point constitutes the boundary coordinate sequence; for example, when a location point covers several buildings, the geographic coordinate sequence of the outer contours of the outermost buildings is taken as the boundary coordinate sequence. The frequency or interval at which points are sampled from the continuous outer contour can be set as needed.
  • Step 302 Using the geocoding algorithm and the boundary coordinate sequence, calculate the geocoding set covering the geographic block where the corresponding map location point is located;
  • On the basis of step 301, this step aims for the above-mentioned execution subject to use the geocoding algorithm and the boundary coordinate sequence to calculate the geocode set covering the geographic block where the corresponding map location point is located; that is, each geocode in the geocode set corresponds to a boundary coordinate in the boundary coordinate sequence.
  • the geocoding algorithm can specifically choose the Geohash algorithm, Google s2 algorithm, etc.
  • the Geohash algorithm is an address encoding method that encodes two-dimensional spatial longitude and latitude data into a string
  • The Google S2 algorithm comes from geometric mathematics: S² is the mathematical symbol for the unit sphere, and the S2 algorithm is designed to solve various geometric problems on the sphere; since the real world is essentially spherical, the algorithm can also be used for address encoding.
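  • As a concrete illustration of step 302, the sketch below encodes each boundary coordinate with a standard Geohash encoder, so that the resulting geocode set covers the geographic block of the map location point; the function names and the precision value are illustrative choices, not taken from the disclosure.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet

def geohash_encode(lat, lon, precision=7):
    """Encode a (latitude, longitude) pair into a Geohash string."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    bits, chars, even = [], [], True  # Geohash interleaves bits, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
        if len(bits) == 5:
            chars.append(BASE32[int("".join(map(str, bits)), 2)])
            bits = []
    return "".join(chars)

def boundary_to_geocode_set(boundary_coords, precision=7):
    # one geocode per boundary coordinate; together they cover the geographic block
    return {geohash_encode(lat, lon, precision) for lat, lon in boundary_coords}
```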
  • Step 303 converting the geocode set containing each geocode into a geographic string
  • On the basis of step 302, this step aims for the above-mentioned execution subject to convert the geocode set containing each geocode into a geographic character string; for example, each geocode in the geocode set can be organized into a tree structure according to its hierarchy and traversed in a fixed order, converting the set into a geographic string that represents the geographic area within which this map location point falls.
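  • One simple fixed-order traversal consistent with step 303 is sketched below, assuming the geocode set from the previous sketch; the separator and the lexicographic ordering (which groups geocodes sharing a parent cell, approximating a pre-order walk of the hierarchy tree) are illustrative choices.

```python
def geocode_set_to_string(geocodes, separator="|"):
    """Convert a geocode set into a geographic string via a fixed traversal order."""
    # sorting places geocodes with common prefixes (the same parent cell) next to each other
    return separator.join(sorted(geocodes))
```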
  • Step 304: convert the geographic character string into a geographic vector, and use the geographic vector as the coordinate vector expression of the corresponding map location point.
  • On the basis of step 303, this step aims for the execution subject to convert the geographic character string into a geographic vector and to use that geographic vector as the coordinate vector expression of the corresponding map location point.
  • The conversion rule between geographic strings and geographic vectors can be defined as needed, or a model that outputs results in vector form can be used: for example, the geographic string is input into a preset vector-expression conversion model, where the vector-expression conversion model characterizes the correspondence between geographic strings and geographic vectors (for example, a convolutional neural network or a recurrent neural network), and the geographic vector output by the conversion model is then received.
  • In addition, a vectorized vocabulary of geographic blocks can be constructed at the levels of country, province, city, district/county, and road, so that each geographic entity is assigned its corresponding geographic block vector.
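  • A minimal sketch of such a geographic-block vocabulary lookup, assuming PyTorch; the vocabulary contents, token separator, pooling, and embedding dimension are placeholders rather than details from the disclosure.

```python
import torch
import torch.nn as nn

class GeoBlockEmbedding(nn.Module):
    """Maps a geographic string (a sequence of block tokens) to a geographic vector."""
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.token_to_id = {tok: i for i, tok in enumerate(vocab)}  # country/province/city/district/road tokens
        self.embed = nn.Embedding(len(vocab), dim)

    def forward(self, geo_string: str) -> torch.Tensor:
        ids = torch.tensor([self.token_to_id[t] for t in geo_string.split("/")])  # "/" is an assumed separator
        return self.embed(ids).mean(dim=0)  # pooled vector used as the coordinate vector expression
```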
  • FIG. 4 is a flowchart of a method for obtaining the signboard image expression of a map location point provided by an embodiment of the present disclosure; it provides a specific implementation for the signboard image expression in step 201 of the process 200 shown in FIG. 2.
  • the process 400 includes the following steps:
  • Step 401 Obtain the signboard image of the building corresponding to each map location point respectively;
  • This step aims firstly at obtaining, by the above-mentioned execution subject, the signboard image of the building corresponding to each map location point. Shooting parameters should be kept as consistent as possible across different signboards, such as the shooting equipment, lighting, angle, resolution, and weather, so as to avoid differences between different signboard images.
  • Step 402 Identify the character part from the signboard image, and cut out the character image corresponding to each character of the character part;
  • Step 403 Arranging each character image according to the sequence of each character in the character part, and using the obtained character image array as a signboard image representation of the corresponding map location point.
  • Step 402 aims to recognize, by the above-mentioned execution subject, the character part from the signboard image and to cut out a character image corresponding to each character of the character part; then, through step 403, the character images are arranged according to the order of the characters in the character part, and the resulting character image sequence is used as the signboard image expression describing the features of the signboard.
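  • A simplified sketch of steps 402-403, assuming a hypothetical single-character detector that returns one bounding box per character (the detector itself is a placeholder, not the model used in the disclosure); the crops are ordered left to right to form the signboard image expression.

```python
from PIL import Image

def signboard_image_expression(signboard_path, detect_characters):
    """detect_characters(image) is assumed to return (left, top, right, bottom) boxes, one per character."""
    image = Image.open(signboard_path)
    boxes = detect_characters(image)
    boxes = sorted(boxes, key=lambda box: box[0])  # arrange in the order the characters appear
    return [image.crop(box) for box in boxes]      # character image sequence
```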
  • Optionally, before the character part is recognized from the signboard image, image abnormality recognition may be performed on the signboard image, where the image abnormality recognition may include at least one of fuzzy recognition, noise recognition, and skew recognition.
  • On the basis of the target map model obtained in the above embodiments, a positioning method is further provided; refer to the steps included in the process 500 shown in FIG. 5:
  • Step 501 Obtain the image for positioning and the current position obtained by shooting the signboard of the target building;
  • This step is aimed at obtaining the image for positioning and the current position obtained by photographing the signboard of the target building by the above-mentioned execution subject.
  • The target building is necessarily a building within the user's field of vision, that is, located in the vicinity of the user who initiated the positioning request, and the current position is the geographic coordinates returned by the internal positioning component (such as a GPS component or a base-station interaction component) of the device held by the user.
  • Step 502 Determine the actual coordinate vector expression according to the current position, and determine the actual signboard image expression according to the positioning image;
  • this step aims to determine the actual coordinate vector expression according to the current position, and determine the actual signboard image expression according to the positioning image, that is, convert the coordinates into the corresponding vector expression, and convert the image into the corresponding image expression.
  • Step 503 call the target map model to determine the text expression of the shooting location corresponding to the actual coordinate vector expression
  • On the basis of step 502, this step aims for the above-mentioned execution subject to call the correspondence between coordinate vector expressions and text expressions recorded by the target map model, and to determine the shooting-position text expression corresponding to the actual coordinate vector expression.
  • Step 504 Calling the target map model to determine an alternative text expression sequence corresponding to the actual signboard image expression
  • On the basis of step 502, the execution subject calls the correspondence between signboard image expressions and text expressions recorded by the target map model, and determines the candidate text expression sequence corresponding to the actual signboard image expression. A sequence of candidate text expressions is used because, owing to shooting conditions or factors affecting the photographer, the actual signboard image often does not contain complete image information, so in most cases there will be multiple candidate text expressions.
  • Step 505: adjust the presentation priority of each candidate text expression in the sequence based on the distance between the shooting-position text expression and each candidate text expression in the candidate text expression sequence;
  • On the basis of steps 503 and 504, this step aims to adjust the presentation priority of each candidate text expression in the sequence based on the distance between the shooting-position text expression and each candidate text expression in the candidate text expression sequence. That is, the larger the distance, the lower the presentation priority of the corresponding candidate text expression in the sequence (for example, the lower it is ranked); conversely, the smaller the distance, the higher the presentation priority of the corresponding candidate text expression (for example, the higher it is ranked).
  • Step 506 Based on the candidate text expression sequence after the presentation priority adjustment, locate the actual location of the target building.
  • On the basis of step 505, this step aims to locate the actual position of the target building based on the candidate text expression sequence after the presentation-priority adjustment. In this way, the actual position of the target building to be located can be determined more quickly and accurately.
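  • A simplified end-to-end sketch of steps 502-506, assuming the target map model exposes lookup helpers and that the "distance" is computed between vector representations of the text expressions; every name below is an illustrative placeholder rather than an interface defined in the disclosure.

```python
import numpy as np

def locate_target_building(current_position, positioning_image, model):
    coord_vec = model.encode_coordinates(current_position)        # actual coordinate vector expression
    sign_expr = model.encode_signboard(positioning_image)         # actual signboard image expression
    shoot_text = model.text_for_coordinates(coord_vec)            # shooting-position text expression
    candidates = model.candidate_texts_for_signboard(sign_expr)   # candidate text expression sequence

    def distance(a, b):
        return np.linalg.norm(model.embed_text(a) - model.embed_text(b))

    # smaller distance to the shooting-position text expression => higher presentation priority
    ranked = sorted(candidates, key=lambda cand: distance(shoot_text, cand))
    return ranked[0]   # the top-priority candidate is taken as the target building's location
```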
  • the embodiment of the present disclosure also provides a model pre-training method based on the guiding idea of multimodal geographic knowledge enhancement.
  • Multimodal geographic knowledge enhancement means having the model explicitly learn knowledge beyond universal text knowledge during the pre-training process, by improving the model structure or adding pre-training tasks.
  • The map usage scenario targeted by this embodiment utilizes geographic-domain data in three modalities: text, geographic coordinates, and signboard images.
  • In the model pre-training stage, multiple tasks are used together with changes to the model structure, so that geographic knowledge fully related to the real world is incorporated into the pre-trained model.
  • the main parts of the above model pre-training method include: integrating geographic coordinate information into the model, and multimodal geographic information fusion learning.
  • As shown in FIG. 6, the input can be divided into a text sequence (the text sequence of "XXX Eye Hospital, No. 18, Chuanhui Road, Shanghai") and a picture sequence (the character image sequence of "XXX Eye Hospital").
  • The text sequence is processed using the above-mentioned improved model structure that additionally introduces geographic coordinate vectors; that is, the Embed of the text sequence represents the text feature sequence, incorporating geographic location information, that is generated before being input into the transformer layers. For the picture sequence, a pre-trained single-character text detection model recognizes the picture of each character on the signboard image, and the Embed of the picture sequence represents the image feature sequence extracted by an existing image feature extraction model (such as ResNet, a residual network) from those character pictures, fused with the geographic coordinate vector.
  • In FIG. 6, TRM stands for a transformer layer, and Co-TRM stands for a layer in which the two different modalities perform information interaction.
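  • A minimal sketch of one such cross-modal interaction (Co-TRM-like) layer, assuming PyTorch; it is only an illustrative reconstruction of "information interaction between the two modalities", not the exact layer used in the disclosure.

```python
import torch.nn as nn

class CrossModalLayer(nn.Module):
    """Text features attend to image features and vice versa, then pass through TRM-like blocks."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.text_to_image = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_trm = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.image_trm = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, text_feats, image_feats):
        # each modality queries the other (the Co-TRM information interaction)
        text_ctx, _ = self.text_to_image(text_feats, image_feats, image_feats)
        image_ctx, _ = self.image_to_text(image_feats, text_feats, text_feats)
        # followed by ordinary transformer (TRM) layers per modality
        return self.text_trm(text_feats + text_ctx), self.image_trm(image_feats + image_ctx)
```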
  • the model is pre-trained with two tasks:
  • Signboard-text matching task: given a text sequence and a signboard picture sequence, predict whether the description in the text is consistent with what is expressed in the signboard pictures.
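  • A minimal sketch of how the signboard-text matching task could be scored during pre-training, assuming PyTorch and that pooled text/image features are available from cross-modal layers like the one above; the pooling and classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SignboardTextMatchingHead(nn.Module):
    """Binary prediction: is the text description consistent with the signboard picture sequence?"""
    def __init__(self, dim=768):
        super().__init__()
        self.classifier = nn.Linear(2 * dim, 2)

    def forward(self, text_feats, image_feats, labels=None):
        pooled = torch.cat([text_feats.mean(dim=1), image_feats.mean(dim=1)], dim=-1)
        logits = self.classifier(pooled)
        if labels is not None:
            return nn.functional.cross_entropy(logits, labels)  # pre-training loss for this task
        return logits
```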
  • In this way, this embodiment simultaneously utilizes geographic-domain data in the three modalities of text, geographic coordinates, and signboard images and, through multiple tasks in the model pre-training stage together with changes to the model structure, incorporates geographic knowledge fully related to the real world into the pre-trained model; more complete spatio-temporal semantics are thus modeled for downstream tasks, improving various related functions such as search in map products.
  • As implementations of the methods shown in the above figures, the present disclosure further provides an embodiment of an apparatus for training a target map model and an embodiment of a positioning apparatus. The apparatus embodiment for training the target map model corresponds to the method embodiment shown in FIG. 2, and the positioning apparatus embodiment corresponds to the positioning method embodiment. The above apparatuses can be specifically applied to various electronic devices.
  • the target map model training apparatus 700 of this embodiment may include: a parameter acquisition unit 701 , a first sub-model training unit 702 , a second sub-model training unit 703 , and a sub-model fusion unit 704 .
  • the parameter acquisition unit 701 is configured to acquire the text expression, coordinate vector expression, and signboard image expression of each map location point;
  • The first sub-model training unit 702 is configured to train a first sub-model according to a first training sample composed of the text expression and the coordinate vector expression corresponding to the same map location point;
  • the second sub-model training unit 703 is configured to train a second sub-model according to a second training sample composed of the text expression and the signboard image expression corresponding to the same map location point;
  • the sub-model fusion unit 704 is configured to fuse the first sub-model and the second sub-model to obtain the target map model.
  • In the target map model training apparatus 700, for the specific processing of the parameter acquisition unit 701, the first sub-model training unit 702, the second sub-model training unit 703, and the sub-model fusion unit 704, as well as the resulting technical effects, reference may be made to the related descriptions of steps 201-204 in the embodiment corresponding to FIG. 2, which will not be repeated here.
  • the parameter acquisition unit 701 may include a coordinate vector expression acquisition subunit configured to acquire the coordinate vector expression of each map location point, and the coordinate vector expression acquisition subunit may include:
  • the boundary coordinate sequence acquisition module is configured to respectively acquire the boundary coordinate sequences of each map location point;
  • the geocoding set calculation module is configured to use the geocoding algorithm and the boundary coordinate sequence to calculate the geocoding set covering the geographic block where the corresponding map location point is located;
  • a geographic string conversion module configured to convert a geocode set containing each geocode into a geographic string
  • the geographic vector conversion module is configured to convert the geographic character string into a geographic vector, and express the geographic vector as a coordinate vector of a corresponding map location point.
  • the geographic vector conversion module may be further configured to:
  • the parameter acquiring unit 701 may include a signboard image expression acquiring subunit configured to acquire the signboard image expression of each map location point, and the signboard image expression acquiring subunit may include:
  • the signboard image acquisition module is configured to respectively acquire the signboard images of the buildings corresponding to each map location point;
  • the character recognition and character image cutting module is configured to recognize the character part from the signboard image, and cut out a character image corresponding to each character of the character part;
  • the character image sorting module is configured to arrange the character images according to the character sequence of the character part, and use the obtained character image queue as a signboard image expression of the corresponding map location point.
  • the signboard image expression acquisition subunit may also include:
  • the abnormality recognition module is configured to perform image abnormality recognition on the signboard image before identifying the character part from the signboard image; wherein, the image abnormality recognition includes at least one of fuzzy recognition, noise recognition, and skew recognition;
  • the character recognition sub-module in the character recognition and character image cutting module can be further configured as:
  • the sub-model fusion unit 704 may be further configured to:
  • The positioning apparatus 800 of this embodiment may include: a positioning image and current position acquisition unit 801, an actual coordinate vector expression and actual signboard image expression determination unit 802, a shooting-position text expression determination unit 803, a candidate text expression sequence determination unit 804, a presentation priority adjustment unit 805, and an actual position determination unit 806.
  • The positioning image and current position acquisition unit 801 is configured to acquire the positioning image obtained by photographing the signboard of the target building, and the current position; the actual coordinate vector expression and actual signboard image expression determination unit 802 is configured to determine the actual coordinate vector expression according to the current position and the actual signboard image expression according to the positioning image;
  • the shooting-position text expression determination unit 803 is configured to call the target map model to determine the shooting-position text expression corresponding to the actual coordinate vector expression; the candidate text expression sequence determination unit 804 is configured to call the target map model to determine the candidate text expression sequence corresponding to the actual signboard image expression;
  • the presentation priority adjustment unit 805 is configured to adjust the presentation priority of each candidate text expression in the sequence based on the distance between the shooting-position text expression and each candidate text expression in the candidate text expression sequence;
  • the actual position determination unit 806 is configured to locate the actual position of the target building based on the candidate text expression sequence after the presentation-priority adjustment; wherein the target map model is obtained according to the training apparatus 700 for the target map model.
  • In the positioning apparatus 800, for the specific processing of the positioning image and current position acquisition unit 801, the actual coordinate vector expression and actual signboard image expression determination unit 802, the shooting-position text expression determination unit 803, the candidate text expression sequence determination unit 804, the presentation priority adjustment unit 805, and the actual position determination unit 806, as well as the resulting technical effects, reference may be made to the related descriptions in the corresponding method embodiments, which will not be repeated here.
  • This embodiment exists as a device embodiment corresponding to the above-mentioned method embodiment.
  • The training apparatus and positioning apparatus for the target map model provided by this embodiment are based not only on the text expression of map location points during training, but additionally introduce the coordinate vector expression and the signboard image expression, so that pre-training across multiple dimensions makes full use of the spatio-temporal big data of the map field and the information contained in the pre-trained model is more closely related to the real world; in practical applications, the apparatuses can then better combine the user's current position with the captured positioning image to obtain more accurate positioning results.
  • According to an embodiment of the present disclosure, the present disclosure also provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to implement the training method and/or positioning method for the target map model described in any of the above embodiments.
  • According to an embodiment of the present disclosure, the present disclosure also provides a readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to implement the training method and/or positioning method for the target map model described in any of the above embodiments.
  • An embodiment of the present disclosure further provides a computer program product including a computer program; when the computer program is executed by a processor, the training method and/or positioning method for the target map model described in any of the above embodiments can be implemented.
  • FIG. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure.
  • Electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers.
  • Electronic devices may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smart phones, wearable devices, and other similar computing devices.
  • the components shown herein, their connections and relationships, and their functions, are by way of example only, and are not intended to limit implementations of the disclosure described and/or claimed herein.
  • The device 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 902 or loaded from a storage unit 908 into a random-access memory (RAM) 903. The RAM 903 can also store various programs and data necessary for the operation of the device 900.
  • the computing unit 901, ROM 902, and RAM 903 are connected to each other through a bus 904.
  • An input/output (I/O) interface 905 is also connected to the bus 904 .
  • the I/O interface 905 includes: an input unit 906, such as a keyboard, a mouse, etc.; an output unit 907, such as various types of displays, speakers, etc.; a storage unit 908, such as a magnetic disk, an optical disk, etc. ; and a communication unit 909, such as a network card, a modem, a wireless communication transceiver, and the like.
  • the communication unit 909 allows the device 900 to exchange information/data with other devices over a computer network such as the Internet and/or various telecommunication networks.
  • The computing unit 901 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, central processing units (CPUs), graphics processing units (GPUs), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, digital signal processors (DSPs), and any suitable processors, controllers, microcontrollers, and the like.
  • the computing unit 901 executes various methods and processes described above, such as a training method and/or a positioning method of a target map model.
  • In some embodiments, the target map model training method and/or positioning method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908.
  • part or all of the computer program may be loaded and/or installed on the device 900 via the ROM 902 and/or the communication unit 909 .
  • When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the training method and/or positioning method for the target map model described above can be performed.
  • the computing unit 901 may be configured in any other appropriate way (for example, by means of firmware) to execute a target map model training method and/or positioning method.
  • Various implementations of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof.
  • The programmable processor can be a special-purpose or general-purpose programmable processor, and can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
  • Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that when the program codes are executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Read Only Memory (EPROM or flash memory), optical fiber, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer.
  • Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, speech input, or tactile input).
  • The systems and techniques described herein can be implemented in a computing system that includes back-end components (e.g., as a data server), or a computing system that includes middleware components (e.g., an application server), or a computing system that includes front-end components (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end components, middleware components, or front-end components.
  • the components of the system can be interconnected by any form or medium of digital data communication, eg, a communication network. Examples of communication networks include: Local Area Network (LAN), Wide Area Network (WAN) and the Internet.
  • a computer system may include clients and servers.
  • Clients and servers are generally remote from each other and typically interact through a communication network.
  • the relationship of client and server arises by computer programs running on the respective computers and having a client-server relationship to each other.
  • The server can be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the defects of difficult management and weak business scalability in traditional physical host and virtual private server (VPS) services.
  • The technical solutions of the embodiments of the present disclosure are based not only on the text expression of map location points, but additionally introduce the coordinate vector expression and the signboard image expression during training, so that pre-training across multiple dimensions makes full use of the spatio-temporal big data of the map field; the information contained in the pre-trained model is thus more closely related to the real world, and in practical applications the model can better combine the user's current position with the captured positioning image to obtain more accurate positioning results.
  • steps may be reordered, added or deleted using the various forms of flow shown above.
  • each step described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution disclosed in the present disclosure can be achieved, no limitation is imposed herein.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Remote Sensing (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to the technical field of artificial intelligence, such as deep learning, natural language understanding and intelligent search. Provided are a method and apparatus for training a target map model, a positioning method and apparatus, an electronic device, a computer-readable storage medium, and a computer program product. The method for training a target map model comprises: acquiring a text expression, a coordinate vector expression and a signboard image expression of each map location point; performing training according to a first training sample composed of the text expression and the coordinate vector expression that correspond to the same map location point, so as to obtain a first sub-model; performing training according to a second training sample composed of the text expression and the signboard image expression that correspond to the same map location point, so as to obtain a second sub-model; and fusing the first sub-model and the second sub-model, so as to obtain a target map model. A target map model that is obtained by means of training using the method can better combine the current location of a user with an image that is captured for positioning, so as to obtain a more accurate positioning result.

Description

Target Map Model Training Method, Positioning Method, and Related Apparatuses
Cross-Reference to Related Applications
This patent application claims priority to the Chinese patent application No. 202111211145.2, filed on October 18, 2021 and entitled "Target Map Model Training Method, Positioning Method, and Related Apparatuses", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of data processing technology, specifically to the field of artificial intelligence technology such as deep learning, natural language understanding, and intelligent search, and in particular to a training method and a positioning method for a target map model, as well as corresponding apparatuses, electronic devices, computer-readable storage media, and computer program products.
Background
Pre-trained models have made great progress in the field of natural language processing and in products across multiple industries. By learning from large-scale data, a pre-trained model can better model the representations of characters, words, sentences, and the like. Based on a pre-trained model, fine-tuning the model with labeled samples of a specific task can usually achieve very good results.
The map field is special in that its information processing often needs to be associated with the real world. For example, in a map retrieval engine, when a user inputs a query word, the position of a candidate itself and its distance from the user's current location are very important ranking features.
The text data in the map field is currently dominated by structured data, and the information it contains is relatively condensed and limited, usually covering only names, aliases, addresses, and categories. However, the information in the map field that is strongly correlated with the real world often cannot be expressed intuitively through text.
Summary
Embodiments of the present disclosure provide a training method and a positioning method for a target map model, as well as corresponding apparatuses, electronic devices, computer-readable storage media, and computer program products.
In a first aspect, an embodiment of the present disclosure provides a training method for a target map model, including: acquiring a text expression, a coordinate vector expression, and a signboard image expression of each map location point; training a first sub-model according to a first training sample composed of the text expression and the coordinate vector expression that correspond to the same map location point; training a second sub-model according to a second training sample composed of the text expression and the signboard image expression that correspond to the same map location point; and fusing the first sub-model and the second sub-model to obtain the target map model.
In a second aspect, an embodiment of the present disclosure provides a training apparatus for a target map model, including: a parameter acquisition unit configured to acquire a text expression, a coordinate vector expression, and a signboard image expression of each map location point; a first sub-model training unit configured to train a first sub-model according to a first training sample composed of the text expression and the coordinate vector expression that correspond to the same map location point; a second sub-model training unit configured to train a second sub-model according to a second training sample composed of the text expression and the signboard image expression that correspond to the same map location point; and a sub-model fusion unit configured to fuse the first sub-model and the second sub-model to obtain the target map model.
In a third aspect, an embodiment of the present disclosure provides a positioning method, including: acquiring a positioning image obtained by photographing the signboard of a target building, and a current position; determining an actual coordinate vector expression according to the current position, and determining an actual signboard image expression according to the positioning image; calling the target map model to determine a shooting position text expression corresponding to the actual coordinate vector expression; calling the target map model to determine a candidate text expression sequence corresponding to the actual signboard image expression; adjusting the presentation priority ranking of each candidate text expression in the sequence based on the distance between the shooting position text expression and each candidate text expression in the candidate text expression sequence; and locating the actual position of the target building based on the candidate text expression sequence after the presentation priority adjustment; where the target map model is obtained by the training method for a target map model as described in any implementation of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a positioning apparatus, including: a positioning-image and current-position acquisition unit configured to acquire a positioning image obtained by photographing the signboard of a target building, and a current position; an actual coordinate vector expression and actual signboard image expression determination unit configured to determine an actual coordinate vector expression according to the current position and an actual signboard image expression according to the positioning image; a shooting position text expression determination unit configured to call the target map model to determine a shooting position text expression corresponding to the actual coordinate vector expression; a candidate text expression sequence determination unit configured to call the target map model to determine a candidate text expression sequence corresponding to the actual signboard image expression; a presentation priority adjustment unit configured to adjust the presentation priority ranking of each candidate text expression in the sequence based on the distance between the shooting position text expression and each candidate text expression in the candidate text expression sequence; and an actual position determination unit configured to locate the actual position of the target building based on the candidate text expression sequence after the presentation priority adjustment; where the target map model is obtained by the training apparatus for a target map model as described in any implementation of the second aspect.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can implement the training method for a target map model as described in any implementation of the first aspect or the positioning method as described in any implementation of the third aspect.
In a sixth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium storing computer instructions, where the computer instructions are used to enable a computer to implement the training method for a target map model as described in any implementation of the first aspect or the positioning method as described in any implementation of the third aspect.
In a seventh aspect, an embodiment of the present disclosure provides a computer program product including a computer program which, when executed by a processor, implements the training method for a target map model as described in any implementation of the first aspect or the positioning method as described in any implementation of the third aspect.
In the training method and positioning method for a target map model provided by the embodiments of the present disclosure, training is not only based on the text expressions of map location points but additionally introduces coordinate vector expressions and signboard image expressions, so that the model pre-trained across multiple dimensions makes full use of the spatiotemporal big data of the map field and the information contained in the pre-trained model becomes more closely related to the real world; in actual applications, the user's current position and the captured positioning image can therefore be better combined to obtain a more accurate positioning result.
It should be understood that what is described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will be readily understood from the following description.
Brief Description of the Drawings
Other features, objects, and advantages of the present disclosure will become more apparent by reading the following detailed description of non-limiting embodiments made with reference to the accompanying drawings:
Fig. 1 is an exemplary system architecture to which the present disclosure may be applied;
Fig. 2 is a flowchart of a training method for a target map model provided by an embodiment of the present disclosure;
Fig. 3 is a flowchart of a method for acquiring a coordinate vector expression of a map location point provided by an embodiment of the present disclosure;
Fig. 4 is a flowchart of a method for acquiring a signboard image expression of a map location point provided by an embodiment of the present disclosure;
Fig. 5 is a flowchart of a positioning method provided by an embodiment of the present disclosure;
Fig. 6 is a schematic diagram of fusing a picture sequence and a text sequence provided by an embodiment of the present disclosure;
Fig. 7 is a structural block diagram of a training apparatus for a target map model provided by an embodiment of the present disclosure;
Fig. 8 is a structural block diagram of a positioning apparatus provided by an embodiment of the present disclosure;
Fig. 9 is a schematic structural diagram of an electronic device suitable for executing the training method for a target map model and/or the positioning method provided by an embodiment of the present disclosure.
Detailed Description of Embodiments
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted from the following description for clarity and conciseness. It should be noted that, in the case of no conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of the user personal information involved are all in compliance with the relevant laws and regulations, and do not violate public order and good customs.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the training method and positioning method for a target map model, the corresponding apparatuses, the electronic device, and the computer-readable storage medium of the present disclosure can be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various applications for realizing information communication between the terminal devices 101, 102, 103 and the server 105 may be installed on them, such as map retrieval model training applications, map retrieval applications, and positioning applications.
The terminal devices 101, 102, 103 and the server 105 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices with a display screen, including but not limited to smartphones, tablet computers, laptop computers, desktop computers, and the like; when the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not specifically limited here. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of multiple servers, or as a single server; when the server is software, it may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which is not specifically limited here.
The server 105 can provide various services through the various applications built into it. Taking a positioning application that can provide a positioning service for users as an example, the server 105 can achieve the following effects when running the positioning application: first, receiving, through the network 104, a positioning image obtained by photographing the signboard of a target building and the current position of the terminal devices 101, 102, 103, as transmitted by the terminal devices 101, 102, 103; then, determining an actual coordinate vector expression according to the current position and an actual signboard image expression according to the positioning image; next, calling the target map model to determine a shooting position text expression corresponding to the actual coordinate vector expression; then, calling the target map model to determine a candidate text expression sequence corresponding to the actual signboard image expression; next, adjusting the presentation priority ranking of each candidate text expression in the sequence based on the distance between the shooting position text expression and each candidate text expression in the candidate text expression sequence; and finally, returning the candidate text expression sequence after the presentation priority adjustment to the terminal devices 101, 102, 103 through the network 104, so that the user can locate the actual position of the target building according to the result presented by the terminal devices 101, 102, 103.
The target map model can be trained by a map retrieval model training application built into the server 105 according to the following steps: acquiring a text expression, a coordinate vector expression, and a signboard image expression of each map location point; training a first sub-model according to a first training sample composed of the text expression and the coordinate vector expression that correspond to the same map location point; training a second sub-model according to a second training sample composed of the text expression and the signboard image expression that correspond to the same map location point; and fusing the first sub-model and the second sub-model to obtain the target map model.
Since training the target map model requires considerable computing resources and strong computing power, the training method for a target map model provided by the subsequent embodiments of the present disclosure is generally executed by the server 105, which has strong computing power and abundant computing resources; correspondingly, the training apparatus for the target map model is generally also set in the server 105. However, it should also be pointed out that, when the terminal devices 101, 102, 103 also have the required computing power and computing resources, they may likewise complete, through the target map model training applications installed on them, the computations otherwise assigned to the server 105, and then output the same results as the server 105. Correspondingly, the training apparatus for the target map model may also be set in the terminal devices 101, 102, 103. In this case, the exemplary system architecture 100 may also exclude the server 105 and the network 104.
Of course, the server used to train the target map model may be different from the server that calls the trained target map model for use. In particular, a lightweight target map model suitable for being placed in the terminal devices 101, 102, 103 may also be obtained from the target map model trained by the server 105 through model distillation; that is, according to the recognition accuracy actually required, one may flexibly choose between using the lightweight target map model in the terminal devices 101, 102, 103 and using the more complex target map model in the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
Referring to Fig. 2, Fig. 2 is a flowchart of a training method for a target map model provided by an embodiment of the present disclosure, in which the process 200 includes the following steps:
Step 201: acquiring a text expression, a coordinate vector expression, and a signboard image expression of each map location point.
This step aims at acquiring, by the execution subject of the training method for the target map model (for example, the server 105 shown in Fig. 1), the text expression, the coordinate vector expression, and the signboard image expression of each map location point. The text expression is the map location point described in text form, for example, "XX Building (tower, plaza, hospital, restaurant, etc.), No. XX, XX Road, XX City"; the coordinate vector expression is a vectorized expression of the real-world geographic coordinates of the object (usually a building) corresponding to the map location point; and the signboard image expression is an image description of the signboard of the object (usually a building) at the map location point, i.e., the signboard image description reflects the signboard style customized by the party to which the building belongs and thus embodies the strong association between the two.
The coordinate vector expression may convert coordinates of various forms into a vector expression by various vectorized encoding methods, for example, converting the boundary coordinate sequence of the building corresponding to the map location point into a vector expression. On this basis, the boundary coordinate sequence may be further processed, for example by introducing a geocoding algorithm to convert the boundary coordinate sequence into another representation, or by simply using the four-corner coordinates of a rectangle enclosing the building as the boundary coordinates and obtaining their vectorized expression by means such as a hash algorithm. No specific limitation is imposed here, and a suitable processing method may be selected according to the actual application scenario.
The signboard image expression is an expression capable of reflecting the image features of the signboard, obtained on the basis of an image taken of the signboard, for example by setting the shooting resolution, character definition, character image extraction, processing method, and so on.
Step 202: training a first sub-model according to a first training sample composed of the text expression and the coordinate vector expression that correspond to the same map location point.
On the basis of step 201, this step aims at taking, by the above execution subject, the text expression corresponding to the same map location point as the sample input and the coordinate vector expression corresponding to the same map location point as the sample output, and training the first sub-model with the first training sample thus constructed, so that the trained first sub-model can establish the correspondence between the text expression and the coordinate vector expression of the same map location point, and output information corresponding to input information can subsequently be matched according to this correspondence; for example, inputting the coordinate vector of the current position matches the text description of the current position.
Step 203: training a second sub-model according to a second training sample composed of the text expression and the signboard image expression that correspond to the same map location point.
On the basis of step 201, this step aims at taking, by the above execution subject, the text expression corresponding to the same map location point as the sample input and the signboard image expression corresponding to the same map location point as the sample output, and training the second sub-model with the second training sample thus constructed, so that the trained second sub-model can establish the correspondence between the text expression and the signboard image expression of the same map location point, and output information corresponding to input information can subsequently be matched according to this correspondence; for example, inputting an image taken of a building's signboard matches the text description of the corresponding building.
Step 204: fusing the first sub-model and the second sub-model to obtain the target map model.
On the basis of steps 202 and 203, this step aims at fusing, by the above execution subject, the first sub-model and the second sub-model, taking the text expression in the first sub-model and in the second sub-model as the connection point, so as to establish the correspondence among the text expression, the coordinate vector expression, and the signboard image expression of the same map location point (in the form of a correspondence among A, B, and C), so that the finally fused target map model can accurately determine the remaining one of the three according to one or two of them.
It should be noted that the training processes of the first sub-model and the second sub-model are carried out independently, but fusion does not mean that the fused initial map model no longer needs training; a few more rounds of training are often still required so that the overall parameters of the fused map model reach an overall optimum. For example, the first sub-model and the second sub-model are first fused to obtain an initial map model; then the parameters of the initial map model are adjusted until a preset iteration exit condition is met, and the initial map model meeting the iteration exit condition is output as the target map model. Specifically, the iteration exit condition set for the fused map model at this point is often different from the iteration exit conditions set for the first sub-model and the second sub-model, unless the iteration exit condition set in certain scenarios is a general-purpose condition such as whether the accuracy difference between adjacent iteration results meets the requirement.
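As a minimal sketch of this two-stage procedure (a PyTorch-style loop and illustrative dimensions are assumed; the disclosure does not prescribe a concrete network structure), the two sub-models can be trained separately and the fused model then fine-tuned until an iteration exit condition is met, here the loss change between adjacent iterations falling below a tolerance:

```python
import torch
from torch import nn, optim

def train_until_exit(model, batches, loss_fn, lr=1e-4, tol=1e-4):
    """Adjust parameters until the iteration exit condition is met (loss change < tol)."""
    opt = optim.Adam(model.parameters(), lr=lr)
    prev_loss = float("inf")
    for inputs, targets in batches:
        loss = loss_fn(model(inputs), targets)
        opt.zero_grad()
        loss.backward()
        opt.step()
        if abs(prev_loss - loss.item()) < tol:  # iteration exit condition
            break
        prev_loss = loss.item()
    return model

# First sub-model: text expression -> coordinate vector expression (illustrative shapes).
sub_model_1 = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 64))
# Second sub-model: text expression -> signboard image expression.
sub_model_2 = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 128))

class InitialMapModel(nn.Module):
    """Fusion of the two sub-models via their shared text-expression input."""
    def __init__(self, coord_branch, image_branch):
        super().__init__()
        self.coord_branch = coord_branch
        self.image_branch = image_branch

    def forward(self, text_vec):
        return self.coord_branch(text_vec), self.image_branch(text_vec)

initial_map_model = InitialMapModel(sub_model_1, sub_model_2)
# A few more rounds of training on fused samples (hypothetical `fused_batches`/`fused_loss`)
# would then yield the target map model:
# target_map_model = train_until_exit(initial_map_model, fused_batches, fused_loss)
```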
Specifically, the first sub-model, the second sub-model, and the fused map model in this embodiment may be implemented with various model frameworks, for example based on the BERT (Bidirectional Encoder Representations from Transformers) model commonly used in the field of natural language processing; other models with similar effects may also be used and are not enumerated here.
In the training method for a target map model provided by this embodiment of the present disclosure, training is not only based on the text expressions of map location points but additionally introduces coordinate vector expressions and signboard image expressions, so that the model pre-trained across multiple dimensions makes full use of the spatiotemporal big data of the map field and the information contained in the pre-trained model becomes more closely related to the real world; in actual applications, the user's current position and the captured positioning image can therefore be better combined to obtain a more accurate positioning result.
Referring to Fig. 3, Fig. 3 is a flowchart of a method for acquiring a coordinate vector expression of a map location point provided by an embodiment of the present disclosure; that is, it provides a specific implementation of the coordinate vector expression in step 201 of the process 200 shown in Fig. 2. The other steps in the process 200 are not adjusted, and a new complete embodiment is obtained by substituting the specific implementation provided by this embodiment for the corresponding step. The process 300 includes the following steps:
Step 301: acquiring the boundary coordinate sequence of each map location point.
Taking buildings as an example, the coordinate sequence of the outer contours of all buildings belonging to the map location point is the boundary coordinate sequence.
Taking a hospital composed of five buildings as an example, the geographic coordinate sequence of the outer contours of the three outermost buildings is the boundary coordinate sequence. The frequency or interval at which points are sampled from the continuous outer contour may be set as needed.
Step 302: calculating, by using a geocoding algorithm and the boundary coordinate sequence, a geocode set covering the geographic block where the corresponding map location point is located.
On the basis of step 301, this step aims at calculating, by the above execution subject using a geocoding algorithm and the boundary coordinate sequence, the geocode set covering the geographic block where the corresponding map location point is located; that is, each geocode in the geocode set corresponds to one boundary coordinate in the boundary coordinate sequence.
Specifically, the geocoding algorithm may be the Geohash algorithm, the Google S2 algorithm, and so on. The Geohash algorithm is an address encoding method that encodes two-dimensional spatial latitude-longitude data into a character string, while the Google S2 algorithm takes its name from the mathematical symbol S² in geometry, which denotes the unit sphere; the S2 algorithm is designed to solve various geometric problems on a sphere, and since the real world is in fact a sphere, it can also be used as an address encoding algorithm.
Step 303: converting the geocode set containing each geocode into a geographic character string.
On the basis of step 302, this step aims at converting, by the above execution subject, the geocode set containing each geocode into a geographic character string, for example by converting the geocodes in the geocode set into a tree structure according to their hierarchy and traversing it in a fixed order, thereby converting it into a geographic character string representing the geographic area where the map location point is located.
Step 304: converting the geographic character string into a geographic vector, and taking the geographic vector as the coordinate vector expression of the corresponding map location point.
On the basis of step 303, this step aims at converting, by the above execution subject, the geographic character string into a geographic vector and taking the geographic vector as the coordinate vector expression of the corresponding map location point.
Given the geographic character string, the conversion rule between the geographic character string and the geographic vector may be defined freely, or a model capable of producing results in vector form may be used, for example by inputting the geographic character string into a preset vector expression conversion model, where the vector expression conversion model is used to characterize the correspondence between geographic character strings and geographic vectors (for example a convolutional neural network or a recurrent neural network), and then receiving the geographic vector output by the vector expression conversion model.
According to the above method, a vectorized vocabulary of geographic blocks can be constructed at the levels of country, province, city, district, county, and road. In the prediction stage of the pre-trained model, each geographic entity is assigned its corresponding geographic block vector.
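A minimal sketch of this pipeline is given below. The `geohash2` package is assumed for the geocoding step (the Google S2 library could be used instead), and a simple hashing trick stands in for the vector expression conversion model; a convolutional or recurrent network, as mentioned above, could replace it.

```python
import hashlib
import numpy as np
import geohash2  # assumed geocoding library; geohash2.encode(lat, lng, precision)

def boundary_to_geocode_set(boundary_coords, precision=7):
    """Step 302: map each boundary coordinate of the location point to a geohash cell."""
    return {geohash2.encode(lat, lng, precision) for lat, lng in boundary_coords}

def geocode_set_to_string(cells):
    """Step 303: serialize the cell set into one geographic string (fixed traversal order)."""
    return "|".join(sorted(cells))

def string_to_vector(geo_string, dim=64):
    """Step 304: hashing-trick stand-in for the vector expression conversion model."""
    vec = np.zeros(dim, dtype=np.float32)
    for token in geo_string.split("|"):
        idx = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[idx] += 1.0
    return vec / max(np.linalg.norm(vec), 1e-8)

# Example: a few (lat, lng) points sampled from the outer contour of a building compound.
boundary = [(31.2304, 121.4737), (31.2306, 121.4741), (31.2301, 121.4743)]
coordinate_vector_expression = string_to_vector(
    geocode_set_to_string(boundary_to_geocode_set(boundary)))
```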
Besides the above way of obtaining the coordinate vector expression of a map location point given in this embodiment, certain steps of this embodiment may also be improved or adjusted according to the actual needs of the application scenario, to obtain other implementations that differ from this embodiment but better suit those needs.
Referring to Fig. 4, Fig. 4 is a flowchart of a method for acquiring a signboard image expression of a map location point provided by an embodiment of the present disclosure; that is, it provides a specific implementation of the signboard image expression in step 201 of the process 200 shown in Fig. 2. The other steps in the process 200 are not adjusted, and a new complete embodiment is obtained by substituting the specific implementation provided by this embodiment for the corresponding step. The process 400 includes the following steps:
Step 401: acquiring the signboard image of the building corresponding to each map location point.
This step aims at first acquiring, by the above execution subject, the signboard image of the building corresponding to the map location point. When photographing different signboards, the shooting conditions, such as the shooting device, lighting, angle, resolution, and weather, should be kept as consistent as possible to avoid differences among different signboard images.
Step 402: recognizing the character part from the signboard image, and cutting out a character image corresponding to each character of the character part.
Step 403: arranging the character images according to the order of the characters in the character part, and taking the resulting character image queue as the signboard image expression of the corresponding map location point.
On the basis of step 401, step 402 aims at recognizing, by the above execution subject, the character part from the signboard image and cutting out the character image corresponding to each character of the character part; step 403 then arranges the character images in the correct order to obtain the signboard image expression used to describe the image features of the signboard.
It should be understood that, besides the implementation given in this embodiment of taking the character image queue as the signboard image expression, there are various other implementations, for example, directly performing image processing on the signboard image, such as erosion or engraving, that highlights the character image features of the signboard, and directly taking the processed image as the signboard image expression. The reason this embodiment chooses to cut out a character image for each character is to correspond as closely as possible to the text expression of the map location point and to construct the correspondence between the character expression and the signboard image expression of the same character, thereby strengthening the association between the two.
Further, to improve the character recognition effect, image anomaly recognition (which may include at least one of blur recognition, noise recognition, and skew recognition) may also be performed on the signboard image before the character part is recognized from it, so that the character part is recognized only from signboard images identified as non-abnormal images. Alternatively, a signboard image identified as an abnormal image may first undergo de-anomaly processing before an attempt is made to recognize the characters it contains.
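The following sketch illustrates one way steps 401-403 and the anomaly check could be realized. It assumes OpenCV is available and that `detect_chars` is some single-character text detector returning bounding boxes; the disclosure does not fix a particular detector, and the blur check is only one possible instance of image anomaly recognition.

```python
import cv2
import numpy as np

def is_abnormal(image_bgr, blur_threshold=100.0):
    """One possible anomaly check: low Laplacian variance indicates a blurred signboard photo."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var() < blur_threshold

def signboard_to_char_queue(image_bgr, detect_chars):
    """Steps 402-403: cut out one sub-image per detected character, ordered in reading order.

    `detect_chars` is an assumed single-character detector returning (x, y, w, h) boxes.
    """
    if is_abnormal(image_bgr):
        return []  # only non-abnormal images are used for character recognition
    boxes = sorted(detect_chars(image_bgr), key=lambda b: b[0])  # left-to-right order
    return [image_bgr[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```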
The above embodiments explain from various aspects how the target map model is trained. In order to highlight, from an actual usage scenario, the effect achieved by the trained target map model, the present disclosure further provides a solution that uses the trained target map model to solve a practical problem. A positioning method may refer to the steps included in the process 500:
Step 501: acquiring a positioning image obtained by photographing the signboard of a target building, and the current position.
This step aims at acquiring, by the above execution subject, the positioning image obtained by photographing the signboard of the target building and the current position. The target building is necessarily a building within the user's field of view, i.e., it is located in the vicinity of the user who initiated the positioning request, while the current position is the geographic coordinates returned by the positioning component (for example a GPS component or a base-station interaction component) inside the device held by the user that captured the positioning image.
Step 502: determining an actual coordinate vector expression according to the current position, and determining an actual signboard image expression according to the positioning image.
On the basis of step 501, this step aims at determining, by the above execution subject, the actual coordinate vector expression according to the current position and the actual signboard image expression according to the positioning image, i.e., converting the coordinates into the corresponding vector expression and the image into the corresponding image expression.
Step 503: calling the target map model to determine a shooting position text expression corresponding to the actual coordinate vector expression.
On the basis of step 502, this step aims at calling, by the above execution subject, the correspondence between coordinate vector expressions and text expressions recorded by the target map model, to determine the shooting position text expression corresponding to the actual coordinate vector expression.
Step 504: calling the target map model to determine a candidate text expression sequence corresponding to the actual signboard image expression.
On the basis of step 502, this step calls, by the above execution subject, the correspondence between signboard image expressions and text expressions recorded by the target map model, to determine the candidate text expression sequence corresponding to the actual signboard image expression (it is a sequence of candidate text expressions because the actual signboard image, owing to the photographer's shooting conditions or other influencing factors, does not necessarily contain complete image information, so in most cases multiple candidate text expressions are produced).
Step 505: adjusting the presentation priority ranking of each candidate text expression in the sequence based on the distance between the shooting position text expression and each candidate text expression in the candidate text expression sequence.
On the basis of steps 503 and 504, this step aims at adjusting the presentation priority ranking of each candidate text expression in the sequence based on the distance between the shooting position text expression and each candidate text expression in the candidate text expression sequence. That is, the larger the distance, the lower the presentation priority of the corresponding candidate text expression in the sequence (for example, the further back it is ranked); conversely, the smaller the distance, the higher the presentation priority of the corresponding candidate text expression in the sequence (for example, the further forward it is ranked).
Step 506: locating the actual position of the target building based on the candidate text expression sequence after the presentation priority adjustment.
On the basis of step 505, this step aims at locating the actual position of the target building based on the candidate text expression sequence after the presentation priority adjustment.
That is, by presenting the adjusted candidate text expression sequence to the user, the actual position of the target building to be located is determined more quickly and more accurately.
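A sketch of the re-ranking in steps 505 and 506 is shown below. It assumes each candidate text expression carries an associated location and that the distance used is a geographic one; the disclosure only requires some distance between the shooting position text expression and each candidate, so this is one possible instantiation.

```python
from dataclasses import dataclass
from math import hypot

@dataclass
class Candidate:
    text_expression: str  # e.g. "XXX Eye Hospital, No. 18 Chuanhui Road"
    location: tuple       # (lat, lng) associated with this candidate text expression

def distance(a, b):
    """Planar stand-in; a haversine distance would be more accurate in practice."""
    return hypot(a[0] - b[0], a[1] - b[1])

def rerank(candidates, shooting_position):
    """Smaller distance to the shooting position -> higher presentation priority."""
    return sorted(candidates, key=lambda c: distance(c.location, shooting_position))

ranked = rerank(
    [Candidate("A Hospital (east branch)", (31.40, 121.60)),
     Candidate("A Hospital", (31.2306, 121.4741))],
    shooting_position=(31.2305, 121.4739),
)
# ranked[0] is presented first and indicates the most likely actual position of the target building.
```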
To deepen understanding, an embodiment of the present disclosure further provides a model pre-training method based on the guiding idea of multimodal geographic knowledge enhancement.
Multimodal geographic knowledge enhancement refers to a model that explicitly learns non-general text knowledge during pre-training, by improving the model structure or adding pre-training tasks. Specifically, the map usage scenario targeted by this embodiment simultaneously utilizes geographic domain data in three modalities, namely text, geographic coordinates, and signboard images, and, during the pre-training stage, integrates geographic knowledge fully associated with the real world into the pre-trained model through multiple tasks and changes to the model structure. The main parts of this model pre-training method include: integrating geographic coordinate information into the model, and multimodal geographic information fusion learning.
1. Integrating geographic coordinate information into the model
As input for model training, most texts representing geographic entities can be accurately associated with the real geographic blocks to which they correspond in the real world. Therefore, on top of the existing model (taking the pre-training architecture BERT as an example, which converts each character of the received plain text into the superposition of a token embedding, a segment embedding, and a position embedding, feeds the result into subsequent semantic representation layers such as transformers for context modeling, and finally uses the vectors produced by the semantic representation layers to train on pre-training tasks such as the masked language model), a geographic coordinate vector (GEO embedding) is additionally added at the character representation layer and fused with the token embedding, segment embedding, and position embedding.
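A sketch of this character representation layer is given below (PyTorch assumed; vocabulary sizes and dimensions are illustrative): each character's input representation is the sum of its token, segment, position, and GEO embeddings, which is then fed to the transformer layers.

```python
import torch
from torch import nn

class GeoAwareEmbedding(nn.Module):
    """Token + segment + position + GEO embeddings, summed per character (illustrative sizes)."""
    def __init__(self, vocab=21128, geo_blocks=50000, dim=768, max_len=512):
        super().__init__()
        self.token = nn.Embedding(vocab, dim)
        self.segment = nn.Embedding(2, dim)
        self.position = nn.Embedding(max_len, dim)
        self.geo = nn.Embedding(geo_blocks, dim)  # one vector per geographic block
        self.norm = nn.LayerNorm(dim)

    def forward(self, token_ids, segment_ids, geo_block_ids):
        pos_ids = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        x = (self.token(token_ids) + self.segment(segment_ids)
             + self.position(pos_ids) + self.geo(geo_block_ids))
        return self.norm(x)  # fed into the subsequent transformer layers
```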
To fuse image features with text features, the input can be divided, as shown in Fig. 6, into a text sequence (the character sequence of "XXX Eye Hospital, No. 18 Chuanhui Road, Shanghai") and a picture sequence (the character image sequence of the "XXX Eye Hospital" signboard text). The text sequence is trained with the above improved model structure that additionally introduces the geographic coordinate vector; that is, the Embed step of the text branch represents the text feature sequence, with geographic location information fused in, produced before entering the transformer layers. The image sequence consists of the picture of each character recognized on the signboard image by a pre-trained single-character text detection model, and the Embed step of the picture branch represents the fusion of the geographic coordinate vector with the image feature sequence extracted from the picture sequence using an existing image feature extraction model (such as ResNet, a residual network). TRM denotes a transformer layer, and Co-TRM denotes the information interaction performed between the two different modalities.
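On the image side, the Embed step of the picture branch can be sketched as follows: each character image cut from the signboard is encoded with an image feature extraction backbone (ResNet-18 from torchvision is used here purely as an illustration) and the geographic block vector is added, producing the feature sequence that enters the Co-TRM layers.

```python
import torch
from torch import nn
from torchvision.models import resnet18  # example backbone; any image feature extractor would do

class SignboardImageStream(nn.Module):
    """Encode each character image and add the geographic block vector (illustrative)."""
    def __init__(self, geo_blocks=50000, dim=768):
        super().__init__()
        backbone = resnet18(weights=None)  # no pretrained weights (torchvision >= 0.13 API)
        backbone.fc = nn.Linear(backbone.fc.in_features, dim)
        self.backbone = backbone
        self.geo = nn.Embedding(geo_blocks, dim)

    def forward(self, char_images, geo_block_id):
        # char_images: (seq_len, 3, H, W) - one cropped image per signboard character
        feats = self.backbone(char_images)                  # (seq_len, dim)
        return feats + self.geo(geo_block_id).unsqueeze(0)  # broadcast the geo vector
```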
2. Multimodal geographic information fusion learning
After the picture and text representations are obtained, the model is pre-trained with two tasks:
1) Signboard picture masking task: overall, 15% of the input text sequence and image regions are masked, and the model is asked to predict the masked parts given the remaining input. For the text sequence, the masking follows the classic MLM (Masked Language Model) approach. For the image sequence, 90% of a selected image region's features are set to zero and 10% are left unchanged. The character recognition probability distribution obtained by optical character recognition is used as the label of the image, the model is asked to predict the same distribution, and finally the KL divergence (relative entropy) between the two distributions is used as the supervision signal to train the image side.
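The image-side part of this masking task can be sketched as below: regions are selected and mostly zeroed according to the stated ratios, and the KL divergence between the OCR-derived character distribution (the label) and the model's predicted distribution on the selected positions supervises the image side. The tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def mask_image_regions(feats, select_prob=0.15, zero_prob=0.9):
    """Select ~15% of the regions; zero 90% of the selected regions' features, keep 10%."""
    selected = torch.rand(feats.size(0)) < select_prob
    zeroed = selected & (torch.rand(feats.size(0)) < zero_prob)
    masked = feats.clone()
    masked[zeroed] = 0.0
    return masked, selected

def masked_image_loss(pred_logits, ocr_probs, selected):
    """KL divergence between the OCR label distribution and the model's prediction,
    computed only on the selected (masked) positions."""
    log_pred = F.log_softmax(pred_logits[selected], dim=-1)
    return F.kl_div(log_pred, ocr_probs[selected], reduction="batchmean")
```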
2) Signboard text matching task: given a text sequence and a signboard picture sequence, predict whether the description in the text is consistent with what is expressed in the signboard picture.
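A minimal head for this matching task might look as follows, where `fused_cls` is assumed to be the pooled cross-modal vector produced by the Co-TRM layers:

```python
import torch
from torch import nn

class SignboardTextMatchHead(nn.Module):
    """Predict whether the text sequence and the signboard picture sequence describe the same entity."""
    def __init__(self, dim=768):
        super().__init__()
        self.classifier = nn.Linear(dim, 2)  # two classes: match / no match

    def forward(self, fused_cls, labels=None):
        logits = self.classifier(fused_cls)
        if labels is None:
            return logits
        return nn.functional.cross_entropy(logits, labels)
```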
即本实施例通过同时利用了文本、地理坐标、招牌图像三个模态的地理领域数据,在模型的预训练阶段通过多任务,改变模型结构的方式,将与现实世界充分关联的地理知识融入预训练模型,为下游任务建模了更完整的时空语义,以实现提升地图产品中搜索等多种相关功能的效果。That is to say, this embodiment utilizes the geographical field data of three modalities of text, geographic coordinates, and signboard images at the same time, and through multi-task in the pre-training stage of the model, changes the way of the model structure, and integrates the geographical knowledge fully related to the real world into the The pre-trained model models more complete spatio-temporal semantics for downstream tasks, so as to achieve the effect of improving various related functions such as search in map products.
进一步参考图7和图8,作为对上述各图所示方法的实现,本公开分别提供了一种目标地图模型的训练装置实施例和一种定位装置的实施例,目标地图模型的训练装置实施例与图2所示的目标地图模型的训练方法实施例相对应,定位装置实施例与定位方法实施例相对应。上述装置具体可以应用于各种电子设备中。Further referring to FIG. 7 and FIG. 8, as the realization of the methods shown in the above figures, the present disclosure provides an embodiment of a training device for a target map model and an embodiment of a positioning device respectively, and the training device for a target map model implements The example corresponds to the embodiment of the training method for the target map model shown in FIG. 2 , and the embodiment of the positioning device corresponds to the embodiment of the positioning method. The above device can be specifically applied to various electronic devices.
如图7所示,本实施例的目标地图模型的训练装置700可以包括:参数获取单元701、第一子模型训练单元702、第二子模型训练单元703、子模型融合单元704。其中,参数获取单元701,被配置成获取各地图位置点的文本表达、坐标向量表达、招牌图像表达;第一子模型训练单元 702,被配置成根据由对应相同地图位置点的文本表达和坐标向量表达构成的第一训练样本,训练得到第一子模型;第二子模型训练单元703,被配置成根据由对应相同地图位置点的文本表达和招牌图像表达构成的第二训练样本,训练得到第二子模型;子模型融合单元704,被配置成融合第一子模型和第二子模型,得到目标地图模型。As shown in FIG. 7 , the target map model training apparatus 700 of this embodiment may include: a parameter acquisition unit 701 , a first sub-model training unit 702 , a second sub-model training unit 703 , and a sub-model fusion unit 704 . Among them, the parameter acquisition unit 701 is configured to acquire the text expression, coordinate vector expression, and signboard image expression of each map location point; the first sub-model training unit 702 is configured to obtain the text expression and coordinates corresponding to the same map location point The first training sample composed of vector expressions is trained to obtain the first sub-model; the second sub-model training unit 703 is configured to train the second training sample composed of text expressions and signboard image expressions corresponding to the same map location points to obtain The second sub-model; the sub-model fusion unit 704 is configured to fuse the first sub-model and the second sub-model to obtain the target map model.
在本实施例中,目标地图模型的训练装置700中:参数获取单元701、第一子模型训练单元702、第二子模型训练单元703、子模型融合单元704的具体处理及其所带来的技术效果可分别参考图2对应实施例中的步骤201-204的相关说明,在此不再赘述。In this embodiment, in the target map model training device 700: the specific processing of the parameter acquisition unit 701, the first sub-model training unit 702, the second sub-model training unit 703, and the sub-model fusion unit 704 and the resulting For the technical effects, reference may be made to the related descriptions of steps 201-204 in the embodiment corresponding to FIG. 2 , which will not be repeated here.
在本实施例的一些可选的实现方式中,参数获取单元701可以包括被配置成获取各地图位置点的坐标向量表达的坐标向量表达获取子单元,坐标向量表达获取子单元可以包括:In some optional implementations of this embodiment, the parameter acquisition unit 701 may include a coordinate vector expression acquisition subunit configured to acquire the coordinate vector expression of each map location point, and the coordinate vector expression acquisition subunit may include:
边界坐标序列获取模块,被配置成分别获取各地图位置点的边界坐标序列;The boundary coordinate sequence acquisition module is configured to respectively acquire the boundary coordinate sequences of each map location point;
地理编码集合计算模块,被配置成利用地理编码算法和边界坐标序列,计算得到覆盖相应地图位置点所在地理区块的地理编码集合;The geocoding set calculation module is configured to use the geocoding algorithm and the boundary coordinate sequence to calculate the geocoding set covering the geographic block where the corresponding map location point is located;
地理字符串转换模块,被配置成将包含各地理编码的地理编码集合转换为地理字符串;a geographic string conversion module configured to convert a geocode set containing each geocode into a geographic string;
地理向量转换模块,被配置成将地理字符串转换为地理向量,并将地理向量作为相应地图位置点的坐标向量表达。The geographic vector conversion module is configured to convert the geographic character string into a geographic vector, and express the geographic vector as a coordinate vector of a corresponding map location point.
在本实施例的一些可选的实现方式中,地理向量转换模块可以被进一步配置成:In some optional implementations of this embodiment, the geographic vector conversion module may be further configured to:
将地理字符串输入预设的向量表达转换模型;其中,向量表达转换模型用于表征地理字符串与地理向量之间的对应关系;Inputting geographic character strings into a preset vector expression transformation model; wherein, the vector expression transformation model is used to characterize the correspondence between geographic character strings and geographic vectors;
接收向量表达转换模型输出的地理向量。Receives a vector representing the geographic vector output by the transformed model.
In some optional implementations of this embodiment, the parameter acquisition unit 701 may include a signboard image expression acquisition subunit configured to acquire the signboard image expression of each map location point, and the signboard image expression acquisition subunit may include:
a signboard image acquisition module, configured to acquire a signboard image of the building corresponding to each map location point;
a character recognition and character image cutting module, configured to recognize a character part from the signboard image and cut out a character image corresponding to each character of the character part; and
a character image sorting module, configured to arrange the character images in the order of the characters in the character part and take the resulting character image queue as the signboard image expression of the corresponding map location point.
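A minimal sketch of the cutting-and-ordering step described for these modules follows, assuming a character-level OCR stage has already produced (character, bounding box) detections; the detection format and the left-to-right ordering rule are illustrative assumptions rather than the disclosed implementation.

```python
import numpy as np
from typing import List, Tuple

# Each detection: (character, (x, y, width, height)) in pixel coordinates of the signboard image.
Detection = Tuple[str, Tuple[int, int, int, int]]

def signboard_image_expression(image: np.ndarray, detections: List[Detection]) -> List[np.ndarray]:
    """Cut one sub-image per recognized character and return them in reading order."""
    # Sort by the x coordinate of each box so the queue follows the character order on the signboard.
    ordered = sorted(detections, key=lambda det: det[1][0])
    char_images = []
    for _, (x, y, w, h) in ordered:
        char_images.append(image[y:y + h, x:x + w].copy())
    return char_images

# Example with a dummy image and two detections.
dummy = np.zeros((100, 300, 3), dtype=np.uint8)
queue = signboard_image_expression(dummy, [("店", (120, 10, 50, 80)), ("书", (20, 10, 50, 80))])
print([img.shape for img in queue])
```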
In some optional implementations of this embodiment, the signboard image expression acquisition subunit may further include:
an anomaly recognition module, configured to perform image anomaly recognition on the signboard image before the character part is recognized from the signboard image, where the image anomaly recognition includes at least one of blur recognition, noise recognition, and skew recognition.
Correspondingly, the character recognition sub-module in the character recognition and character image cutting module may be further configured to:
recognize the character part only from signboard images recognized as non-anomalous images.
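As one concrete example of such image anomaly recognition, blur can be screened with the variance of the Laplacian before any character recognition is attempted. The sketch below assumes OpenCV is available; the threshold value is an assumption, and noise or skew checks would be added analogously.

```python
import cv2
import numpy as np

def is_non_anomalous(image: np.ndarray, blur_threshold: float = 100.0) -> bool:
    """Return True if the signboard image passes a simple blur check.

    The variance of the Laplacian is a common sharpness measure: low variance
    suggests a blurred image.  Noise and skew recognition could be added as
    further checks before the image is passed on to character recognition.
    """
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    sharpness = cv2.Laplacian(gray, cv2.CV_64F).var()
    return sharpness >= blur_threshold

# Only non-anomalous images would be passed on to character recognition.
image = np.random.randint(0, 255, (200, 400, 3), dtype=np.uint8)
if is_non_anomalous(image):
    print("pass to character recognition")
```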
In some optional implementations of this embodiment, the sub-model fusion unit 704 may be further configured to:
fuse the first sub-model and the second sub-model to obtain an initial map model; and
adjust parameters of the initial map model until a preset iteration exit condition is met, and output the initial map model that meets the iteration exit condition as the target map model.
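The sketch below is one plausible reading of this fuse-then-refine step, assuming the two sub-models are PyTorch encoders whose outputs are concatenated and passed through a joint head, with parameters adjusted until a loss threshold or step budget (standing in for the preset iteration exit condition) is reached. The architecture, loss, optimizer, and exit condition are illustrative assumptions, not the disclosed training procedure.

```python
import torch
import torch.nn as nn

class FusedMapModel(nn.Module):
    """Initial map model obtained by fusing two sub-models via output concatenation."""

    def __init__(self, text_coord_model: nn.Module, text_sign_model: nn.Module, dim: int = 128):
        super().__init__()
        self.text_coord_model = text_coord_model   # first sub-model (text + coordinate vector)
        self.text_sign_model = text_sign_model     # second sub-model (text + signboard image)
        self.joint_head = nn.Linear(2 * dim, dim)

    def forward(self, text_coord_feat, text_sign_feat):
        fused = torch.cat([self.text_coord_model(text_coord_feat),
                           self.text_sign_model(text_sign_feat)], dim=-1)
        return self.joint_head(fused)

def refine(model: FusedMapModel, batches, loss_fn, exit_loss: float = 0.01, max_steps: int = 1000):
    """Adjust parameters until an exit condition (loss threshold or step budget) is met."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    for step, (tc, ts, target) in enumerate(batches):
        loss = loss_fn(model(tc, ts), target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < exit_loss or step >= max_steps:
            break
    return model  # output as the target map model

# Illustrative usage with dummy linear sub-models and random data.
sub1, sub2 = nn.Linear(16, 128), nn.Linear(16, 128)
batches = [(torch.randn(4, 16), torch.randn(4, 16), torch.randn(4, 128)) for _ in range(3)]
target_map_model = refine(FusedMapModel(sub1, sub2), batches, nn.MSELoss())
```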
As shown in FIG. 8, the positioning apparatus 800 of this embodiment may include: a positioning image and current position acquisition unit 801, an actual coordinate vector expression and actual signboard image expression determination unit 802, a shooting position text expression determination unit 803, a candidate text expression sequence determination unit 804, a presentation priority ranking adjustment unit 805, and an actual position determination unit 806. The positioning image and current position acquisition unit 801 is configured to acquire a positioning image obtained by photographing the signboard of a target building, and a current position; the actual coordinate vector expression and actual signboard image expression determination unit 802 is configured to determine an actual coordinate vector expression according to the current position and determine an actual signboard image expression according to the positioning image; the shooting position text expression determination unit 803 is configured to call a target map model to determine a shooting position text expression corresponding to the actual coordinate vector expression; the candidate text expression sequence determination unit 804 is configured to call the target map model to determine a candidate text expression sequence corresponding to the actual signboard image expression; the presentation priority ranking adjustment unit 805 is configured to adjust the presentation priority ranking of each candidate text expression in the sequence based on the distance between the shooting position text expression and each candidate text expression in the candidate text expression sequence; and the actual position determination unit 806 is configured to locate the actual position of the target building based on the candidate text expression sequence whose presentation priority has been adjusted. The target map model is obtained by the apparatus 700 for training a target map model.
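To make the priority-adjustment step of unit 805 concrete, the sketch below reranks candidate text expressions by their distance to the shooting position, assuming each candidate can be resolved to coordinates; whether the distance is geographic (as here, via the haversine formula) or computed in some embedding space, as well as the resolver, are assumptions for illustration.

```python
import math
from typing import Dict, List, Tuple

def haversine_km(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = math.sin((lat2 - lat1) / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def rerank_candidates(shooting_position: Tuple[float, float],
                      candidates: List[str],
                      coords_of: Dict[str, Tuple[float, float]]) -> List[str]:
    """Adjust presentation priority: the nearer a candidate is to the shooting position, the earlier it appears."""
    return sorted(candidates, key=lambda name: haversine_km(shooting_position, coords_of[name]))

coords_of = {"XX Bookstore (West Gate)": (39.9080, 116.3920), "XX Bookstore (East Branch)": (39.9500, 116.4600)}
ranked = rerank_candidates((39.9078, 116.3918), list(coords_of), coords_of)
print(ranked[0])  # the candidate closest to the shooting position is presented first
```

Ranking nearer candidates first reflects the rationale of the method: the building actually photographed is expected to lie close to where the user is standing.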
In this embodiment, for the specific processing of the positioning image and current position acquisition unit 801, the actual coordinate vector expression and actual signboard image expression determination unit 802, the shooting position text expression determination unit 803, the candidate text expression sequence determination unit 804, the presentation priority ranking adjustment unit 805, and the actual position determination unit 806 in the positioning apparatus 800, and for the technical effects brought about thereby, reference may be made to the corresponding descriptions in the method embodiments, which are not repeated here.
These embodiments exist as apparatus embodiments corresponding to the foregoing method embodiments. The apparatus for training a target map model and the positioning apparatus provided by these embodiments perform training not only on the text expressions of map location points but additionally introduce coordinate vector expressions and signboard image expressions. A model pre-trained across these multiple dimensions makes full use of the spatio-temporal big data of the map field, so that the information contained in the pre-trained model is more closely associated with the real world; in practical applications, it can therefore be better combined with the user's current position and the captured positioning image to produce more accurate positioning results.
According to an embodiment of the present disclosure, the present disclosure further provides an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to implement the method for training a target map model and/or the positioning method described in any of the foregoing embodiments.
According to an embodiment of the present disclosure, the present disclosure further provides a readable storage medium storing computer instructions, where the computer instructions are used to cause a computer to implement the method for training a target map model and/or the positioning method described in any of the foregoing embodiments.
An embodiment of the present disclosure provides a computer program product which, when executed by a processor, implements the method for training a target map model and/or the positioning method described in any of the foregoing embodiments.
FIG. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples and are not intended to limit the implementations of the present disclosure described and/or claimed herein.
As shown in FIG. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 902 or a computer program loaded from a storage unit 908 into a random access memory (RAM) 903. The RAM 903 may also store various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A plurality of components in the device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard or a mouse; an output unit 907, such as various types of displays and speakers; a storage unit 908, such as a magnetic disk or an optical disc; and a communication unit 909, such as a network card, a modem, or a wireless communication transceiver. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 901 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, and the like. The computing unit 901 performs the methods and processes described above, such as the method for training a target map model and/or the positioning method. For example, in some embodiments, the method for training a target map model and/or the positioning method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method for training a target map model and/or the positioning method described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured in any other appropriate manner (for example, by means of firmware) to perform the method for training a target map model and/or the positioning method.
Various implementations of the systems and techniques described above herein may be implemented in a digital electronic circuit system, an integrated circuit system, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), an application-specific standard product (ASSP), a system on chip (SOC), a complex programmable logic device (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, where the programmable processor may be a special-purpose or general-purpose programmable processor and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input apparatus, and at least one output apparatus.
Program codes for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatuses, so that when the program codes are executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program codes may be executed entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer that has: a display apparatus (for example, a CRT (cathode-ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or haptic feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a backend component (for example, as a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a frontend component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such backend, middleware, or frontend components. The components of the system may be interconnected by digital data communication (for example, a communication network) in any form or medium. Examples of the communication network include a local area network (LAN), a wide area network (WAN), and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The relationship between the client and the server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or a cloud host, which is a host product in the cloud computing service system and overcomes the defects of high management difficulty and weak service scalability in traditional physical host and virtual private server (VPS) services.
In the technical solutions of the embodiments of the present disclosure, training is based not only on the text expressions of map location points but additionally introduces coordinate vector expressions and signboard image expressions, so that a model pre-trained across multiple dimensions makes full use of the spatio-temporal big data of the map field and the information contained in the pre-trained model is more closely associated with the real world; in practical applications, it can therefore be better combined with the user's current position and the captured positioning image to produce more accurate positioning results.
It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, which is not limited herein.
The foregoing specific implementations do not constitute a limitation on the protection scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations, and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (17)

  1. A method for training a target map model, comprising:
    acquiring a text expression, a coordinate vector expression, and a signboard image expression of each map location point;
    training a first sub-model according to a first training sample composed of a text expression and a coordinate vector expression corresponding to the same map location point;
    training a second sub-model according to a second training sample composed of a text expression and a signboard image expression corresponding to the same map location point; and
    fusing the first sub-model and the second sub-model to obtain the target map model.
  2. The method according to claim 1, wherein acquiring the coordinate vector expression of each map location point comprises:
    acquiring a boundary coordinate sequence of each map location point;
    calculating, by using a geocoding algorithm and the boundary coordinate sequence, a geocode set covering the geographic block in which the corresponding map location point is located;
    converting the geocode set containing the geocodes into a geographic string; and
    converting the geographic string into a geographic vector, and taking the geographic vector as the coordinate vector expression of the corresponding map location point.
  3. The method according to claim 2, wherein converting the geographic string into a geographic vector comprises:
    inputting the geographic string into a preset vector expression conversion model, wherein the vector expression conversion model is used to characterize the correspondence between geographic strings and geographic vectors; and
    receiving the geographic vector output by the vector expression conversion model.
  4. The method according to claim 1, wherein acquiring the signboard image expression of each map location point comprises:
    acquiring a signboard image of the building corresponding to each map location point;
    recognizing a character part from the signboard image, and cutting out a character image corresponding to each character of the character part; and
    arranging the character images in the order of the characters in the character part, and taking the resulting character image queue as the signboard image expression of the corresponding map location point.
  5. The method according to claim 4, wherein, before recognizing the character part from the signboard image, the method further comprises:
    performing image anomaly recognition on the signboard image, wherein the image anomaly recognition comprises at least one of blur recognition, noise recognition, and skew recognition;
    correspondingly, recognizing the character part from the signboard image comprises:
    recognizing the character part only from signboard images recognized as non-anomalous images.
  6. The method according to any one of claims 1-5, wherein fusing the first sub-model and the second sub-model to obtain the target map model comprises:
    fusing the first sub-model and the second sub-model to obtain an initial map model; and
    adjusting parameters of the initial map model until a preset iteration exit condition is met, and outputting the initial map model that meets the iteration exit condition as the target map model.
  7. A positioning method, comprising:
    acquiring a positioning image obtained by photographing a signboard of a target building, and a current position;
    determining an actual coordinate vector expression according to the current position, and determining an actual signboard image expression according to the positioning image;
    calling a target map model to determine a shooting position text expression corresponding to the actual coordinate vector expression, wherein the target map model is obtained according to the method for training a target map model according to any one of claims 1-6;
    calling the target map model to determine a candidate text expression sequence corresponding to the actual signboard image expression;
    adjusting a presentation priority ranking of each candidate text expression in the sequence based on a distance between the shooting position text expression and each candidate text expression in the candidate text expression sequence; and
    locating an actual position of the target building based on the candidate text expression sequence whose presentation priority has been adjusted.
  8. An apparatus for training a target map model, comprising:
    a parameter acquisition unit, configured to acquire a text expression, a coordinate vector expression, and a signboard image expression of each map location point;
    a first sub-model training unit, configured to train a first sub-model according to a first training sample composed of a text expression and a coordinate vector expression corresponding to the same map location point;
    a second sub-model training unit, configured to train a second sub-model according to a second training sample composed of a text expression and a signboard image expression corresponding to the same map location point; and
    a sub-model fusion unit, configured to fuse the first sub-model and the second sub-model to obtain the target map model.
  9. The apparatus according to claim 8, wherein the parameter acquisition unit comprises a coordinate vector expression acquisition subunit configured to acquire the coordinate vector expression of each map location point, and the coordinate vector expression acquisition subunit comprises:
    a boundary coordinate sequence acquisition module, configured to acquire a boundary coordinate sequence of each map location point;
    a geocode set calculation module, configured to calculate, by using a geocoding algorithm and the boundary coordinate sequence, a geocode set covering the geographic block in which the corresponding map location point is located;
    a geographic string conversion module, configured to convert the geocode set containing the geocodes into a geographic string; and
    a geographic vector conversion module, configured to convert the geographic string into a geographic vector and take the geographic vector as the coordinate vector expression of the corresponding map location point.
  10. The apparatus according to claim 9, wherein the geographic vector conversion module is further configured to:
    input the geographic string into a preset vector expression conversion model, wherein the vector expression conversion model is used to characterize the correspondence between geographic strings and geographic vectors; and
    receive the geographic vector output by the vector expression conversion model.
  11. The apparatus according to claim 8, wherein the parameter acquisition unit comprises a signboard image expression acquisition subunit configured to acquire the signboard image expression of each map location point, and the signboard image expression acquisition subunit comprises:
    a signboard image acquisition module, configured to acquire a signboard image of the building corresponding to each map location point;
    a character recognition and character image cutting module, configured to recognize a character part from the signboard image and cut out a character image corresponding to each character of the character part; and
    a character image sorting module, configured to arrange the character images in the order of the characters in the character part and take the resulting character image queue as the signboard image expression of the corresponding map location point.
  12. The apparatus according to claim 11, wherein the signboard image expression acquisition subunit further comprises:
    an anomaly recognition module, configured to perform image anomaly recognition on the signboard image before the character part is recognized from the signboard image, wherein the image anomaly recognition comprises at least one of blur recognition, noise recognition, and skew recognition;
    correspondingly, the character recognition sub-module in the character recognition and character image cutting module is further configured to:
    recognize the character part only from signboard images recognized as non-anomalous images.
  13. The apparatus according to any one of claims 8-12, wherein the sub-model fusion unit is further configured to:
    fuse the first sub-model and the second sub-model to obtain an initial map model; and
    adjust parameters of the initial map model until a preset iteration exit condition is met, and output the initial map model that meets the iteration exit condition as the target map model.
  14. A positioning apparatus, comprising:
    a positioning image and current position acquisition unit, configured to acquire a positioning image obtained by photographing a signboard of a target building, and a current position;
    an actual coordinate vector expression and actual signboard image expression determination unit, configured to determine an actual coordinate vector expression according to the current position and determine an actual signboard image expression according to the positioning image;
    a shooting position text expression determination unit, configured to call a target map model to determine a shooting position text expression corresponding to the actual coordinate vector expression, wherein the target map model is obtained by the apparatus for training a target map model according to any one of claims 8-13;
    a candidate text expression sequence determination unit, configured to call the target map model to determine a candidate text expression sequence corresponding to the actual signboard image expression;
    a presentation priority ranking adjustment unit, configured to adjust a presentation priority ranking of each candidate text expression in the sequence based on a distance between the shooting position text expression and each candidate text expression in the candidate text expression sequence; and
    an actual position determination unit, configured to locate an actual position of the target building based on the candidate text expression sequence whose presentation priority has been adjusted.
  15. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor, wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method for training a target map model according to any one of claims 1-6 and/or the positioning method according to claim 7.
  16. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method for training a target map model according to any one of claims 1-6 and/or the positioning method according to claim 7.
  17. A computer program product, comprising a computer program which, when executed by a processor, implements the steps of the method for training a target map model according to any one of claims 1-6 and/or the steps of the positioning method according to claim 7.
PCT/CN2022/104939 2021-10-18 2022-07-11 Method for training target map model, positioning method, and related apparatuses WO2023065731A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111211145.2A CN113947147B (en) 2021-10-18 2021-10-18 Training method, positioning method and related device of target map model
CN202111211145.2 2021-10-18

Publications (1)

Publication Number Publication Date
WO2023065731A1 true WO2023065731A1 (en) 2023-04-27

Family

ID=79331241

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/104939 WO2023065731A1 (en) 2021-10-18 2022-07-11 Method for training target map model, positioning method, and related apparatuses

Country Status (2)

Country Link
CN (1) CN113947147B (en)
WO (1) WO2023065731A1 (en)


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113947147B (en) * 2021-10-18 2023-04-18 北京百度网讯科技有限公司 Training method, positioning method and related device of target map model
CN114842464A (en) * 2022-05-13 2022-08-02 北京百度网讯科技有限公司 Image direction recognition method, device, equipment, storage medium and program product
CN114926655B (en) * 2022-05-20 2023-09-26 北京百度网讯科技有限公司 Training method and position determining method of geographic and visual cross-mode pre-training model
CN114998684B (en) * 2022-05-20 2023-06-23 北京百度网讯科技有限公司 Training method and positioning adjustment method for geographic and visual cross-mode pre-training model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170004374A1 (en) * 2015-06-30 2017-01-05 Yahoo! Inc. Methods and systems for detecting and recognizing text from images
CN111461203A (en) * 2020-03-30 2020-07-28 北京百度网讯科技有限公司 Cross-modal processing method and device, electronic equipment and computer storage medium
CN111737383A (en) * 2020-05-21 2020-10-02 百度在线网络技术(北京)有限公司 Method for extracting spatial relation of geographic position points and method and device for training extraction model
CN112633380A (en) * 2020-12-24 2021-04-09 北京百度网讯科技有限公司 Interest point feature extraction method and device, electronic equipment and storage medium
CN113947147A (en) * 2021-10-18 2022-01-18 北京百度网讯科技有限公司 Training method and positioning method of target map model and related devices

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113052962B (en) * 2021-04-02 2022-08-19 北京百度网讯科技有限公司 Model training method, information output method, device, equipment and storage medium


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114185908A (en) * 2021-12-13 2022-03-15 北京百度网讯科技有限公司 Map data processing method and device, electronic equipment and storage medium
CN114185908B (en) * 2021-12-13 2024-02-06 北京百度网讯科技有限公司 Map data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113947147A (en) 2022-01-18
CN113947147B (en) 2023-04-18
