CN111862213A - Positioning method and device, electronic equipment and computer readable storage medium
- Publication number
- CN111862213A CN111862213A CN202010742468.3A CN202010742468A CN111862213A CN 111862213 A CN111862213 A CN 111862213A CN 202010742468 A CN202010742468 A CN 202010742468A CN 111862213 A CN111862213 A CN 111862213A
- Authority
- CN
- China
- Prior art keywords
- image
- information
- semantic
- reference image
- pose information
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06T7/73 - Determining position or orientation of objects or cameras using feature-based methods
- G06T7/75 - Determining position or orientation of objects or cameras using feature-based methods involving models
- G06F18/22 - Pattern recognition; matching criteria, e.g. proximity measures
- G06N3/044 - Neural networks; recurrent networks, e.g. Hopfield networks
- G06N3/049 - Neural networks; temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06T7/10 - Image analysis; segmentation; edge detection
- G06T2207/10028 - Image acquisition modality; range image; depth image; 3D point clouds
- G06T2207/20081 - Special algorithmic details; training; learning
- G06T2207/20084 - Special algorithmic details; artificial neural networks [ANN]
Abstract
The embodiment of the application discloses a positioning method, which comprises the following steps: acquiring an image to be positioned, and performing pose identification processing on the image to be positioned to obtain initial pose information of the electronic equipment when the image to be positioned is acquired; determining a semantic consistency weight value of a reference image corresponding to the image to be positioned; the reference image refers to an image which is acquired by the electronic equipment and is adjacent to the image to be positioned; the semantic consistency weight value is used for representing the similarity between reference semantic information of the reference image and first semantic information of a first image; the first image is an image matched with the reference image in a scene map of a scene where the electronic equipment is located; and if the semantic consistency weight value of the reference image is greater than a weight threshold, determining target pose information of the electronic equipment when the image to be positioned is acquired, based on the relevant information of the reference image and the initial pose information. The embodiment of the application also discloses a positioning device, electronic equipment and a computer readable storage medium.
Description
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a positioning method and apparatus, an electronic device, and a computer-readable storage medium.
Background
Simultaneous localization and mapping (SLAM) technology enables an electronic device to localize itself and map its environment from sensor information, and is widely applied in fields such as robotics, unmanned aerial vehicles, autonomous driving, and virtual reality.
At present, the positioning result of a traditional SLAM positioning method is affected by changes in illumination, environment, and texture, and positioning may fail under conditions such as strong illumination or weak image texture. Although positioning methods based on neural networks can estimate the pose information of the electronic device, the error between the pose information estimated by the neural network and the actual pose information of the electronic device is large, making it difficult to meet practical application requirements.
Disclosure of Invention
The embodiment of the application provides a positioning method and device, electronic equipment and a computer readable storage medium.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides a positioning method, which is applied to electronic equipment and comprises the following steps:
acquiring an image to be positioned, and performing pose identification processing on the image to be positioned to obtain initial pose information of the electronic equipment when the image to be positioned is acquired;
determining a semantic consistency weight value of a reference image corresponding to the image to be positioned; the reference image refers to an image which is acquired by the electronic equipment and adjacent to the image to be positioned;
the semantic consistency weight value is used for representing the similarity between the reference semantic information of the reference image and the first semantic information of the first image; the first image is an image matched with the reference image in a scene map of a scene where the electronic equipment is located;
and if the semantic consistency weight value of the reference image is greater than a weight threshold, determining target pose information of the electronic equipment when the image to be positioned is acquired based on the relevant information of the reference image and the initial pose information.
The embodiment of the application provides a positioning device, which is applied to an electronic device, and the device includes:
the acquisition unit is used for acquiring an image to be positioned;
the first processing unit is used for carrying out pose identification processing on the image to be positioned to obtain initial pose information of the electronic equipment when the image to be positioned is collected;
the determining unit is used for determining a semantic consistency weight value of a reference image corresponding to the image to be positioned; the reference image refers to an image which is acquired by the electronic equipment and adjacent to the image to be positioned;
the semantic consistency weight value is used for representing the similarity between the reference semantic information of the reference image and the first semantic information of the first image; the first image is an image matched with the reference image in a scene map of a scene where the electronic equipment is located;
and the second processing unit is used for determining target pose information of the electronic equipment when the image to be positioned is acquired based on the relevant information of the reference image and the initial pose information if the semantic consistency weight value of the reference image is greater than a weight threshold value.
An embodiment of the application provides an electronic device comprising a processor and a memory for storing a computer program capable of running on the processor;
the processor and the memory are connected through a communication bus;
wherein the processor executes the steps of the positioning method when executing the computer program stored in the memory.
Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the method provided by the embodiments of the present application.
The embodiment of the application has the following beneficial effects:
the positioning method provided by the embodiment of the application comprises the steps of firstly, acquiring an image to be positioned, and carrying out pose identification processing on the image to be positioned to obtain initial pose information of the electronic equipment when the image to be positioned is acquired; then, determining a semantic consistency weight value of a reference image corresponding to the image to be positioned; the reference image refers to an image which is acquired by the electronic equipment and adjacent to the image to be positioned; the semantic consistency weight value is used for representing the similarity between the reference semantic information of the reference image and the first semantic information of the first image; the first image is an image matched with the reference image in a scene map of a scene where the electronic equipment is located; and finally, determining target pose information of the electronic equipment when the image to be positioned is acquired based on the relevant information of the reference image and the initial pose information when the semantic consistency weight value of the reference image is detected to be larger than a weight threshold value. Therefore, when the semantic consistency weight value of the reference image is larger than the weight threshold, the electronic equipment can optimize and adjust the initial pose information corresponding to the image to be positioned based on the relevant information of the reference image to obtain accurate target pose information.
Drawings
Fig. 1 is a first schematic flowchart of a positioning method according to an embodiment of the present disclosure;
fig. 2 is a second schematic flowchart of a positioning method according to an embodiment of the present application;
FIG. 3 is a schematic diagram illustrating a training process of an exemplary image processing model according to an embodiment of the present disclosure;
fig. 4 is a third schematic flowchart of a positioning method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a reprojection error calculation according to an embodiment of the present disclosure;
fig. 6 is a schematic view of a scene map construction process provided in an embodiment of the present application;
fig. 7 is a fourth schematic flowchart of a positioning method according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a positioning device according to an embodiment of the present disclosure;
fig. 9 is a schematic diagram of a hardware component structure of an electronic device according to an embodiment of the present application.
Detailed Description
So that the manner in which the features and elements of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings.
Unless defined otherwise, all technical and scientific terms used in the examples of this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the embodiments of the present application is for the purpose of describing the embodiments of the present application only and is not intended to be limiting of the present application.
For the problems in the related art, an embodiment of the present application provides a positioning method, where an execution subject of the positioning method may be the positioning apparatus provided in the embodiment of the present application, or an electronic device integrated with the positioning apparatus, where the positioning apparatus may be implemented in a hardware or software manner. The electronic device includes, but is not limited to, a smart phone, an Augmented Reality (AR) device, an intelligent robot, a notebook computer, an intelligent wearable device, and the like. Here, the electronic device may implement a positioning navigation function of the electronic device in various scenes (e.g., indoor scenes) through the positioning method provided in the embodiment of the present application.
Referring to fig. 1, fig. 1 is a first flowchart illustrating a positioning method according to an embodiment of the present disclosure, and as shown in fig. 1, the positioning method includes steps 110 to 130. Wherein,
Step 110, acquiring an image to be positioned, and performing pose identification processing on the image to be positioned to obtain initial pose information of the electronic equipment when the image to be positioned is acquired.
In the embodiment provided by the application, the electronic device can acquire multiple frames of continuous images or a section of continuous video stream in the current scene through the camera, process each frame of image in the acquired multiple frames of continuous images or continuous video stream, and obtain the pose information of the electronic device when acquiring each frame of image, so as to determine the accurate pose information of the electronic device in the scene map of the current scene.
It should be noted that the electronic device may obtain a scene map of a current scene from the cloud server, and the electronic device may also pre-construct a scene map of a scene, where different scenes may correspond to different scene maps; each scene map includes environmental features of the scene, such as walls, pillars, etc. Here, the scene map may be a three-dimensional point cloud map, i.e. the map is represented by a plurality of sets of discrete point cloud data; the scene map may also be a two-dimensional map, i.e. the map is represented by a multi-frame red, green and blue (RGB) image. The embodiment of the present application does not limit the type of the scene map.
Therefore, under the condition that the electronic equipment is located in an unfamiliar environment, the electronic equipment can be compared with a scene map of the current scene by acquiring images of the surrounding environment, so that the pose of the electronic equipment in the current scene map can be determined, and positioning and navigation are achieved.
Here, the image to be positioned referred to in the embodiment of the present application is any one of the above-described continuous multi-frame images.
In a possible implementation manner, the electronic device may process the image to be positioned through a pre-trained neural network model, so as to obtain initial pose information of the electronic device when the image to be positioned is acquired.
In another possible implementation manner, the electronic device may perform feature matching between the image to be positioned and information in the scene map to obtain the initial pose information of the electronic device when the image to be positioned is acquired.

The embodiment of the application does not limit the manner of determining the initial pose information of the electronic device when the image to be positioned is acquired.
Step 120, determining a semantic consistency weight value of a reference image corresponding to the image to be positioned; the reference image refers to an image which is acquired by the electronic equipment and is adjacent to the image to be positioned.

The semantic consistency weight value is used for representing the similarity between reference semantic information of the reference image and first semantic information of a first image; the first image refers to an image which is matched with the reference image in a scene map of a scene where the electronic equipment is located.
In practical application, pose information determined in the scene map from the information of a single frame of the image to be positioned alone is often inaccurate. Therefore, in the embodiment of the application, the electronic device can also acquire a reference image that has an association relationship with the image to be positioned, and determine the target pose information corresponding to the current image to be positioned by combining the relevant information of the reference image.
In the embodiment provided by the application, the electronic device can acquire continuous multi-frame images in the current scene through the camera, and the reference image is an image adjacent to the image to be positioned in the continuous multi-frame images. For example, the reference image may be a previous frame image of the image to be positioned, or the reference image may be a next frame image of the image to be positioned.
Here, the electronic device may determine the similarity between the reference image and a specific image in the scene map, and if the similarity between the reference image and the specific image is higher, the electronic device may indicate that the correlation between the reference image and the scene map is higher, and conversely, the correlation between the reference image and the scene map is lower. Therefore, the initial pose information of the image to be positioned is adjusted based on the related information of the reference image with high correlation with the scene map, and the precision of the pose information of the image to be positioned can be further improved.
It should be noted that the specific image is the first image mentioned in step 120, and the first image is an image matching the reference image in the scene map.
In an embodiment provided by the application, the electronic device may measure the similarity between the reference image and the first image by using a semantic consistency weight value of the reference image.
Specifically, the electronic device may determine reference semantic information of the reference image and first semantic information of the first image, and further calculate a similarity between the reference semantic information and the first semantic information to obtain a semantic consistency weight value of the reference image. The semantic information refers to the category of each pixel point in the image.
It can be understood that the higher the semantic consistency weight value of the reference image, the higher the correlation between the reference image and the scene map, and the lower the semantic consistency weight value of the reference image, the lower the correlation between the reference image and the scene map.
Step 130, if the semantic consistency weight value of the reference image is greater than the weight threshold, determining target pose information of the electronic equipment when the image to be positioned is acquired, based on the relevant information of the reference image and the initial pose information.
In the embodiment provided by the application, the electronic device may determine a relationship between a semantic consistency weight value of a reference image and a weight threshold, and if the semantic consistency weight value of the reference image is greater than the weight threshold, the correlation between the reference image and a scene map is considered to be large, otherwise, if the semantic consistency weight value of the reference image is less than or equal to the weight threshold, the correlation between the reference image and the scene map is considered to be small.
Therefore, when the semantic consistency weight value of the reference image is larger than the weight threshold, the electronic equipment can optimize and adjust the initial pose information corresponding to the image to be positioned based on the relevant information of the reference image to obtain accurate target pose information.
In the embodiments provided in the present application, the related information of the reference image may include: reference pose information of the reference image, and/or key feature points of the reference image. Here, the key feature point of the reference image refers to a feature point of a key region in the reference image, and for example, the key feature point may be a point with strong texture or a point with large color change in the reference image.
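For illustration, the following is a minimal Python sketch of extracting such key feature points. OpenCV's ORB detector and the max_points value are assumptions on our part; the embodiment does not name a specific detector.

```python
import cv2

def extract_key_feature_points(reference_image, max_points=500):
    # ORB responds to corners and strongly textured regions, matching the
    # notion of "points with strong texture or large color change" above.
    # The detector choice and max_points are illustrative assumptions.
    orb = cv2.ORB_create(nfeatures=max_points)
    keypoints, descriptors = orb.detectAndCompute(reference_image, None)
    return keypoints, descriptors
```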
Therefore, the positioning method provided by the embodiment of the application can determine the semantic consistency weight value of the reference image corresponding to the image to be positioned, and obtains the target pose information of the electronic equipment when the current image to be positioned is acquired by using the relevant information of the reference image with the larger semantic consistency weight value, so that the pose calculation accuracy is improved.
In the following, the process of determining the semantic consistency weight value of the reference image corresponding to the image to be located in step 120 is exemplarily described.
Referring to the second flowchart of the positioning method shown in fig. 2, step 120 may be implemented by steps 1201 to 1203. Wherein,
Step 1201, acquiring reference pose information and reference semantic information corresponding to the reference image.

Here, the electronic device may perform pose recognition processing and semantic segmentation processing on the reference image to obtain reference pose information of the electronic equipment when the reference image is acquired and reference semantic information of the reference image.
In one possible implementation, step 1201 may be implemented as follows:
and performing pose recognition processing and semantic segmentation processing on the reference image through the image processing model to obtain reference pose information of the electronic equipment when the reference image is acquired and reference semantic information of the reference image.
In embodiments provided herein, the image processing model may be a pre-trained neural network model. The image processing model may be obtained by training a sample image by the electronic device, or may be obtained by the electronic device from a cloud server providing the image processing model.
Here, the image processing model is used to determine pose information of the electronic device at the time of capturing the image, and semantic information of the image. That is to say, the image processing model may be a multitasking model, that is, the image processing model may process the same image to obtain two different types of processing results, that is, pose information when acquiring the image and semantic information of the image, so that the efficiency of image processing may be improved.
In one possible implementation, the image processing model may include convolutional layers, temporal cyclic neural networks, and semantic segmentation networks.
Specifically, the above-mentioned performing pose recognition processing and semantic segmentation processing on the reference image through the image processing model to obtain the reference pose information of the electronic device when the reference image is acquired and the reference semantic information of the reference image may specifically be implemented through the following steps:
step 1201a, extracting the features of the reference image through the convolution layer to obtain the reference image features of the reference image;
step 1201b, based on the related information of the previous frame of image adjacent to the reference image, performing pose identification processing on the reference image features through a time cycle neural network to obtain reference pose information of the electronic equipment when the reference image is collected;
step 1201c, performing semantic segmentation processing on the reference image features through a semantic segmentation network to obtain reference semantic information of the reference image.
It is understood that the convolutional layer is used for convolution processing of the reference image, i.e. the convolutional layer can extract the image features of the reference image to obtain the reference image features.
And the time cycle neural network is used for determining the pose information of the electronic equipment when the image is acquired. In the embodiment provided by the application, the electronic device can take the related information of the previous frame image adjacent to the reference image and the extracted reference image characteristics as the input of the time-cycle neural network, and obtain the reference pose information of the electronic device when the reference image is acquired through the processing of the time-cycle neural network.
In one possible implementation, the time-cycle neural network may be a Long Short-Term Memory (LSTM) network.
In addition, the semantic segmentation network is specifically used for determining semantic information of the image. In the embodiment provided by the application, the electronic device may use the reference image features obtained by the convolutional layer as input of the semantic segmentation network, and obtain the reference semantic information of the reference image through processing of the semantic segmentation network.
It should be noted that, the execution sequence of step 1201b and step 1201c is not sequential.
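As an illustration of steps 1201a to 1201c, the following PyTorch sketch wires a shared convolutional feature extractor to an LSTM pose head and a semantic segmentation head. All layer sizes, the number of classes, and the 7-dimensional pose encoding (translation plus quaternion) are assumptions; the embodiment fixes only the overall three-part structure.

```python
import torch
import torch.nn as nn

class PoseSemanticNet(nn.Module):
    """Sketch of the multi-task image processing model: a shared
    convolutional backbone feeding (a) a time-cycle (LSTM) pose head and
    (b) a semantic segmentation head. Sizes are illustrative assumptions."""

    def __init__(self, num_classes=21, feat_dim=128, hidden_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(            # convolutional layer(s)
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d((8, 8))
        self.pose_lstm = nn.LSTM(feat_dim * 64, hidden_dim, batch_first=True)
        self.pose_fc = nn.Linear(hidden_dim, 7)   # translation (3) + quaternion (4)
        self.seg_head = nn.Sequential(            # semantic segmentation network
            nn.Conv2d(feat_dim, num_classes, 1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, frame, state=None):
        feats = self.backbone(frame)              # image features (step 1201a)
        seq = self.pool(feats).flatten(1).unsqueeze(1)   # one LSTM time step
        out, state = self.pose_lstm(seq, state)   # `state` carries the previous
        pose = self.pose_fc(out[:, -1])           # frame's related information
        semantics = self.seg_head(feats)          # per-pixel class scores (1201c)
        return pose, semantics, state
```

At inference time the returned state is fed back in with the next frame, which is how the related information of the previous adjacent frame enters step 1201b.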
Step 1202, determining, from the scene map, a first image matched with the reference pose information, and acquiring first semantic information corresponding to the first image.

In an embodiment provided by the application, the electronic device may determine, according to the reference pose information of the reference image, the first image matched with the reference image.
It should be noted that the scene map may include, in addition to a plurality of sets of point cloud data or a plurality of frames of scene images, pose information corresponding to each set of point cloud data or each frame of scene image. In this way, in the embodiment provided by the application, the electronic device may search, according to the pose information of the reference image, the pose information matched with the reference pose information from the scene map, and determine the scene image corresponding to the searched pose information as the first image.
Further, after obtaining the first image, the electronic device may further obtain first semantic information of the first image. In a possible example, semantic information of each group of point cloud data or each frame of image may also be included in the scene map, so that the electronic device may directly obtain the first semantic information of the first image from the scene map. In another possible example, after obtaining the first image, the electronic device may indirectly obtain the first semantic information by performing semantic segmentation processing on the first image. The embodiment of the application does not limit the manner of obtaining the first semantic information.
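A minimal sketch of looking up the first image by pose is given below, assuming the scene map stores one pose per scene image and that Euclidean distance on the translation component is an acceptable matching criterion (the embodiment does not fix a metric):

```python
import numpy as np

def find_first_image(reference_pose, map_poses):
    # reference_pose and each map entry are assumed to be dicts holding a
    # 3-vector translation under key "t" (a hypothetical layout).
    t_ref = reference_pose["t"]
    dists = [np.linalg.norm(t_ref - p["t"]) for p in map_poses]
    return int(np.argmin(dists))   # index of the matched (first) image
```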
In one possible implementation, the scene map may be a three-dimensional point cloud map, i.e., the scene map is represented by a plurality of discrete sets of point cloud data.
Based on this, step 1202 determines a first image matching the reference pose information from the scene map, which may be implemented by:
step 1202a, searching a first three-dimensional point cloud set matched with reference pose information from a scene map;
and 1202b, projecting the points in the first three-dimensional point cloud set to a coordinate system of a reference image to obtain a first image.
Specifically, the reference pose information of the reference image may include a rotation matrix R and a translation vector T of the electronic device. Based on the reference pose information (R, T) of the reference image, the electronic device can determine a first three-dimensional point cloud set P = [X_w, Y_w, Z_w]^T in the scene map that matches (R, T).

Further, the electronic device may re-project the first three-dimensional point cloud set into the image coordinate information of the reference image. Specifically, the coordinates of the first three-dimensional point cloud set in the reference image coordinate system can be obtained by formula (1):

s · [u, v, 1]^T = K · (R · [X_w, Y_w, Z_w]^T + T)    (1)

wherein [u, v, 1]^T is the coordinate of a point of the first three-dimensional point cloud set in the reference image coordinate system, K is the intrinsic parameter matrix of the camera of the electronic device, and s is the scale information between the scene map coordinate system and the reference image coordinate system.
In this way, the electronic device may derive the first image based on the coordinates of the first set of three-dimensional point clouds in the reference image coordinate system.
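A sketch of the re-projection in formula (1), assuming points_world is an (N, 3) array of map points and K the 3x3 camera intrinsic matrix:

```python
import numpy as np

def project_point_cloud(points_world, K, R, T):
    # Formula (1): s * [u, v, 1]^T = K * (R * [Xw, Yw, Zw]^T + T);
    # the homogeneous scale s is divided out to obtain pixel coordinates.
    cam = R @ points_world.T + T.reshape(3, 1)   # map frame -> camera frame
    uvw = K @ cam                                # s * [u, v, 1]^T, column-wise
    return (uvw[:2] / uvw[2]).T                  # (N, 2) pixel coordinates
```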
In one possible implementation, semantic information of each point in the three-dimensional point cloud map may also be stored in the three-dimensional point cloud map.
In this way, after the first image is obtained in the steps 1202a and 1202b, the electronic device may also obtain the first semantic information corresponding to the first image from the three-dimensional point cloud map.
Step 1203, calculating the similarity between the reference semantic information and the first semantic information to obtain the semantic consistency weight value of the reference image.

In the embodiment provided by the application, the electronic device can obtain the semantic consistency weight value of the reference image by calculating the Euclidean distance between the reference semantic information and the first semantic information; the electronic device may also calculate the repetition rate of the reference semantic information and the first semantic information to obtain the semantic consistency weight value of the reference image. The embodiments of the present application are not limited here.
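As a sketch of the repetition-rate variant, assuming both semantic maps are integer label arrays of the same shape (the Euclidean-distance variant is analogous):

```python
import numpy as np

def semantic_consistency_weight(ref_semantics, first_semantics):
    # Fraction of pixels whose predicted classes agree; 1.0 means the
    # reference image and the first image are semantically identical.
    assert ref_semantics.shape == first_semantics.shape
    return float(np.mean(ref_semantics == first_semantics))
```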
Next, a training process of the image processing model is exemplarily described.
In the embodiment provided by the application, the image processing model is obtained by training the image processing model to be trained. Wherein, the model to be trained comprises: the training method comprises the following steps of a convolutional layer to be trained, a time cycle neural network to be trained and a semantic segmentation network to be trained.
Referring to FIG. 3, a schematic diagram of a training process for an exemplary image processing model is shown; the training process may include the steps of:
the method comprises the first step of obtaining a plurality of continuously shot sample images 3-1.
Here, the multi-frame sample image 3-1 may be an image included in a video stream continuously shot. There is temporal continuity between each frame sample image of the plurality of frame sample images 3-1.
And secondly, inputting the multi-frame sample image 3-1 into the convolutional layer to be trained 3-2 for feature extraction according to the time sequence of the acquisition of the multi-frame sample image 3-1 to obtain a plurality of sample image features 3-3.
Here, the multi-frame sample images 3-1 acquired in the first step may be continuously input into the convolutional layer in the time order of acquisition for processing, that is, the image features of each frame sample image are extracted by the convolutional layer. Thus, after the convolutional layer processing, a plurality of sample image features 3-3 corresponding to the plurality of frames of continuous sample images 3-1 are obtained.
It should be noted that the plurality of sample images and the plurality of sample image features correspond to one another. I.e. one sample image corresponds to one sample image feature.
And thirdly, determining a plurality of pieces of predicted pose information 3-5 corresponding to the plurality of sample image features 3-3 through a to-be-trained time cycle neural network 3-4.
After obtaining the sample image features 3-3, inputting the sample image features 3-3 corresponding to the continuous multi-frame sample images into a time cyclic neural network 3-4 to be trained, wherein the time cyclic neural network 3-4 to be trained can transmit the related information of the previous frame sample image of the current sample image to the pose prediction of the next frame sample image, and the pose calculation precision can be improved by reasonably utilizing the related information among the sample images and then performing the pose prediction.
Note that there is a one-to-one correspondence between the plurality of sample image features and the plurality of predicted pose information. It is understood that there is also a one-to-one correspondence between the plurality of sample images and the plurality of predicted pose information.
Specifically, in the third step, a plurality of pieces of predicted pose information corresponding to a plurality of sample image features are determined through a time-cycle neural network to be trained, and the determination can be realized through the following steps:
step 1, based on the relevant information of an i-1 th frame of sample image, carrying out pose recognition processing on the sample characteristics corresponding to the i-th frame of sample image through a to-be-trained time cycle neural network, and determining the predicted pose information corresponding to the i-th frame of sample image and the relevant information of the i-th frame of sample image; wherein i is an integer greater than 1 and less than or equal to N;
step 2, based on the relevant information of the sample image of the ith frame, continuously performing pose recognition processing on the corresponding sample feature of the sample image of the (i + 1) th frame through a to-be-trained time cycle neural network, determining the predicted pose information corresponding to the sample image of the (i + 1) th frame until i is equal to N-1, and obtaining the predicted pose information corresponding to the features of the plurality of sample images respectively; n is the total number of the multi-frame sample images.
It can be understood that, when the electronic device performs pose recognition processing on the sample image features of the current i-th frame sample image through the time-cycle neural network to be trained, the relevant information of the previous frame sample image (i.e., the i-1-th frame sample image) of the i-th frame sample image and the image feature information of the i-th frame sample image may be input into the time-cycle neural network to be trained for processing, so as to obtain the predicted pose information corresponding to the current i-th frame sample image.
Then, the electronic device continues to perform pose recognition processing on the sample image features of the sample image of the (i + 1) th frame, and specifically, the electronic device may use the sample image features of the sample image of the (i + 1) th frame and the related information of the sample image of the previous frame (i.e., the sample image of the (i) th frame) as the input of the time-loop neural network to be trained, and obtain the predicted pose information corresponding to the sample image of the (i + 1) th frame through processing of the time-loop neural network to be trained.
Further, the electronic equipment continues to perform pose recognition processing on the sample image of the (i + 2) th sample image in the manner described above until the processing is stopped when the value of i is N-1. Therefore, the electronic equipment can obtain the corresponding predicted pose information of each frame of sample image.
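The recurrence in steps 1 and 2 can be sketched as follows, reusing the PoseSemanticNet sketch above. Here the LSTM hidden state stands in for the "related information" passed between sample images, and sample_frames is assumed to be a time-ordered list of (3, H, W) image tensors:

```python
model = PoseSemanticNet()
state = None                          # no related information before frame 1
predicted_poses = []
for frame in sample_frames:           # frames i = 1 .. N in acquisition order
    pose, _, state = model(frame.unsqueeze(0), state)
    predicted_poses.append(pose)      # predicted pose of frame i; `state` now
                                      # carries frame i's related information
```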
And fourthly, determining a plurality of prediction semantic information 3-7 corresponding to the plurality of sample image features 3-3 through the semantic segmentation network 3-6 to be trained.
Here, the electronic device may perform semantic segmentation processing on the sample image features 3-3 corresponding to the plurality of sample images obtained in the second step through the to-be-trained semantic segmentation network 3-6 to obtain predicted semantic information 3-7 corresponding to each sample image feature.
It should be noted that the execution sequence of the third step and the fourth step is not sequential.
And fifthly, adjusting network parameters of the image processing model to be trained based on the loss of the plurality of pieces of predicted pose information 3-5 and the loss of the plurality of pieces of predicted semantic information 3-7 so that the loss of the output result of the obtained image processing model meets the convergence condition.
In the embodiment provided by the application, the image processing model to be trained is a newly built initial model, so the predicted pose information and predicted semantic information obtained after the image processing model to be trained processes a sample image are not the true pose information and semantic information of the sample image. There are differences between the predicted pose information and the actual pose information, and between the predicted semantic information and the actual semantic information.
Here, the loss between each piece of predicted pose information and the true pose information of the sample image corresponding to the predicted pose information, and the loss between each piece of predicted semantic information and the true semantic information of the sample image corresponding to the predicted semantic information may be calculated, and the network parameters of the image processing model to be trained, that is, the network parameters of the convolutional layer to be trained, the time-looping neural network to be trained, and the semantic segmentation network to be trained, may be adjusted.
In the embodiment provided by the application, the electronic device can train the image processing model to be trained through a multi-classification loss function of semantic segmentation and an Euclidean distance loss function of pose identification.
Further, the electronic device continues to perform the same processing as the second step to the fourth step on the sample image based on the adjusted to-be-trained image processing model to obtain a set of output results (i.e., a plurality of pieces of predicted pose information and a plurality of pieces of predicted semantic information). And when the output result meets the convergence condition, stopping the training process of the image processing model to be trained to obtain the trained image processing model. And when the output result does not meet the convergence condition, continuously adjusting the network parameters of the image processing model to be trained according to the loss of the plurality of pieces of predicted pose information and the loss of the plurality of pieces of predicted semantic information, and continuously processing the sample image through the adjusted image processing model to be trained in the same way as the second step to the fourth step to obtain a group of output results until the output results meet the convergence condition.
Here, the convergence condition may be that a loss value between each predicted pose information and its corresponding real pose information is smaller than a first specific value, and a loss value between each predicted semantic information and its corresponding real semantic information is smaller than a second specific value.
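A sketch of the combined training objective described above (a multi-class loss for semantic segmentation plus a Euclidean-distance loss for pose identification); the 1:1 weighting of the two terms is an assumption:

```python
import torch
import torch.nn.functional as F

def multitask_loss(pred_pose, true_pose, pred_sem, true_sem):
    # Euclidean distance between predicted and real pose vectors.
    pose_loss = torch.norm(pred_pose - true_pose, dim=-1).mean()
    # Multi-class segmentation loss; true_sem holds per-pixel class ids.
    sem_loss = F.cross_entropy(pred_sem, true_sem)
    return pose_loss + sem_loss       # training stops once both terms fall
                                      # below their convergence values
```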
According to the embodiment of the application, in the training process of the image processing model to be trained, the relevant information among the images can be reasonably utilized to predict the pose, and the calculation precision of the pose is improved.
Based on the foregoing embodiment, in the embodiment provided in this application, in step 110, performing pose identification processing on an image to be positioned to obtain initial pose information of an electronic device when the image to be positioned is acquired, which may be implemented in the following manner:
and performing pose identification processing on the image to be positioned through the image processing model to obtain initial pose information of the electronic equipment when the image to be positioned is acquired.
That is to say, the electronic device may obtain the initial pose information of the image to be positioned according to the relationship between the continuous image frames by using the image processing model, so as to improve the accuracy of the initial pose information.
Based on the foregoing embodiment, referring to the third flowchart of the positioning method shown in fig. 4, in step 130, determining the target pose information of the electronic device when the image to be positioned is acquired based on the related information of the reference image and the initial pose information can also be implemented by the following steps. Wherein,

Step 1301, determining pose information to be adjusted based on the related information of the reference image and the initial pose information;

Step 1302, selecting, from M frames of images adjacent to the image to be positioned, a second image with a larger semantic consistency weight value, and acquiring second pose information of the second image;

Step 1303, acquiring, from the scene map, a second three-dimensional point cloud set matched with the second pose information;

Step 1304, performing feature point matching processing on the second three-dimensional point cloud set and the image to be positioned to obtain matched feature point pairs;

Step 1305, constructing a target optimization equation based on the matched feature point pairs, and optimizing the pose information to be adjusted based on the optimization equation to obtain the target pose information.
In the embodiment provided by the application, after the pose information to be adjusted is obtained based on the related information of the reference image and the initial pose information of the image to be positioned, the electronic device can further improve the positioning result: among the M frames of images adjacent to the image to be positioned, it selects, according to the pose result of the image with the larger semantic consistency weight value, feature points in the three-dimensional point cloud map, matches them with the feature points extracted from the current image to be positioned, and then obtains a more accurate pose calculation result through PnP (Perspective-n-Point) nonlinear optimization.
It will be appreciated that the M frames of images adjacent to the image to be positioned may be images whose content is similar to that of the image to be positioned, i.e., images of the same scene. A second image with a larger semantic consistency weight value is selected from the M frames of images; that is, the selected second image is the image most relevant to the scene map, and its corresponding second pose information is the most accurate.
Further, the electronic device may acquire a second three-dimensional point cloud set matched with the second pose information from the scene map, and perform feature point matching through the second three-dimensional point cloud set and the image features of the image to be located to obtain a plurality of matching feature point pairs.
The process of obtaining target pose information corresponding to an image to be positioned through PnP nonlinear optimization is described in detail below.
For convenience of description, in the embodiment of the present application, P is used to represent a feature point in the second three-dimensional point cloud set, p is used to represent a feature point in the image to be positioned, and (P, p) may form a matched feature point pair.
The feature points in the second three-dimensional point cloud set are projected into the coordinate system of the image to be positioned, where the coordinates of a feature point P can be represented by formula (2):

s_i · [u'_i, v'_i, 1]^T = K · [R | T] · [X_i, Y_i, Z_i, 1]^T    (2)

wherein i denotes the feature point index, [X_i, Y_i, Z_i, 1]^T is the homogeneous coordinate of the feature point P in the coordinate system of the scene map, K is the intrinsic parameter matrix of the camera of the electronic device, and [R | T] is the homogeneous transformation formed by the pose information (R, T) to be adjusted of the image to be positioned. [u'_i, v'_i, 1]^T is the theoretical projection coordinate of the feature point P in the coordinate system of the image to be positioned.

However, the coordinate of the feature point p that actually matches the feature point P in the coordinate system of the image to be positioned is [u_i, v_i, 1]^T.

Further, the electronic device may calculate the error between the theoretical projection coordinate [u'_i, v'_i, 1]^T and the actual coordinate [u_i, v_i, 1]^T; this error is called the reprojection error. For example, referring to the schematic diagram of the reprojection error calculation shown in fig. 5, a point P in the three-dimensional coordinate system is projected to a point p' in the two-dimensional coordinate system, while the point in the actual two-dimensional coordinate system that matches P is p; as can be seen from fig. 5, a reprojection error exists between the projected point p' and the actually matched point p.
Based on this, the electronic device may calculate the reprojection errors of all the matched feature point pairs obtained in step 1304, and further construct a target optimization equation according to the reprojection errors. The target optimization equation may be a least-squares cost function, represented by formula (3):

t* = argmin_t (1/2) Σ_i || [u_i, v_i, 1]^T - (1/s_i) · K · t · [X_i, Y_i, Z_i, 1]^T ||^2    (3)

wherein t is the pose transformation matrix that minimizes the reprojection error of the matched feature point pairs, and is the state variable to be optimized in formula (3).

Specifically, the value of t is adjusted; when the target equation (3) reaches its minimum, the electronic device obtains the optimal pose information from the pose transformation matrix t.
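One plausible realization of the optimization in formula (3) is OpenCV's iterative PnP solver, seeded with the pose to be adjusted; this is a sketch under that assumption, not the patent's prescribed solver:

```python
import cv2
import numpy as np

def refine_pose(points_3d, points_2d, K, rvec0, tvec0):
    # points_3d: matched points P from the second three-dimensional point
    # cloud set; points_2d: matched points p in the image to be positioned.
    ok, rvec, tvec = cv2.solvePnP(
        points_3d.astype(np.float64), points_2d.astype(np.float64),
        K, None,                       # lens distortion ignored (assumption)
        rvec0, tvec0, useExtrinsicGuess=True,
        flags=cv2.SOLVEPNP_ITERATIVE)  # iterative reprojection-error fit
    return rvec, tvec                  # refined (target) pose
```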
In the embodiment provided by the present application, before the image to be positioned is obtained in step 110, a scene map may also be obtained.
In a possible implementation manner, the electronic device may obtain a scene map of a current scene from the cloud server.
In another possible implementation, the electronic device may pre-construct a scene map of the current scene.
Referring to the scene map construction flow diagram shown in fig. 6, the electronic device may construct a scene map by:
a, acquiring a plurality of frames of scene images through an image acquisition device;
b, constructing a three-dimensional point cloud map of the scene based on the multi-frame scene image;
c, performing semantic segmentation processing on the multi-frame scene images to obtain scene semantic information corresponding to each frame of scene image in the multi-frame scene images;
and d, performing semantic labeling on the three-dimensional objects in the three-dimensional point cloud map based on the scene semantic information corresponding to each frame of scene image, to obtain the scene map.
Here, the electronic device first captures a plurality of frames of scene images in a current scene through a vision sensor (i.e., a camera). Each frame of the scene image may include RGB data, as well as depth data, i.e., RGB-D data.
Further, the electronic equipment converts the RGB-D data of each frame of scene image into point cloud data, and then splices the point cloud data of the multiple frames of scene images to obtain the three-dimensional point cloud map.
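The conversion of a frame's depth data into point cloud data can be sketched as the inverse pinhole projection below; the stitching step additionally applies each frame's pose to bring the points into a common map frame, which is omitted here:

```python
import numpy as np

def depth_to_point_cloud(depth, K):
    # Back-project every pixel (u, v) with depth z into camera-frame 3D:
    # X = (u - cx) * z / fx, Y = (v - cy) * z / fy, Z = z.
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```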
In addition, the electronic equipment can perform semantic segmentation processing on the acquired scene images through a semantic segmentation algorithm to obtain semantic labels of each frame of scene images.
Furthermore, the electronic equipment semantically labels the points in the three-dimensional point cloud map according to the correspondence between the scene images and the points in the three-dimensional point cloud map established during map building, and at the same time eliminates the three-dimensional points belonging to dynamic objects in the three-dimensional point cloud map, to generate a scene map including semantic information.
therefore, the positioning method provided by the embodiment of the application can determine the semantic consistency weight value of the reference image corresponding to the image to be positioned, and obtains the target pose information of the electronic equipment when the current image to be positioned is acquired by using the relevant information of the reference image with the larger semantic consistency weight value, so that the pose calculation accuracy is improved.
The embodiments of the present application are described in detail below with reference to exemplary application scenarios.
The positioning method provided by the embodiment of the application can be applied to electronic equipment. When a user holding the electronic device needs to determine their position in an unfamiliar shopping-mall environment, the electronic device can provide an accurate positioning service for the user through the positioning method provided by the embodiment of the application.
Referring to fig. 7, a fourth positioning flow diagram, the electronic device first captures a continuous video stream 7-1. Further, the electronic equipment inputs the collected continuous video stream 7-1 into an image processing model 7-2 trained in advance, the image processing model 7-2 processes each frame of image in the continuous video stream 7-1 through an LSTM network, and initial pose information 7-3 of each frame of image is output according to the correlation between each frame of image; meanwhile, the image processing model 7-2 carries out semantic segmentation processing on each frame of image through a semantic segmentation network, and outputs semantic information 7-4 of each frame of image.
Further, the electronic device may obtain a scene map 7-5 of a current scene from the cloud server.
Here, the electronic device may re-project the three-dimensional point cloud set matched with the initial pose information 7-3 of the current image frame in the scene map 7-5 into the acquired image coordinate system, and compare the similarity between the semantic information 7-4 of the current image frame and the semantic information of the re-projected three-dimensional point cloud set, to obtain a semantic consistency weight value 7-6 for each frame of image. The semantic consistency weight measures the calculation accuracy of the initial pose information of the current image frame: the higher the semantic consistency, the higher the accuracy.
Screening the related information of each frame of image through the semantic consistency weight value of the image; specifically, when the calculated semantic consistency weight value of the current image frame is large (for example, greater than 0.8), the accuracy of the initial pose information 7-3 representing the image frame is high, so that the related information corresponding to the image frame can be transferred to the next frame image processing process, and when the pose information of the next frame image is calculated, the pose information to be adjusted 7-7 of the next frame image can be determined based on the related information of the current frame and the initial pose information of the next frame.
When the semantic consistency weight value of the current image frame is small (for example, less than 0.8), the initial pose information 7-3 representing the image frame has low precision, and the relevant information corresponding to the image frame is not transmitted. Therefore, as time is accumulated, the related information of the image corresponding to the high-precision pose is continuously transmitted, and the calculation precision of the pose can be improved through the semantic consistency weight.
After the pose information to be adjusted of the image is obtained, in order to further improve the positioning result, the electronic equipment can match the feature points corresponding to the image frame in the scene map according to the pose result of the image with the larger semantic consistency weight value in the continuous video stream 7-1 with the feature points extracted from the current image frame, and then obtain the target pose information 7-8 corresponding to the current frame through the PnP nonlinear optimization, so that the pose calculation accuracy can be further improved.
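The weight-gated propagation of related information across the video stream can be sketched as follows; the per-frame record layout is hypothetical, and 0.8 is only the example threshold given in the text:

```python
WEIGHT_THRESHOLD = 0.8                 # example value from the text

def propagate_related_info(frames, threshold=WEIGHT_THRESHOLD):
    # `frames` is assumed to be an iterable of dicts with precomputed
    # "initial_pose", "weight" and "related_info" entries (hypothetical).
    carried = None                     # related info of the last good frame
    adjusted = []
    for f in frames:
        adjusted.append((f["initial_pose"], carried))
        if f["weight"] > threshold:      # high-precision pose: pass its
            carried = f["related_info"]  # related information forward
        # low-weight frames do not overwrite the carried information
    return adjusted
```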
Based on the foregoing embodiments, embodiments of the present application provide a positioning device, where the positioning device is applied to an electronic device; as shown in fig. 8, the apparatus includes:
an acquiring unit 801, configured to acquire an image to be positioned;
the first processing unit 802 is configured to perform pose identification processing on the image to be positioned to obtain initial pose information of the electronic device when the image to be positioned is acquired;
the determining unit 803 is configured to determine a semantic consistency weight value of a reference image corresponding to the image to be located; the reference image refers to an image which is acquired by the electronic equipment and adjacent to the image to be positioned;
the semantic consistency weight value is used for representing the similarity between the reference semantic information of the reference image and the first semantic information of the first image; the first image is an image matched with the reference image in a scene map of a scene where the electronic equipment is located;
the second processing unit 804 is configured to determine, based on the relevant information of the reference image and the initial pose information, target pose information of the electronic device when the image to be positioned is acquired, if the semantic consistency weight value of the reference image is greater than a weight threshold.
In some embodiments, the acquiring unit 801 is configured to acquire reference pose information and reference semantic information corresponding to the reference image;
the determining unit 803 is configured to determine, from the scene map, a first image that matches the reference pose information, and acquire first semantic information corresponding to the first image; and calculate the similarity between the reference semantic information and the first semantic information to obtain the semantic consistency weight value of the reference image.
In some embodiments, the scene map is a three-dimensional point cloud map,
the determining unit 803 is configured to find, from the scene map, a first three-dimensional point cloud set matching the reference pose information, and project the points in the first three-dimensional point cloud set into the coordinate system of the reference image to obtain the first image.
In some embodiments, the acquiring unit 801 is configured to perform pose identification processing and semantic segmentation processing on the reference image through an image processing model to obtain reference pose information of the electronic device when the reference image is acquired and reference semantic information of the reference image; the image processing model is used to determine, for an image, the pose information of the electronic device when the image was acquired and the semantic information of the image.
In some embodiments, the image processing model includes a convolutional layer, a temporal recurrent neural network, and a semantic segmentation network;
the acquiring unit 801 is configured to perform feature extraction on the reference image through the convolutional layer to obtain reference image features; perform pose identification processing on the reference image features through the temporal recurrent neural network, based on the related information of the frame preceding the reference image, to obtain reference pose information of the electronic device when the reference image is acquired; and perform semantic segmentation processing on the reference image features through the semantic segmentation network to obtain reference semantic information of the reference image.
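A minimal PyTorch sketch of such a three-part model: shared convolutional features, a recurrent pose head whose hidden state plays the role of the "related information", and a segmentation head. Every layer size here is an illustrative assumption, not a disclosed architecture:

```python
import torch
import torch.nn as nn

class PoseSemanticNet(nn.Module):
    """Hypothetical sketch: convolutional layer + temporal recurrent
    network + semantic segmentation network."""
    def __init__(self, num_classes=21, feat_dim=256, hidden=512):
        super().__init__()
        self.encoder = nn.Sequential(  # convolutional feature extraction
            nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)))
        self.rnn = nn.LSTM(feat_dim * 64, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, 6)  # 6-DoF pose (3 translation + 3 rotation)
        self.seg_head = nn.Sequential(         # coarse segmentation head (upsampling is illustrative)
            nn.Conv2d(feat_dim, num_classes, 1),
            nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False))

    def forward(self, frame, state=None):
        f = self.encoder(frame)                 # (B, feat_dim, 8, 8)
        seg = self.seg_head(f)                  # per-pixel class scores
        # The LSTM state carries the previous frame's related information.
        h, state = self.rnn(f.flatten(1).unsqueeze(1), state)
        pose = self.pose_head(h.squeeze(1))
        return pose, seg, state
```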
In some embodiments, the image processing model is obtained by training an image processing model to be trained, where the image processing model to be trained includes: a convolutional layer to be trained, a temporal recurrent neural network to be trained, and a semantic segmentation network to be trained;
the positioning device further comprises a model training unit;
the model training unit is configured to acquire multiple frames of continuously captured sample images; input the sample images into the convolutional layer to be trained for feature extraction, in the temporal order in which they were acquired, to obtain a plurality of sample image features; determine a plurality of pieces of predicted pose information corresponding to the sample image features through the temporal recurrent neural network to be trained; determine a plurality of pieces of predicted semantic information corresponding to the sample image features through the semantic segmentation network to be trained; and adjust the network parameters of the image processing model to be trained based on the loss of the predicted pose information and the loss of the predicted semantic information, so that the loss of the model's output meets a convergence condition.
In some embodiments, the model training unit is further configured to perform pose recognition processing on the sample image features corresponding to the i-th frame sample image through the temporal recurrent neural network to be trained, based on the related information of the (i-1)-th frame sample image, and to determine the predicted pose information corresponding to the i-th frame sample image and the related information of the i-th frame sample image, where i is an integer greater than 1 and less than or equal to N;
and, based on the related information of the i-th frame sample image, to continue pose recognition processing on the sample image features corresponding to the (i+1)-th frame sample image through the temporal recurrent neural network to be trained, determining the predicted pose information corresponding to the (i+1)-th frame sample image, until i equals N-1, so that predicted pose information is obtained for each of the plurality of sample image features; N is the total number of sample images.
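A hedged sketch of this training procedure: frames are processed in capture order, the recurrent state of frame i feeds frame i+1, and the pose and segmentation losses are summed. The loss weights and the `PoseSemanticNet` interface are assumptions carried over from the sketch above:

```python
import torch.nn.functional as F

def train_step(model, frames, gt_poses, gt_masks, optimizer, pose_w=1.0, seg_w=0.5):
    """frames: list of (B,3,H,W) tensors in temporal order; gt_poses: list of
    (B,6) tensors; gt_masks: list of (B,h,w) label maps at the segmentation
    output resolution."""
    state, loss = None, 0.0
    for frame, gt_pose, gt_mask in zip(frames, gt_poses, gt_masks):
        pose, seg, state = model(frame, state)          # state = related info of the previous frame
        loss = loss + pose_w * F.mse_loss(pose, gt_pose) \
                    + seg_w * F.cross_entropy(seg, gt_mask)
    optimizer.zero_grad()
    loss.backward()   # joint pose + semantic loss drives the parameter update
    optimizer.step()
    return float(loss)
```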
In some embodiments, the first processing unit 802 is configured to perform pose identification processing on the image to be positioned through the image processing model, so as to obtain initial pose information of the electronic device when the image to be positioned is acquired.
In some embodiments, the second processing unit 804 is configured to obtain pose information to be adjusted based on the related information of the reference image and the initial pose information; acquire a second image with the largest semantic consistency weight value from M frames of images adjacent to the image to be positioned, where M is an integer greater than 1; acquire, from the scene map, a second three-dimensional point cloud set matching the second pose information corresponding to the second image;
perform feature point matching between the second three-dimensional point cloud set and the image to be positioned to obtain matched feature point pairs; and construct a target optimization equation based on the matched feature point pairs, and optimize the pose information to be adjusted based on the optimization equation to obtain the target pose information.
In some embodiments, the positioning device further includes a scene map construction unit, configured to acquire multiple frames of scene images through an image acquisition device; construct a three-dimensional point cloud map of the scene based on the scene images; perform semantic segmentation processing on the scene images to obtain scene semantic information corresponding to each frame of scene image; and semantically label the three-dimensional objects in the three-dimensional point cloud map based on the scene semantic information corresponding to each frame of scene image, to obtain the scene map.
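For illustration, the semantic labeling of the reconstructed map could be done by majority vote over the frames observing each point. This sketch assumes the reconstruction (e.g. SfM or SLAM, not specified here) supplies the per-point observations; all names are hypothetical:

```python
import numpy as np
from collections import Counter

def label_point_cloud(points_3d, observations, semantic_maps):
    """observations: dict mapping point index -> list of (frame_id, u, v)
    pixels observing that point; semantic_maps: frame_id -> (H, W) label map
    from per-frame semantic segmentation. Returns one class id per point."""
    labels = np.zeros(len(points_3d), dtype=np.int64)
    for idx, obs in observations.items():
        votes = Counter(int(semantic_maps[fid][v, u]) for fid, u, v in obs)
        labels[idx] = votes.most_common(1)[0][0]  # majority semantic label
    return labels
```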
In some embodiments, the related information of the reference image comprises: reference pose information of the reference image, and/or key feature points of the reference image.
Correspondingly, an embodiment of the present disclosure provides an electronic device. Fig. 9 is a schematic structural diagram of the electronic device in the embodiment of the present disclosure; as shown in fig. 9, the electronic device 90 includes: a memory 901 for storing a computer program; and a processor 902 configured to implement, when executing the computer program stored in the memory 901, the steps of the positioning method provided in the foregoing embodiments.
The electronic device 90 further includes a communication bus 903, which provides connection and communication among these components.
The above description of the electronic device and storage medium embodiments is similar to the description of the method embodiments above, with similar beneficial effects. For technical details not disclosed in the electronic device and storage medium embodiments of the present disclosure, reference is made to the description of the method embodiments of the present disclosure.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in various embodiments of the present disclosure, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present disclosure. The above-mentioned serial numbers of the embodiments of the present disclosure are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present disclosure, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiments of the present disclosure.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist separately as one unit, or two or more units may be integrated into one unit; the integrated unit can be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that all or part of the steps for implementing the method embodiments can be completed by hardware under the control of program instructions; the program can be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; the aforementioned storage medium includes various media that can store program code, such as a removable memory device, a Read-Only Memory (ROM), a magnetic disk, or an optical disk. Alternatively, the integrated unit of the present disclosure may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present disclosure may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present disclosure. The aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other media that can store program code.
The above description is only for the specific embodiments of the present disclosure, but the scope of the present disclosure is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present disclosure, and all the changes or substitutions should be covered within the scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.
Claims (14)
1. A positioning method, applied to an electronic device, the method comprising:
acquiring an image to be positioned, and performing pose identification processing on the image to be positioned to obtain initial pose information of the electronic equipment when the image to be positioned is acquired;
determining a semantic consistency weight value of a reference image corresponding to the image to be positioned; wherein the reference image is an image acquired by the electronic equipment that is adjacent to the image to be positioned;
the semantic consistency weight value is used for representing the similarity between the reference semantic information of the reference image and the first semantic information of the first image; the first image is an image matched with the reference image in a scene map of a scene where the electronic equipment is located;
and if the semantic consistency weight value of the reference image is greater than a weight threshold, determining target pose information of the electronic equipment when the image to be positioned is acquired based on the relevant information of the reference image and the initial pose information.
2. The method according to claim 1, wherein the determining the semantic consistency weight value of the reference image corresponding to the image to be positioned comprises:
acquiring reference pose information and reference semantic information corresponding to the reference image;
determining a first image matched with the reference pose information from the scene map, and acquiring first semantic information corresponding to the first image;
and calculating the similarity between the reference semantic information and the first semantic information to obtain a semantic consistency weight value of the reference image.
3. The method of claim 2, wherein the scene map is a three-dimensional point cloud map, and wherein determining the first image from the scene map that matches the reference pose information comprises:
searching the scene map for a first three-dimensional point cloud set that matches the reference pose information;
and projecting the points in the first three-dimensional point cloud set to a coordinate system of the reference image to obtain the first image.
4. The method according to claim 2 or 3, wherein the acquiring reference pose information and reference semantic information of the reference image comprises:
performing pose recognition processing and semantic segmentation processing on the reference image through an image processing model to obtain reference pose information of the electronic equipment when the reference image is acquired and reference semantic information of the reference image;
the image processing model is used for determining pose information of the electronic equipment and semantic information of the image when the image is acquired.
5. The method of claim 4, wherein the image processing model comprises a convolutional layer, a temporal recurrent neural network, and a semantic segmentation network;
and wherein the performing pose recognition processing and semantic segmentation processing on the reference image through the image processing model to obtain the reference pose information of the electronic equipment when the reference image is acquired and the reference semantic information of the reference image comprises:
performing feature extraction on the reference image through the convolutional layer to obtain reference image features of the reference image;
based on the related information of the frame preceding the reference image, performing pose recognition processing on the reference image features through the temporal recurrent neural network to obtain reference pose information of the electronic equipment when the reference image is acquired;
and performing semantic segmentation processing on the reference image features through the semantic segmentation network to obtain reference semantic information of the reference image.
6. The method according to claim 4 or 5, wherein the image processing model is obtained by training an image processing model to be trained, and the image processing model to be trained comprises: a convolutional layer to be trained, a temporal recurrent neural network to be trained, and a semantic segmentation network to be trained; wherein,
the training process of the image processing model comprises the following steps:
acquiring multiple frames of continuously captured sample images;
inputting the sample images into the convolutional layer to be trained for feature extraction, in the temporal order in which they were acquired, to obtain a plurality of sample image features;
determining a plurality of pieces of predicted pose information corresponding to the plurality of sample image features through the temporal recurrent neural network to be trained;
determining a plurality of pieces of predicted semantic information corresponding to the plurality of sample image features through the semantic segmentation network to be trained;
and adjusting the network parameters of the image processing model to be trained based on the loss of the plurality of pieces of predicted pose information and the loss of the plurality of pieces of predicted semantic information, so that the loss of the output of the image processing model meets a convergence condition.
7. The method according to claim 6, wherein the determining, through the temporal recurrent neural network to be trained, a plurality of pieces of predicted pose information corresponding to the plurality of sample image features comprises:
based on the related information of the (i-1)-th frame sample image, performing pose recognition processing on the sample image features corresponding to the i-th frame sample image through the temporal recurrent neural network to be trained, and determining the predicted pose information corresponding to the i-th frame sample image and the related information of the i-th frame sample image; wherein i is an integer greater than 1 and less than or equal to N;
based on the related information of the i-th frame sample image, continuing to perform pose recognition processing on the sample image features corresponding to the (i+1)-th frame sample image through the temporal recurrent neural network to be trained, and determining the predicted pose information corresponding to the (i+1)-th frame sample image, until i is equal to N-1, so that predicted pose information corresponding to each of the plurality of sample image features is obtained; N is the total number of the sample images.
8. The method according to any one of claims 4 to 7, wherein the performing pose identification processing on the image to be positioned to obtain initial pose information of the electronic device when the image to be positioned is acquired comprises:
performing pose identification processing on the image to be positioned through the image processing model to obtain initial pose information of the electronic device when the image to be positioned is acquired.
9. The method according to any one of claims 1 to 8, wherein the determining, based on the related information of the reference image and the initial pose information, target pose information of the electronic device when the image to be positioned is acquired comprises:
obtaining pose information to be adjusted based on the related information of the reference image and the initial pose information;
acquiring a second image with the largest semantic consistency weight value from M frames of images adjacent to the image to be positioned; m is an integer greater than 1;
acquiring a second three-dimensional point cloud set matched with second pose information corresponding to the second image from the scene map;
performing feature point matching processing on the second three-dimensional point cloud set and the image to be positioned to obtain matched feature point pairs;
and constructing a target optimization equation based on the matched feature point pairs, and optimizing the pose information to be adjusted based on the optimization equation to obtain the target pose information.
10. The method according to any of claims 1-9, wherein prior to said acquiring an image to be located, the method further comprises:
acquiring multiple frames of scene images through an image acquisition device;
constructing a three-dimensional point cloud map of the scene based on the multiple frames of scene images;
performing semantic segmentation processing on the scene images to obtain scene semantic information corresponding to each frame of scene image;
and performing semantic labeling on the three-dimensional objects in the three-dimensional point cloud map based on the scene semantic information corresponding to each frame of scene image, to obtain the scene map.
11. The method according to any one of claims 1 to 10, wherein the related information of the reference image comprises:
reference pose information of the reference image, and/or key feature points of the reference image.
12. A positioning device, applied to an electronic device, the device comprising:
the acquisition unit is used for acquiring an image to be positioned;
the first processing unit is used for carrying out pose identification processing on the image to be positioned to obtain initial pose information of the electronic equipment when the image to be positioned is collected;
the determining unit is used for determining a semantic consistency weight value of a reference image corresponding to the image to be positioned; the reference image is an image acquired by the electronic equipment that is adjacent to the image to be positioned;
the semantic consistency weight value is used for representing the similarity between the reference semantic information of the reference image and the first semantic information of the first image; the first image is an image matched with the reference image in a scene map of a scene where the electronic equipment is located;
and the second processing unit is used for determining target pose information of the electronic equipment when the image to be positioned is acquired based on the relevant information of the reference image and the initial pose information if the semantic consistency weight value of the reference image is greater than a weight threshold value.
13. An electronic device, characterized in that the electronic device comprises a processor and a memory for storing a computer program executable on the processor;
the processor and the memory are connected through a communication bus;
wherein the processor, when executing the computer program stored in the memory, performs the steps of the positioning method of any of claims 1 to 11.
14. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, carries out the steps of the positioning method according to any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010742468.3A CN111862213A (en) | 2020-07-29 | 2020-07-29 | Positioning method and device, electronic equipment and computer readable storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010742468.3A CN111862213A (en) | 2020-07-29 | 2020-07-29 | Positioning method and device, electronic equipment and computer readable storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111862213A true CN111862213A (en) | 2020-10-30 |
Family
ID=72944935
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010742468.3A Pending CN111862213A (en) | 2020-07-29 | 2020-07-29 | Positioning method and device, electronic equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111862213A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200005487A1 (en) * | 2018-06-28 | 2020-01-02 | Ubtech Robotics Corp Ltd | Positioning method and robot using the same |
CN110246163A (en) * | 2019-05-17 | 2019-09-17 | 联想(上海)信息技术有限公司 | Image processing method and its device, equipment, computer storage medium |
CN111429517A (en) * | 2020-03-23 | 2020-07-17 | Oppo广东移动通信有限公司 | Relocation method, relocation device, storage medium and electronic device |
Non-Patent Citations (1)
Title |
---|
MA Lin et al.: "A Visual-Depth Map Construction Method Based on Image Key Frames", Journal of Harbin Institute of Technology, vol. 50, no. 11, 30 November 2018 (2018-11-30), pages 23-31 *
Cited By (22)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220076448A1 (en) * | 2020-09-08 | 2022-03-10 | Samsung Electronics Co., Ltd. | Method and apparatus for pose identification |
US12051221B2 (en) * | 2020-09-08 | 2024-07-30 | Samsung Electronics Co., Ltd. | Method and apparatus for pose identification |
CN112288816B (en) * | 2020-11-16 | 2024-05-17 | Oppo广东移动通信有限公司 | Pose optimization method, pose optimization device, storage medium and electronic equipment |
CN112288816A (en) * | 2020-11-16 | 2021-01-29 | Oppo广东移动通信有限公司 | Pose optimization method, pose optimization device, storage medium and electronic equipment |
CN114812381A (en) * | 2021-01-28 | 2022-07-29 | 华为技术有限公司 | Electronic equipment positioning method and electronic equipment |
CN112699884A (en) * | 2021-01-29 | 2021-04-23 | 深圳市慧鲤科技有限公司 | Positioning method, positioning device, electronic equipment and storage medium |
CN112907663A (en) * | 2021-02-03 | 2021-06-04 | 阿里巴巴集团控股有限公司 | Positioning method, computer program product, device and system |
CN113015018B (en) * | 2021-02-26 | 2023-12-19 | 上海商汤智能科技有限公司 | Bullet screen information display method, bullet screen information display device, bullet screen information display system, electronic equipment and storage medium |
CN113015018A (en) * | 2021-02-26 | 2021-06-22 | 上海商汤智能科技有限公司 | Bullet screen information display method, device and system, electronic equipment and storage medium |
CN112927260A (en) * | 2021-02-26 | 2021-06-08 | 商汤集团有限公司 | Pose generation method and device, computer equipment and storage medium |
CN112927260B (en) * | 2021-02-26 | 2024-04-16 | 商汤集团有限公司 | Pose generation method and device, computer equipment and storage medium |
CN112927291A (en) * | 2021-03-03 | 2021-06-08 | 联想(北京)有限公司 | Pose determination method and device of three-dimensional object, electronic equipment and storage medium |
CN112927291B (en) * | 2021-03-03 | 2024-03-01 | 联想(北京)有限公司 | Pose determining method and device of three-dimensional object, electronic equipment and storage medium |
CN113936085A (en) * | 2021-12-17 | 2022-01-14 | 荣耀终端有限公司 | Three-dimensional reconstruction method and device |
CN114692720A (en) * | 2022-02-25 | 2022-07-01 | 广州文远知行科技有限公司 | Image classification method, device, equipment and storage medium based on aerial view |
CN115471731B (en) * | 2022-08-23 | 2024-04-09 | 北京有竹居网络技术有限公司 | Image processing method, device, storage medium and equipment |
CN115471731A (en) * | 2022-08-23 | 2022-12-13 | 北京有竹居网络技术有限公司 | Image processing method, image processing apparatus, storage medium, and device |
CN115661701A (en) * | 2022-10-09 | 2023-01-31 | 中国科学院半导体研究所 | Real-time image processing method and device, electronic equipment and readable storage medium |
CN115294204A (en) * | 2022-10-10 | 2022-11-04 | 浙江光珀智能科技有限公司 | Outdoor target positioning method and system |
CN115775325A (en) * | 2023-01-29 | 2023-03-10 | 摩尔线程智能科技(北京)有限责任公司 | Pose determination method and device, electronic equipment and storage medium |
CN117115220A (en) * | 2023-08-31 | 2023-11-24 | 阿里巴巴达摩院(杭州)科技有限公司 | Image processing method, service providing method, device, equipment and storage medium |
CN117115220B (en) * | 2023-08-31 | 2024-04-26 | 阿里巴巴达摩院(杭州)科技有限公司 | Image processing method, service providing method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111862213A (en) | Positioning method and device, electronic equipment and computer readable storage medium | |
CN113393522B (en) | 6D pose estimation method based on monocular RGB camera regression depth information | |
CN109658445A (en) | Network training method, increment build drawing method, localization method, device and equipment | |
CN111652934A (en) | Positioning method, map construction method, device, equipment and storage medium | |
CN113378770B (en) | Gesture recognition method, device, equipment and storage medium | |
CN114088081B (en) | Map construction method for accurate positioning based on multistage joint optimization | |
CN115512251A (en) | Unmanned aerial vehicle low-illumination target tracking method based on double-branch progressive feature enhancement | |
CN108537844A (en) | A kind of vision SLAM winding detection methods of fusion geological information | |
CN106407978B (en) | Method for detecting salient object in unconstrained video by combining similarity degree | |
CN113901931B (en) | Behavior recognition method of infrared and visible light video based on knowledge distillation model | |
CN114332977A (en) | Key point detection method and device, electronic equipment and storage medium | |
CN113112547A (en) | Robot, repositioning method thereof, positioning device and storage medium | |
CN115471748A (en) | Monocular vision SLAM method oriented to dynamic environment | |
CN118038494A (en) | Cross-modal pedestrian re-identification method for damage scene robustness | |
CN117456597A (en) | Skeleton behavior recognition method based on cross-dimensional interaction attention mechanical drawing convolution | |
CN114708321B (en) | Semantic-based camera pose estimation method and system | |
Lee et al. | Camera pose estimation using voxel-based features for autonomous vehicle localization tracking | |
CN116824686A (en) | Action recognition method and related device | |
CN111724438B (en) | Data processing method and device | |
Sun et al. | Accurate deep direct geo-localization from ground imagery and phone-grade gps | |
Liu et al. | Visual odometry based on semantic supervision | |
CN117115238B (en) | Pose determining method, electronic equipment and storage medium | |
CN117576165B (en) | Ship multi-target tracking method and device, electronic equipment and storage medium | |
Wang et al. | Dense 3D mapping for indoor environment based on kinect-style depth cameras | |
CN114625984B (en) | Point-of-interest verification method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||