WO2020238790A1 - Camera positioning - Google Patents

Camera positioning

Info

Publication number
WO2020238790A1
WO2020238790A1 (PCT/CN2020/091768, CN2020091768W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
camera
processed
pose
pixel
Prior art date
Application number
PCT/CN2020/091768
Other languages
French (fr)
Chinese (zh)
Inventor
鲍虎军 (Hujun Bao)
章国锋 (Guofeng Zhang)
黄昭阳 (Zhaoyang Huang)
许龑 (Yan Xu)
Original Assignee
浙江商汤科技开发有限公司 (Zhejiang SenseTime Technology Development Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 浙江商汤科技开发有限公司 (Zhejiang SenseTime Technology Development Co., Ltd.)
Priority: JP2021534170A (published as JP2022513868A); KR1020217019918A (published as KR20210095925A)
Published as WO2020238790A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • G06T7/74Determining position or orientation of objects or cameras using feature-based methods involving reference images or patches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/80Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20076Probabilistic image processing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Definitions

  • the present disclosure relates to the field of computer vision, in particular to a camera positioning method and device, and storage medium.
  • Visual positioning has a wide range of applications. In the actual application environment, factors such as object movement may affect the accuracy of visual positioning, and even directly cause visual positioning to fail.
  • the present disclosure provides a camera positioning method, device, and storage medium.
  • a camera positioning method including:
  • the absolute pose of the camera that collects the image to be processed in the world coordinate system is determined according to the target image.
  • a camera positioning device including:
  • the acquisition module is configured to acquire the prior probability of a movable object appearing at each of the multiple pixels included in the image template;
  • the execution module is configured to perform an operation of discarding some pixels for an image to be processed that is as large as the image template according to the prior probability to obtain a target image;
  • the positioning module is configured to determine, according to the target image, the absolute pose of the camera that collects the image to be processed in the world coordinate system.
  • a computer-readable storage medium stores a computer program, and the computer program is used to execute the camera positioning method described in the first aspect.
  • a camera positioning device comprising: a processor; and a memory for storing executable instructions of the processor.
  • the processor is configured to call executable instructions stored in the memory to implement the camera positioning method described in the first aspect.
  • the prior probability that a movable object appears at each of the multiple pixels included in the image template can be obtained first, and some pixels of an image to be processed that is as large as the image template are then discarded based on the prior probability.
  • Fig. 1 is a flowchart of a camera positioning method according to an exemplary embodiment of the present disclosure
  • Fig. 2 is a flowchart of step 110 according to an exemplary embodiment of the present disclosure
  • Fig. 3 is a schematic diagram showing an image template according to an exemplary embodiment of the present disclosure
  • Fig. 4 is a flowchart showing step 120 according to an exemplary embodiment of the present disclosure.
  • Fig. 5 is a flowchart of a camera positioning method according to another exemplary embodiment of the present disclosure.
  • Fig. 6 is a flowchart of step 150 according to an exemplary embodiment of the present disclosure.
  • Fig. 7 is a schematic diagram showing multiple absolute poses according to an exemplary embodiment of the present disclosure.
  • Fig. 8 is a schematic diagram showing a process of determining and correcting a pose according to an exemplary embodiment of the present disclosure
  • Fig. 9 is a schematic diagram showing an optimized pose graph according to an exemplary embodiment of the present disclosure.
  • Fig. 10 is a flowchart of a camera positioning method according to another exemplary embodiment of the present disclosure.
  • Fig. 11 is a flowchart showing step 230 according to an exemplary embodiment of the present disclosure.
  • 12A to 12B are schematic diagrams showing a self-attention mechanism according to an exemplary embodiment of the present disclosure
  • Fig. 13A is a schematic diagram showing an image to be processed according to an exemplary embodiment of the present disclosure
  • Fig. 13B is a schematic diagram showing a feature extraction image after weight value adjustment according to an exemplary embodiment of the present disclosure
  • Fig. 14 is a flowchart of a camera positioning method according to another exemplary embodiment of the present disclosure.
  • Fig. 15 is a frame diagram of a target neural network according to an exemplary embodiment of the present disclosure.
  • Fig. 16 is a block diagram showing a camera positioning device according to an exemplary embodiment of the present disclosure.
  • Fig. 17 is a block diagram showing an obtaining module according to an exemplary embodiment of the present disclosure.
  • Fig. 18 is a block diagram showing an execution module according to an exemplary embodiment of the present disclosure.
  • Fig. 19 is a block diagram showing a positioning module according to an exemplary embodiment of the present disclosure.
  • Fig. 20 is a block diagram showing a camera positioning device according to another exemplary embodiment of the present disclosure.
  • Fig. 21 is a block diagram showing a second determining module according to an exemplary embodiment of the present disclosure.
  • Fig. 22 is a block diagram showing a camera positioning device according to another exemplary embodiment of the present disclosure.
  • Fig. 23 is a block diagram showing an obtaining module according to an exemplary embodiment of the present disclosure.
  • Fig. 24 is a block diagram showing an execution module according to an exemplary embodiment of the present disclosure.
  • Fig. 25 is a block diagram showing a positioning module according to an exemplary embodiment of the present disclosure.
  • Fig. 26 is a block diagram showing a camera positioning device according to another exemplary embodiment of the present disclosure.
  • Fig. 27 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.
  • although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited by these terms. These terms are only used to distinguish the same type of information from each other.
  • first information may also be referred to as second information, and similarly, the second information may also be referred to as first information.
  • the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
  • the embodiments of the present disclosure provide a camera positioning method which, according to the prior probability of a movable object appearing at each of the multiple pixels included in the image template, discards some pixels of an image to be processed that is as large as the image template to obtain a target image, and then determines the absolute pose of the camera according to the target image. This reduces the influence of the movement of objects in the scene captured by the camera on the positioning result, and improves the accuracy and precision of camera positioning.
  • the camera positioning method provided by the embodiments of the present disclosure can be applied to movable machinery and equipment, to position a camera provided on that machinery and equipment.
  • Movable machinery and equipment include, but are not limited to, drones, unmanned vehicles, and robots with cameras.
  • the accuracy of camera positioning can improve the accuracy with which mobile machinery and equipment perform various tasks. For example, from the image of the environment ahead of the vehicle collected by the camera installed on an unmanned vehicle, the current location information of the camera can be determined, and the current location of the vehicle can be derived from the camera's location, so that the unmanned vehicle can perform at least one intelligent driving control task, such as path planning, trajectory tracking, or collision warning.
  • the camera positioning method provided by the embodiment of the present disclosure may include the following steps 110-130:
  • step 110 the prior probability of a movable object appearing at each of the multiple pixels included in the image template is obtained.
  • the image template may be a template that corresponds to the current scene and records the prior probability of a movable object appearing at each of multiple pixels of an image as large as the template.
  • Movable objects include, but are not limited to, various objects that can move on their own or under control, such as buses, cars, people, bicycles, trucks, motorcycles, animals, etc.
  • the prior probability refers to the probability that each pixel on the image is a movable object obtained by analyzing the image that is the same or similar to the current scene collected in the past.
  • if the prior probability corresponding to a certain pixel is high, a movable object is likely to appear at that pixel in images collected in the scene; conversely, if the prior probability corresponding to a certain pixel is low, a movable object is unlikely to appear at that pixel in images collected in the scene.
  • the image template can reflect the a priori possibility of movable objects appearing at different pixels in the collected image.
  • for a collection of images collected in a scene that is the same as or similar to the current scene, the probability of a movable object appearing at each pixel of each image in the collection can be analyzed, and this probability can be used as the prior probability, recorded in the image template corresponding to the current scene, that a movable object appears at each pixel.
  • for example, if the current scene is the main street of a city, the image collection collected in the same or similar scene may include at least one image of that main street.
  • step 120 an operation of discarding some pixels is performed on an image to be processed that is as large as the image template according to the prior probability to obtain a target image.
  • the image to be processed may be at least one image collected by a camera provided on the movable machine equipment during the movement of the movable machine equipment.
  • according to the prior probability corresponding to each pixel of the image template for the current scene, the mobile machinery and equipment can perform the operation of discarding some pixels on at least one image, as large as the image template, collected by the camera set on the machinery, to obtain the target image.
  • the operation of discarding some pixels includes, but is not limited to, discarding all pixels of the collected image whose sampled prior probability is greater than a preset value, or randomly discarding a part of those pixels.
  • step 130 the absolute pose of the camera collecting the image to be processed in the world coordinate system is determined according to the target image.
  • the mobile machine equipment can determine the absolute pose of the camera in the world coordinate system according to the target image, using a regression loss function.
  • the regression loss function can be a mean square error loss function (such as an L2 loss), a mean absolute error loss function (such as an L1 loss), a smooth mean absolute error loss function (such as a Huber loss), a log-cosh (logarithmic hyperbolic cosine) loss function, a quantile loss function, etc.
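As an illustrative sketch only (the patent gives no code), the smooth mean absolute error (Huber) loss named above can be written as follows; the function name and `delta` parameter are assumptions:

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Smooth mean-absolute-error (Huber) loss for pose regression:
    quadratic like L2 for residuals below delta, linear like L1 above
    it, which keeps outlier frames from dominating the loss."""
    r = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    quadratic = 0.5 * r ** 2            # L2 branch near zero
    linear = delta * (r - 0.5 * delta)  # L1 branch for large residuals
    return float(np.where(r <= delta, quadratic, linear).mean())
```

The two branches meet smoothly at |r| = delta, which is why the Huber loss is a common compromise between the L1 and L2 options listed.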
  • the movable machinery and equipment can, in combination with the prior probability that a movable object appears at each of the multiple pixels of the image template corresponding to the current scene, discard some pixels of at least one collected image to obtain the target image, and use the target image to determine the absolute pose of the camera. This effectively reduces the negative impact of the movement of objects in the current scene on camera positioning, and improves the accuracy and precision of camera positioning.
  • for a camera set on movable machine equipment, its pose may change due to factors such as the movement of the machine and/or position adjustment of the camera, so the camera needs to be positioned.
  • the inventors of the present disclosure found that if an object moves within the field of view of the camera, the movement causes poor imaging quality in the corresponding part of the captured image, such as blur and jitter. These poor-quality parts degrade the overall features of the captured image, and in turn the accuracy and precision of camera positioning based on those features. By contrast, immovable or fixed objects in the captured image are genuinely useful for camera positioning.
  • therefore, the embodiments of the present disclosure use prior knowledge to determine the probability (that is, the prior probability) of a movable object appearing at each pixel of the captured image, and discard some pixels based on that probability, for example pixels with a higher prior probability of belonging to moving objects. This reduces the negative impact of those pixels on the overall quality of the image, improves the overall quality of the image after the discarding, and thereby improves the accuracy of positioning.
  • step 110 may be performed by an electronic device, which may be a mobile machine device, or an electronic device for training a neural network, such as a cloud platform, which is not limited in the present disclosure. As shown in Figure 2, step 110 may include steps 111-113:
  • step 111 pixel-level semantic segmentation is performed on each image in a predetermined image set associated with the current scene.
  • the predetermined image set associated with the current scene includes multiple pictures collected in the same or similar scene as the current scene.
  • the electronic device can obtain the pixel-level semantic segmentation result of each image by identifying the content present in each image in the predetermined image set.
  • the predetermined image set associated with the current scene may include images m1, m2,...mN as shown in FIG. 3.
  • step 112 the first pixel belonging to the movable object and the second pixel belonging to the background in each image are determined according to the result of pixel-level semantic segmentation.
  • the background is an immovable object in the image, for example, other objects in the image that are not determined to be movable objects, such as sky, buildings, trees, roads, etc.
  • step 113 the prior probability that a movable object appears at each of the multiple pixels included in the image template, which is the same size as the images in the predetermined image set, is determined based on the statistical distribution of the first pixels and the second pixels on each image in the predetermined image set.
  • based on the statistical distribution of the first pixels (movable objects) and the second pixels (background) in each image of the predetermined image set associated with the current scene, the electronic device obtains an image template corresponding to the current scene, such as the image template M in FIG. 3, which records the prior probability that a movable object appears at each pixel of an image, as large as the template, collected in the current scene.
  • the prior probability of a movable object at each pixel recorded on the image template is a statistical distribution range, rather than a fixed value.
  • when the subsequent operation of discarding some pixels of an image to be processed that is as large as the image template is performed according to the prior probability, different pixels can be discarded each time according to the statistical distribution range of the prior probability, so as to obtain different target images.
  • determining the absolute pose of the camera based on multiple different target images can yield better camera positioning results, especially in large-scale urban traffic scenes.
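Steps 111-113 amount to counting, per pixel, how often segmentation labels that pixel as movable. A minimal numpy sketch, assuming the segmentation results are available as equally sized boolean masks (the function name is illustrative, not the patent's implementation):

```python
import numpy as np

def build_prior_template(masks):
    """Per-pixel prior that a movable object appears.

    masks: list of equally sized H x W arrays, 1/True where pixel-level
    semantic segmentation labelled a movable object (the 'first'
    pixels), 0/False for background (the 'second' pixels).
    Returns (mu, var): the per-pixel expectation and the variance
    mu * (1 - mu) of the template's assumed distribution.
    """
    stack = np.stack([np.asarray(m, dtype=float) for m in masks])
    mu = stack.mean(axis=0)      # fraction of images where pixel is movable
    var = mu * (1.0 - mu)        # variance used as the Gaussian spread
    return mu, var
```

A pixel labelled movable in 2 of 4 images gets mu = 0.5 and var = 0.25; a pixel that is always background gets mu = var = 0.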
  • the prior probability that a movable object appears at each pixel included in the image template may conform to a Gaussian distribution, as shown in Formula 1:

    P(M(i,j)) ~ 𝒩(μ(i,j), σ²(i,j)),  with σ²(i,j) = μ(i,j)(1 − μ(i,j))   (Formula 1)

  • here i denotes the i-th row and j the j-th column of the image template, so (i,j) are the pixel coordinates; μ(i,j) is the mathematical expectation of pixel (i,j); N is the number of pixels; σ²(i,j) is the variance of pixel (i,j); and P(M(i,j)) is the prior probability of pixel (i,j).
  • step 120 may include:
  • step 121 a priori probability corresponding to at least some pixels included in the image to be processed is sampled.
  • the distribution of the prior probability that a movable object appears at each pixel on each to-be-processed image satisfies a Gaussian distribution.
  • the mobile machine and equipment can sample the prior probability corresponding to at least some of the pixels included in the image to be processed, obtaining, for this sampling, the sampled value of the prior probability corresponding to those pixels.
  • step 122 pixel points whose a priori probability sampling value is greater than a preset threshold are removed from the image to be processed to obtain a target image corresponding to this sampling.
  • in the above manner, the mobile machine equipment can remove from the image to be processed all pixels whose sampled prior probability is greater than the preset threshold, or randomly remove a part of those pixels, to obtain the target image corresponding to the current sampling.
  • across different samplings, the sampled values of the prior probability of the same pixel of the same image to be processed can differ, so that the multiple target images obtained after the operation of discarding some pixels differ from one another in at least one pixel.
  • for example, suppose that in the first sampling the sampled prior probability corresponding to pixel 1 of image 1 to be processed is P1, in the second sampling the sampled prior probability corresponding to pixel 1 of image 1 is P2, and the preset threshold is T, with P1 < T < P2. Then the target image obtained after the first sampling retains pixel 1, while the target image obtained after the second sampling removes pixel 1.
  • the mobile equipment can sample the prior probabilities corresponding to the pixels on the same image to be processed multiple times, and accordingly obtain multiple different target images for camera positioning, which is beneficial to guarantee the final result The accuracy of camera positioning.
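One stochastic discard pass (steps 121-122) can be sketched as below, assuming the template stores a per-pixel mean and variance, and that a discarded pixel is zeroed out (the patent does not specify how a removed pixel is represented):

```python
import numpy as np

def discard_pixels(image, mu, var, threshold, rng=None):
    """Sample a prior probability per pixel from N(mu, var) and zero
    out every pixel whose sampled value exceeds `threshold`.  Because
    the sampling is random, repeated calls produce different target
    images from the same image to be processed."""
    rng = np.random.default_rng() if rng is None else rng
    sampled = rng.normal(loc=mu, scale=np.sqrt(var))
    target = np.array(image, dtype=float, copy=True)
    target[sampled > threshold] = 0.0   # discard the pixel
    return target
```

Calling this several times on one image yields the multiple distinct target images used for positioning.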
  • step 130 may include: inputting the to-be-processed image into a target neural network to obtain the absolute pose of the camera in the world coordinate system.
  • the mobile machine equipment can input the image to be processed into the target neural network, and the target neural network directly outputs the absolute pose of the camera that collects the image to be processed in the world coordinate system.
  • the movable machine equipment discards at least some pixels of the image to be processed whose prior probability is greater than a preset value, according to the prior probability that each pixel of the image template belongs to a movable object, thereby improving the accuracy of camera positioning.
  • the image to be processed includes k frames of images (k is an integer greater than or equal to 2) that are acquired by the camera in time sequence; as shown in FIG. 5,
  • the method also includes steps 140-150:
  • step 140 the relative pose of the camera when shooting the k frames of images is determined according to the k frames of images.
  • the movable machine equipment can use the visual odometry method to determine the relative pose of the camera when acquiring the k-th frame image with respect to the (k−1)-th frame image.
  • step 150 the corrected pose of the camera is determined according to the relative pose and absolute pose of the camera.
  • the mobile machine equipment can use the camera's absolute pose in the world coordinate system when acquiring the first image in the sequence of k frames (also referred to as the first frame image) as a reference, and determine the corrected pose of the camera according to the relative pose and the absolute pose of the camera when the second frame image, adjacent to the first frame image, is collected.
  • Subsequent movable machinery and equipment can adjust the pose of the camera according to the corrected pose, thereby reducing the impact of the movement of objects in the scene on the positioning of the camera, which can help ensure the accuracy of the movable machinery and equipment in performing various tasks.
  • step 150 may specifically include steps 151-153:
  • step 151 the deterministic probability of the absolute pose is determined.
  • the deterministic probability is an evaluation of the accuracy of the absolute-pose result: the higher the deterministic probability, the more accurate the absolute pose; the lower it is, the less accurate the absolute pose.
  • the movable machinery and equipment can adopt a random sampling method, such as the Monte Carlo method, to sample multiple times the prior probabilities corresponding to the sequential k frames of images collected by the camera, obtaining multiple sampling results.
  • k is an integer greater than or equal to 2.
  • the current image can be sampled multiple times based on the prior probability of each pixel included in the image template M, and multiple absolute poses corresponding to the current image can be determined based on the target image corresponding to each sampling.
  • the deterministic probability of the absolute pose corresponding to the current image is determined according to the multiple absolute poses corresponding to the current image. For example, if the differences between the absolute poses corresponding to the current image are large, the deterministic probability of the absolute pose is determined to be low; otherwise, it is determined to be high.
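One simple way to turn the spread of the sampled absolute poses into a deterministic probability is an exponential decay of the average standard deviation; this mapping is an illustrative assumption, not the patent's formula:

```python
import numpy as np

def pose_certainty(poses, scale=1.0):
    """Certainty score in (0, 1] from N sampled absolute-pose
    estimates (rows of an N x D array, e.g. translation vectors):
    identical estimates give 1.0, widely scattered estimates give a
    value near 0."""
    poses = np.asarray(poses, dtype=float)
    spread = poses.std(axis=0).mean()   # average per-dimension std-dev
    return float(np.exp(-spread / scale))
```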
  • step 152 the first weight of the relative pose and the second weight of the absolute pose are determined according to the deterministic probability of the absolute pose.
  • the movable machine equipment can determine, according to the deterministic probability of the absolute pose corresponding to each frame of image, the first weight of the relative pose corresponding to that frame and the second weight of the absolute pose corresponding to that frame.
  • if the absolute pose corresponding to the current image has a high deterministic probability, the second weight of the absolute pose corresponding to the current image can be increased; if the absolute pose corresponding to the current image has a low deterministic probability, the first weight of the relative pose corresponding to the current image can be increased.
  • step 153 the corrected pose of the camera is determined according to the relative pose, the first weight, the absolute pose, and the second weight.
  • a sliding window is adopted and moved sequentially over the frames, and the corrected pose of the second frame image relative to the first frame image is determined according to the relative pose corresponding to the second frame image, the first weight, the absolute pose, and the second weight.
  • if the relative pose is more accurate, the weight of the relative pose can be increased; if the absolute pose is more accurate, the weight of the absolute pose can be increased. In this way, by giving the relative pose and the absolute pose different weights when determining the corrected pose, the corrected pose, and hence camera positioning, can be made more accurate.
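A translation-only sketch of the weighted correction (steps 152-153), assuming the second weight is the deterministic probability itself and the first weight is its complement; a full implementation would also blend rotations, e.g. via quaternion interpolation:

```python
import numpy as np

def corrected_pose(prev_pose, relative, absolute, certainty):
    """Blend the odometry prediction (previous pose composed with the
    relative pose) and the regressed absolute pose.  `certainty` in
    [0, 1] is the deterministic probability of the absolute pose: it
    serves as the second weight, and 1 - certainty as the first."""
    predicted = np.asarray(prev_pose, dtype=float) + np.asarray(relative, dtype=float)
    return (1.0 - certainty) * predicted + certainty * np.asarray(absolute, dtype=float)
```

With certainty 1 the corrected pose equals the absolute pose; with certainty 0 it is pure odometry.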
  • the optimized pose map can be shown in Figure 9.
  • the triangles in Figure 9 represent the absolute pose of the camera when collecting each frame of image, the arrowed lines represent the relative poses, and the circle represents the sliding window.
  • the corrected absolute pose and relative pose in Fig. 9 correspond to the absolute pose and relative pose in Fig. 8 from the upper left corner to the lower right corner in sequence according to the arrow direction.
  • the pose determined by the VO (Visual Odometry) method may be used as the relative pose corresponding to the image.
  • the VO method determines the position and orientation of the camera by analyzing the above k frames of images: it estimates the movement of the camera between adjacent frames by performing feature matching on the k frames, thereby obtaining the relative pose of the camera when a frame is collected compared to the previous frame.
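Once the VO front end has estimated frame-to-frame relative poses, they compose into camera poses by matrix chaining. A minimal sketch using 4x4 homogeneous transforms (the representation is an assumption for illustration):

```python
import numpy as np

def chain_relative_poses(start, relatives):
    """Compose per-frame relative transforms into a pose sequence.

    `start` is the 4x4 pose of the first frame; `relatives[k]` is the
    4x4 transform from frame k to frame k+1, as a visual-odometry
    front end would estimate from feature matches between adjacent
    frames.  Returns the list of all chained poses."""
    poses = [np.asarray(start, dtype=float)]
    for rel in relatives:
        poses.append(poses[-1] @ np.asarray(rel, dtype=float))
    return poses
```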
  • the absolute pose and relative pose are combined to perform pose correction, which further improves the accuracy of camera positioning.
  • the camera positioning method provided in the present disclosure can also be applied to electronic devices that train neural networks, such as cloud platforms, neural network training platforms, and so on.
  • the electronic device uses this method to train the neural network to obtain the target neural network. After the image is subsequently input to the target neural network, the absolute pose of the camera that collected the image in the world coordinate system can be obtained.
  • the camera positioning method provided by the embodiment of the present disclosure may include the following steps 210-230:
  • step 210 the prior probability that a movable object appears at each of the multiple pixels included in the image template is obtained.
  • the electronic device can analyze, based on each of the above images, the probability of a movable object appearing at each pixel of each image, and use this probability as the prior probability that a movable object appears at each pixel of an image template that is the same size as each image.
  • step 220 according to the prior probability, an operation of discarding part of pixels is performed on an image to be processed that is as large as the image template to obtain a target image.
  • the image to be processed may be at least one sample image, and the electronic device may perform the operation of discarding some pixels on the at least one sample image according to the prior probability corresponding to each pixel on the image template, so as to obtain the target image.
  • the operation of discarding some pixels includes, but is not limited to, discarding all pixels of at least one sample image whose sampled prior probability is greater than a preset value, or randomly discarding a part of those pixels.
  • step 230 the absolute pose of the camera that collects the image to be processed in the world coordinate system is determined according to the target image.
  • the electronic device can determine the absolute pose of the camera that collects at least one sample image in the world coordinate system through the regression loss function according to the obtained target image.
  • the regression loss function can be a mean square error loss function (such as an L2 loss), a mean absolute error loss function (such as an L1 loss), a smooth mean absolute error loss function (such as a Huber loss), a log-cosh loss function, a quantile loss function, etc.
  • step 210 may be performed by an electronic device that trains a neural network, and the execution process is the same as the execution process of step 110 in FIG. 2, and will not be repeated here.
  • step 220 may be performed by an electronic device that trains a neural network, and the execution process is the same as that of step 120 in FIG. 4, and will not be repeated here.
  • step 230 may be performed by an electronic device that trains a neural network.
  • step 230 may include steps 231-233:
  • step 231 the feature parameters in the target image are extracted through a neural network to obtain a feature extraction image.
  • the neural network can extract feature parameters of each target image from at least one target image, thereby obtaining a feature extraction image corresponding to each target image.
  • step 232, in the preset spatial dimension and/or the preset channel dimension of the neural network, the weight values corresponding to the second pixel points belonging to the background in the feature extraction image are increased.
  • the neural network can increase the weight values of the second pixel points belonging to the background in the feature extraction image in at least one of the preset spatial dimension and the preset channel dimension through a self-attention mechanism.
  • the neural network transforms a feature extraction image of H (height) × W (width) × C (channels) using the spatial self-attention mechanism to obtain an H × W × 1 map that is shared across channels.
  • the neural network transforms a feature extraction image of H × W × C using the channel self-attention mechanism to obtain a 1 × 1 × C map that is shared across the height and width.
  • in this way, the neural network ignores, as far as possible, the information of the first pixel points belonging to the movable object, and pays more attention to the information of the second pixel points belonging to the background.
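A minimal NumPy sketch of the two attention shapes described above (an H × W × 1 spatial map and a 1 × 1 × C channel map) follows. Softmax pooling over mean activations is an assumption; the disclosure only specifies the output shapes:

```python
import numpy as np

def spatial_attention(feat):
    """Collapse an (H, W, C) feature map to an (H, W, 1) attention map
    shared across channels (softmax over spatial positions)."""
    logits = feat.mean(axis=2)                 # (H, W)
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return w[..., None]                        # (H, W, 1)

def channel_attention(feat):
    """Collapse an (H, W, C) feature map to a (1, 1, C) attention map
    shared across spatial positions (softmax over channels)."""
    logits = feat.mean(axis=(0, 1))            # (C,)
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    return w[None, None, :]                    # (1, 1, C)

def reweight(feat):
    """Apply both attention maps so that high-response (background)
    features receive larger weights; shape is preserved."""
    return feat * spatial_attention(feat) * channel_attention(feat)
```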
  • taking the image shown in FIG. 13A as an example, the pixels circled by the dashed box belong to the movable object (an automobile). Through the foregoing step 210, the prior probability that a movable object appears at each of the multiple pixels of an image template the same size as the image shown in FIG. 13A can be obtained; step 220 then discards all or part of the pixels in the image shown in FIG. 13A whose sampled prior probability is greater than the preset threshold, and the image shown in FIG. 13B is obtained.
  • in the image shown in FIG. 13B, the gray values of the retained background pixels are higher than the gray values of the pixels in the parts corresponding to the discarded movable object.
  • the weight values of pixels belonging to immovable objects are increased in the two dimensions, so that the neural network pays more attention to traffic signs, utility poles, and other immovable or rarely moving objects. This reduces the effect that the movement of objects in the scene where the camera collects images has on the result of positioning the camera on the movable machine equipment, improves the accuracy and precision of camera positioning by the neural network, and improves the robustness of the positioning result.
  • step 233 the feature extraction image adjusted by the weight value is analyzed by the neural network to obtain the absolute pose of the camera that collects the image to be processed in the world coordinate system.
  • the neural network can analyze the weight-adjusted feature extraction image through a regression loss function, such as the mean square error function or the absolute value error function, to obtain the absolute pose, in the world coordinate system, of the camera that collects the at least one sample image.
  • the above-mentioned camera positioning method further includes step 240:
  • step 240, the network parameters of the neural network are adjusted according to the difference between the absolute pose and the predetermined ground-truth pose of the camera that collects the image to be processed, so as to obtain the target neural network by training.
  • this step may be performed by an electronic device that trains a neural network.
  • for the camera that acquires the at least one sample image the same size as the image template, the ground-truth pose is known.
  • the electronic device can adjust the network parameters of the neural network according to the difference between the known ground-truth pose and the absolute pose, in the world coordinate system, of the camera that collects the at least one sample image as output by the neural network, so as to minimize the loss function of the neural network and finally train the desired target neural network.
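The parameter adjustment described above can be illustrated with a single gradient-descent step on a toy linear pose regressor (a stand-in for the neural network; all names and shapes here are hypothetical):

```python
import numpy as np

def pose_loss(pred_pose, true_pose):
    """L2 difference between the predicted absolute pose and the known
    ground-truth pose."""
    return np.mean((pred_pose - true_pose) ** 2)

def train_step(weights, features, true_pose, lr=0.1):
    """One gradient-descent update of a linear pose regressor, standing
    in for adjusting the network parameters to minimise the loss."""
    pred = features @ weights
    grad = 2.0 * features.T @ (pred - true_pose) / pred.size
    return weights - lr * grad
```

A single step provably decreases the loss for a small enough learning rate, which is the behaviour the training loop relies on.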
  • the embodiments of the present disclosure are based on the above-mentioned camera positioning method, and also provide a framework diagram of a target neural network.
  • for example, as shown in FIG. 15, the target neural network includes a Probabilistic Dropout Module (partial pixel discarding module), a Feature Extractor Module (feature extraction module), a Self-attention Module, and a Regressor Module (regression module).
  • At least one sample image may be used as the input value of the partial pixel discarding module, and the partial pixel discarding module may be composed of at least five sub-networks connected in sequence.
  • Each sub-network can be implemented separately by using network units, such as a convolutional layer, a ReLU layer, and a pooling layer, set in a preset order.
  • the first sub-network can perform pixel-level semantic segmentation on each image in the at least one sample image; the second sub-network can determine, according to the result of the pixel-level semantic segmentation, the first pixel points belonging to the movable object and the second pixel points belonging to the background in each sample image; the third sub-network can determine an image template the same size as the sample images based on the statistical distribution of the first pixel points and the second pixel points in each sample image.
  • the fourth sub-network may sample the prior probability corresponding to at least some of the pixels included in the at least one sample image to obtain the current sampling result; the fifth sub-network can remove, according to the current sampling result, the pixels in the at least one sample image whose sampled prior probability is greater than the preset threshold T, to obtain the target image.
  • the feature extraction module can be designed by stacking network units such as a convolutional layer, a ReLU layer, and a pooling layer in a preset order according to a preset structure, and extracts the feature parameters in the target image obtained by the Probabilistic Dropout Module to obtain a feature extraction image.
  • the self-attention module can be composed of at least two separate sub-networks: a fifth sub-network and a sixth sub-network.
  • Each sub-network includes network units such as a convolutional layer, a ReLU layer, and a pooling layer set in a preset order.
  • the fifth sub-network can focus on the preset spatial dimension, and the sixth sub-network can focus on the preset channel dimension. After passing through these two sub-networks, the weight values of the second pixel points belonging to the background in the feature extraction image can be adjusted.
  • the embodiment of the present disclosure does not limit the sequence of the fifth sub-network and the sixth sub-network.
  • the regression module may include a seventh sub-network, which may include network units such as a convolutional layer, a ReLU layer, and a pooling layer set in a preset order. The seventh sub-network takes the image output by the self-attention module as its input value and the known pose of the camera that collects the at least one sample image as its output value, and corresponds to a regression loss function.
  • the regression loss function can include a mean square error loss function (e.g., the L2 loss), a mean absolute error loss function (e.g., the L1 loss), a smooth mean absolute error loss function (e.g., the Huber loss), a log-hyperbolic-cosine loss function, a quantile loss function, etc.
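The four modules can be composed sequentially. The following toy sketch only illustrates the data flow of FIG. 15; the trivial stand-in stages below are hypothetical placeholders for the trained sub-networks:

```python
import numpy as np

def dropout_stage(image):     # Probabilistic Dropout Module (stub)
    return image              # a real stage would discard movable-object pixels

def extractor_stage(image):   # Feature Extractor Module (stub)
    return image.reshape(-1)  # flatten to a feature vector

def attention_stage(feat):    # Self-attention Module (stub)
    return feat               # a real stage would reweight background features

def regressor_stage(feat):    # Regressor Module (stub)
    return np.array([feat.mean(), feat.max()])  # toy stand-in "pose"

class TargetNetwork:
    """Composes the four modules named in FIG. 15 in order."""
    def __init__(self, *stages):
        self.stages = stages

    def __call__(self, image):
        out = image
        for stage in self.stages:
            out = stage(out)
        return out  # predicted absolute pose

net = TargetNetwork(dropout_stage, extractor_stage,
                    attention_stage, regressor_stage)
pose = net(np.array([[1.0, 2.0], [3.0, 4.0]]))
```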
  • the target neural network finally obtained reduces its focus on the movable objects in the sample images and pays more attention to the background pixels, that is, the information of immovable or fixed objects. This reduces the impact of the pixel points corresponding to movable objects on the processing of the image as a whole, and improves the robustness of the target neural network.
  • the present disclosure also provides an embodiment of a camera positioning device.
  • the embodiments of the present disclosure also provide a camera positioning device, which can be applied to movable machine equipment. Since the movable machine equipment will move, the pose of the camera set on it will change accordingly; high accuracy of camera positioning can improve the accuracy of the movable machine equipment when performing various tasks.
  • FIG. 16 is a block diagram of a camera positioning device according to an exemplary embodiment of the present disclosure.
  • the device includes: an acquisition module 310, configured to acquire the prior probability that a movable object appears at each of the multiple pixels included in the image template; an execution module 320, configured to perform, according to the prior probability, the operation of discarding some pixels on an image to be processed the same size as the image template, to obtain a target image; and a positioning module 330, configured to determine, according to the target image, the absolute pose, in the world coordinate system, of the camera that collects the image to be processed.
  • the acquisition module 310 includes: a segmentation sub-module 311, configured to perform pixel-level semantic segmentation on each image in a predetermined image set; a first determination sub-module 312, configured to determine, according to the result of the pixel-level semantic segmentation, the first pixel points belonging to the movable object and the second pixel points belonging to the background in each image; and a second determination sub-module 313, configured to determine, based on the statistical distribution of the first pixel points and the second pixel points in each image, the prior probability that a movable object appears at each of the multiple pixels included in an image template the same size as the images in the predetermined image set.
  • the execution module 320 includes: a sampling sub-module 321, configured to sample the prior probability corresponding to at least some of the pixels included in the image to be processed; and an execution sub-module 322, configured to remove the pixels on the image to be processed whose sampled prior probability is greater than a preset threshold, to obtain the target image.
  • the positioning module 330 includes a second positioning sub-module 331, configured to input the image to be processed into the target neural network to obtain the absolute pose, in the world coordinate system, of the camera that collects the image to be processed.
  • the image to be processed includes at least two frames of images with a time sequence captured by the camera; for example, as shown in FIG. 20, the device further includes: a first determining module 340, configured to determine, according to the at least two frames of images, the relative pose of the camera when shooting the at least two frames of images; and a second determining module 350, configured to determine the corrected pose of the camera according to the relative pose and the absolute pose of the camera.
  • the second determining module 350 further includes: a third determining sub-module 351, configured to determine the deterministic probability of the absolute pose; a fourth determining sub-module 352, configured to determine the first weight of the relative pose and the second weight of the absolute pose according to the deterministic probability; and a fifth determining sub-module 353, configured to determine the corrected pose of the camera according to the relative pose, the first weight, the absolute pose, and the second weight.
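A minimal sketch of the weighted fusion performed by sub-modules 351 to 353 is shown below. Poses are simplified to translation vectors and the complementary weighting rule is an assumption; the disclosure does not fix the exact formula, and a full implementation would also fuse rotations (e.g. via quaternion interpolation):

```python
import numpy as np

def corrected_pose(propagated_pose, absolute_pose, certainty):
    """Fuse the pose propagated by the camera's relative motion with the
    network's absolute pose estimate.

    `certainty` is the deterministic probability of the absolute pose:
    it gives the second weight, and its complement gives the first
    weight for the relative-pose branch.
    """
    w_abs = certainty           # second weight, for the absolute pose
    w_rel = 1.0 - certainty     # first weight, for the relative branch
    return (w_rel * np.asarray(propagated_pose)
            + w_abs * np.asarray(absolute_pose))
```

When the network is fully certain the corrected pose equals the absolute pose; when the certainty is zero, only the relative-motion branch contributes.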
  • the present disclosure also provides a camera positioning device that can be applied to an electronic device; the electronic device can train a neural network to obtain a target neural network. After an image is subsequently input to the target neural network, the absolute pose, in the world coordinate system, of the camera that collected the image can be obtained.
  • FIG. 22 is a block diagram of a camera positioning device according to an exemplary embodiment of the present disclosure.
  • the device includes: an acquisition module 410, configured to acquire the prior probability that a movable object appears at each of the multiple pixels included in the image template; an execution module 420, configured to perform, according to the prior probability, the operation of discarding some pixels on an image to be processed the same size as the image template, to obtain a target image; and a positioning module 430, configured to determine, according to the target image, the absolute pose, in the world coordinate system, of the camera that collects the image to be processed.
  • the acquisition module 410 includes: a segmentation sub-module 411, configured to perform pixel-level semantic segmentation on each image in a predetermined image set; a first determination sub-module 412, configured to determine, according to the result of the pixel-level semantic segmentation, the first pixel points belonging to the movable object and the second pixel points belonging to the background in each image; and a second determination sub-module 413, configured to determine, based on the statistical distribution of the first pixel points and the second pixel points in each image, the prior probability that a movable object appears at each of the multiple pixels included in an image template the same size as the images in the predetermined image set.
  • the execution module 420 includes: a sampling sub-module 421, configured to sample the prior probability corresponding to at least some of the pixels included in the image to be processed; and an execution sub-module 422, configured to remove the pixels on the image to be processed whose sampled prior probability is greater than a preset threshold, to obtain the target image.
  • the positioning module 430 includes: a first processing sub-module 431, configured to extract the feature parameters in the target image via a neural network to obtain a feature extraction image; a second processing sub-module 432, configured to increase, in the preset spatial dimension and/or the preset channel dimension of the neural network, the weight values corresponding to the second pixel points belonging to the background in the feature extraction image; and a first positioning sub-module 433, configured to analyze the weight-adjusted feature extraction image via the neural network to obtain the absolute pose, in the world coordinate system, of the camera that collects the image to be processed.
  • the device further includes a training module 440, configured to adjust the network parameters of the neural network according to the difference between the absolute pose and the predetermined ground-truth pose of the camera that collects the image to be processed, so as to train the target neural network.
  • for the relevant parts, reference can be made to the description of the method embodiments.
  • the device embodiments described above are merely illustrative.
  • the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units.
  • Some or all of the modules can be selected according to actual needs to achieve the objectives of the solutions of the present disclosure. Those of ordinary skill in the art can understand and implement it without creative work.
  • the embodiment of the present disclosure also provides a computer-readable storage medium, the storage medium stores a computer program, and the computer program is used to execute any of the above-mentioned camera positioning methods.
  • the embodiment of the present disclosure also provides a camera positioning device, the device including: a processor; and a memory for storing instructions executable by the processor, wherein the processor is configured to call the executable instructions stored in the memory to implement any one of the aforementioned camera positioning methods.
  • the camera positioning device provided in the embodiments of the present disclosure can implement the method provided in any of the foregoing embodiments.
  • the camera positioning device can discard some pixels in an image to be processed that is the same size as the image template according to the prior probability that a movable object appears at each of the multiple pixels included in the image template, and then determine the absolute pose of the camera according to the obtained target image. This reduces the influence of the movement of objects in the scene where the camera collects images on the result of positioning the camera on the movable machine equipment, and improves the accuracy of camera positioning.
  • the camera positioning device provided by the embodiments of the present disclosure can be applied to movable machinery and equipment to locate cameras provided on the movable machinery and equipment. Since the movable machinery and equipment will move, the pose of the camera set on the equipment will change accordingly.
  • the accuracy of camera positioning can improve the accuracy of movable machine equipment when performing various tasks. For example, according to images of the forward environment collected by a camera installed on an unmanned vehicle, the current location information of the camera can be determined, and the current location of the vehicle can then be derived from the camera's location information, so that at least one kind of intelligent driving control, such as path planning, trajectory tracking, and collision warning, can be performed on the unmanned vehicle.
  • the camera positioning device provided by the present disclosure can also be used on electronic devices for training neural networks, such as cloud platforms, neural network training platforms, and the like.
  • the electronic device uses this method to train the neural network to obtain the target neural network. After the image is subsequently input to the target neural network, the absolute pose of the camera that collected the image in the world coordinate system can be obtained.
  • Fig. 27 is a schematic structural diagram of an electronic device 2700 according to an exemplary embodiment.
  • the electronic device 2700 may be a movable machine device or a cloud platform for training neural networks.
  • the electronic device 2700 includes a processing component 2722, which further includes one or more processors, and a memory resource represented by a memory 2732 for storing instructions executable by the processing component 2722, such as application programs.
  • the application program stored in the memory 2732 may include at least one module, and each module corresponds to a set of instructions.
  • the processing component 2722 is used to execute instructions to execute any of the aforementioned camera positioning methods.
  • the electronic device 2700 may further include a power component 2726 for performing power management of the electronic device 2700, a wired or wireless network interface 2750 for connecting the electronic device 2700 to a network, and an input output (I/O) interface 2758.
  • the electronic device 2700 can operate based on an operating system stored in the memory 2732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
  • when the electronic device 2700 is a movable machine device, the electronic device 2700 further includes a camera for capturing images.
  • when the electronic device 2700 is a cloud platform for training a neural network, the electronic device can communicate with a movable machine device through the input/output interface 2758.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

A camera positioning method and device, and a storage medium. The method comprises: obtaining a prior probability of a movable object appearing at each of multiple pixels comprised in an image template (110); performing, according to the prior probability, an operation of discarding some pixels on an image to be processed that has the same size as the image template to obtain a target image (120); and determining an absolute pose of the camera in a world coordinate system according to the target image (130).

Description

Camera positioning

Technical Field

The present disclosure relates to the field of computer vision, and in particular to a camera positioning method and device, and a storage medium.

Background

Visual positioning has a wide range of applications. In practical application environments, factors such as object movement may reduce the accuracy of visual positioning, or even directly cause visual positioning to fail.
Summary of the Invention

The present disclosure provides a camera positioning method and device, and a storage medium.

According to a first aspect of the embodiments of the present disclosure, there is provided a camera positioning method, the method including:

obtaining the prior probability that a movable object appears at each of multiple pixels included in an image template;

performing, according to the prior probability, an operation of discarding some pixels on an image to be processed that is the same size as the image template, to obtain a target image; and

determining, according to the target image, the absolute pose, in the world coordinate system, of the camera that collects the image to be processed.

According to a second aspect of the embodiments of the present disclosure, there is provided a camera positioning device, including:

an acquisition module, configured to obtain the prior probability that a movable object appears at each of multiple pixels included in an image template;

an execution module, configured to perform, according to the prior probability, an operation of discarding some pixels on an image to be processed that is the same size as the image template, to obtain a target image; and

a positioning module, configured to determine, according to the target image, the absolute pose, in the world coordinate system, of the camera that collects the image to be processed.

According to a third aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program, the computer program being used to execute the camera positioning method described in the first aspect.

According to a fourth aspect of the embodiments of the present disclosure, there is provided a camera positioning device, including: a processor; and a memory for storing instructions executable by the processor, wherein the processor is configured to call the executable instructions stored in the memory to implement the camera positioning method described in the first aspect.

In the embodiments, the prior probability that a movable object appears at each of the multiple pixels included in the image template can first be obtained; based on the prior probability, an operation of discarding some pixels is performed on an image to be processed that is the same size as the image template to obtain a target image; and the absolute pose of the camera in the world coordinate system is determined according to the target image. This reduces the influence of the movement of objects in the scene where the camera collects images on the result of positioning the camera on movable machine equipment, and improves the accuracy of camera positioning.

It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Description of the Drawings

Fig. 1 is a flowchart of a camera positioning method according to an exemplary embodiment of the present disclosure;

Fig. 2 is a flowchart of step 110 according to an exemplary embodiment of the present disclosure;

Fig. 3 is a schematic diagram of an image template according to an exemplary embodiment of the present disclosure;

Fig. 4 is a flowchart of step 120 according to an exemplary embodiment of the present disclosure;

Fig. 5 is a flowchart of a camera positioning method according to another exemplary embodiment of the present disclosure;

Fig. 6 is a flowchart of step 150 according to an exemplary embodiment of the present disclosure;

Fig. 7 is a schematic diagram of multiple absolute poses according to an exemplary embodiment of the present disclosure;

Fig. 8 is a schematic diagram of a process of determining a corrected pose according to an exemplary embodiment of the present disclosure;

Fig. 9 is a schematic diagram of an optimized pose graph according to an exemplary embodiment of the present disclosure;

Fig. 10 is a flowchart of a camera positioning method according to another exemplary embodiment of the present disclosure;

Fig. 11 is a flowchart of step 230 according to an exemplary embodiment of the present disclosure;

Figs. 12A and 12B are schematic diagrams of a self-attention mechanism according to an exemplary embodiment of the present disclosure;

Fig. 13A is a schematic diagram of an image to be processed according to an exemplary embodiment of the present disclosure;

Fig. 13B is a schematic diagram of a feature extraction image after weight value adjustment according to an exemplary embodiment of the present disclosure;

Fig. 14 is a flowchart of a camera positioning method according to another exemplary embodiment of the present disclosure;

Fig. 15 is a framework diagram of a target neural network according to an exemplary embodiment of the present disclosure;

Fig. 16 is a block diagram of a camera positioning device according to an exemplary embodiment of the present disclosure;

Fig. 17 is a block diagram of an acquisition module according to an exemplary embodiment of the present disclosure;

Fig. 18 is a block diagram of an execution module according to an exemplary embodiment of the present disclosure;

Fig. 19 is a block diagram of a positioning module according to an exemplary embodiment of the present disclosure;

Fig. 20 is a block diagram of a camera positioning device according to another exemplary embodiment of the present disclosure;

Fig. 21 is a block diagram of a second determining module according to an exemplary embodiment of the present disclosure;

Fig. 22 is a block diagram of a camera positioning device according to another exemplary embodiment of the present disclosure;

Fig. 23 is a block diagram of an acquisition module according to an exemplary embodiment of the present disclosure;

Fig. 24 is a block diagram of an execution module according to an exemplary embodiment of the present disclosure;

Fig. 25 is a block diagram of a positioning module according to an exemplary embodiment of the present disclosure;

Fig. 26 is a block diagram of a camera positioning device according to another exemplary embodiment of the present disclosure;

Fig. 27 is a schematic structural diagram of an electronic device according to an exemplary embodiment of the present disclosure.
具体实施方式Detailed ways
这里将详细地对示例性实施例进行说明,其示例表示在附图中。下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本公开相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本公开的一些方面相一致的装置和方法的例子。Here, exemplary embodiments will be described in detail, and examples thereof are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The implementation manners described in the following exemplary embodiments do not represent all implementation manners consistent with the present disclosure. Rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as detailed in the appended claims.
在本公开运行的术语是仅仅出于描述特定实施例的目的,而非旨在限制本公开。在本公开和所附权利要求书中所运行的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。还应当理解,本文中运行的术语“和/或”是指并包含一张或多张相关联的列出项目的任何或所有可能组合。The terms operating in the present disclosure are only for the purpose of describing specific embodiments, and are not intended to limit the present disclosure. The singular forms of "a", "said" and "the" used in this disclosure and the appended claims are also intended to include plural forms, unless the context clearly indicates other meanings. It should also be understood that the term "and/or" as used herein refers to and includes any or all possible combinations of one or more associated listed items.
应当理解，尽管在本公开可能采用术语第一、第二、第三等来描述各种信息，但这些信息不应限于这些术语。这些术语仅用来将同一类型的信息彼此区分开。例如，在不脱离本公开范围的情况下，第一信息也可以被称为第二信息，类似地，第二信息也可以被称为第一信息。取决于语境，如在此所使用的词语“如果”可以被解释成为“在……时”或“当……时”或“响应于确定”。It should be understood that although the terms first, second, third, etc. may be used in this disclosure to describe various information, the information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "upon", "when", or "in response to determining".
本公开实施例提供了一种相机定位方法，可以根据图像模板包括的多个像素点中每个像素点处出现可移动物体的先验概率，丢弃与图像模板等大的待处理图像中的部分像素点以得到目标图像，再根据目标图像确定相机的绝对位姿，降低了相机采集图像所在的场景中物体的移动对相机定位结果的影响，提升了相机定位的准确性和精度。The embodiments of the present disclosure provide a camera positioning method, which can, according to the prior probability of a movable object appearing at each of the multiple pixels included in an image template, discard some pixels of a to-be-processed image of the same size as the image template to obtain a target image, and then determine the absolute pose of the camera according to the target image. This reduces the influence of object movement in the scene where the camera captures images on the camera positioning result, and improves the accuracy and precision of camera positioning.
本公开实施例提供的相机定位方法可以应用在可移动机器设备上，对可移动机器设备上设置的相机进行定位。可移动机器设备包括但不限于设置了相机的无人机、无人驾驶车辆、机器人等。The camera positioning method provided by the embodiments of the present disclosure can be applied to movable machine equipment to position a camera provided on the movable machine equipment. Movable machine equipment includes, but is not limited to, drones, unmanned vehicles, and robots equipped with cameras.
由于可移动机器设备会发生移动，从而会造成设备上设置的相机的位姿随之发生改变。相机定位的准确性可以提高可移动机器设备执行各种任务时的准确度。例如，根据无人驾驶车辆上设置的相机所采集的车辆前向环境的图像，可确定相机当前的定位信息，并根据相机的定位信息来定位车辆当前的定位信息，进而可对该无人驾驶车辆进行路径规划、轨迹跟踪、碰撞预警等至少一种智能驾驶控制。Since movable machine equipment moves, the pose of the camera provided on the equipment changes accordingly. Accurate camera positioning can improve the accuracy of the movable machine equipment when performing various tasks. For example, based on an image of the forward environment of an unmanned vehicle captured by a camera installed on the vehicle, the current positioning information of the camera can be determined, and the current positioning information of the vehicle can then be determined from the camera's positioning information; at least one type of intelligent driving control, such as path planning, trajectory tracking, or collision warning, can then be performed on the unmanned vehicle.
如图1所示,本公开实施例提供的相机定位方法可以包括以下步骤110-130:As shown in FIG. 1, the camera positioning method provided by the embodiment of the present disclosure may include the following steps 110-130:
在步骤110中,获取图像模板包括的多个像素点中每个像素点处出现可移动物体的先验概率。In step 110, the prior probability of a movable object appearing at each of the multiple pixels included in the image template is obtained.
本公开实施例中，图像模板可以是包括有与当前场景对应的、用于记录在与图像模板等大的图像上多个像素点中每个像素点处出现可移动物体的先验概率的模板。可移动物体包括但不限于各种可以自行移动或受控而移动的物体，例如巴士、小车、人、自行车、卡车、摩托车、动物等。先验概率是指通过对以往采集的与当前场景相同或相似的图像进行分析后，得到的该图像上每个像素点属于可移动物体的概率。如果某像素点对应的先验概率较高，说明针对场景采集的图像中在该像素点处出现可移动物体的可能性较高；反之，如果某像素点对应的先验概率较低，说明针对场景采集的图像中在该像素点处出现可移动物体的可能性较低。该图像模板可以反映出所采集的图像中不同的像素点处出现可移动物体的先验的可能性。In the embodiments of the present disclosure, the image template may be a template corresponding to the current scene, used to record the prior probability of a movable object appearing at each of multiple pixels on an image of the same size as the image template. Movable objects include, but are not limited to, various objects that can move on their own or under control, such as buses, cars, people, bicycles, trucks, motorcycles, and animals. The prior probability refers to the probability, obtained by analyzing previously captured images of scenes that are the same as or similar to the current scene, that each pixel of such an image belongs to a movable object. If the prior probability corresponding to a certain pixel is high, a movable object is likely to appear at that pixel in an image captured of the scene; conversely, if the prior probability corresponding to a certain pixel is low, a movable object is unlikely to appear at that pixel in an image captured of the scene. The image template can thus reflect the a priori likelihood of movable objects appearing at different pixels in the captured image.
可以针对与当前场景相同或相似的场景采集的图像集合，分析上述图像集合中每张图像上每个像素点处出现可移动物体的概率，并将这一概率作为当前场景对应的图像模板上每个像素点处出现可移动物体的先验概率。For a set of images captured in scenes that are the same as or similar to the current scene, the probability of a movable object appearing at each pixel of each image in the set can be analyzed, and this probability can be used as the prior probability of a movable object appearing at each pixel of the image template corresponding to the current scene.
例如，当前场景为无人驾驶车辆在城市主要街道行驶时，若对无人驾驶车辆上设置的相机进行定位，则在与当前场景相同或相似的场景采集的图像集合可以包括该城市主要街道的至少一张图像。For example, when the current scene is an unmanned vehicle driving on the main streets of a city, if the camera installed on the unmanned vehicle is to be positioned, the set of images captured in scenes that are the same as or similar to the current scene can include at least one image of the main streets of that city.
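The per-pixel statistic described above can be sketched as follows (an illustrative NumPy sketch, not the patent's implementation; it assumes each image in the set has already been reduced to a boolean segmentation mask marking movable-object pixels):

```python
import numpy as np

def movable_prior(masks):
    """Per-pixel prior probability of a movable object appearing.

    masks: list of H x W boolean arrays, one per image in the set;
    True marks pixels labeled as belonging to a movable object. The
    per-pixel frequency across the set serves as the prior probability
    recorded in the image template.
    """
    stack = np.stack(masks).astype(np.float64)   # N x H x W
    return stack.mean(axis=0)                    # frequency per pixel

# Two toy 2 x 2 segmentation masks: pixel (0, 0) is movable in both
# images, pixel (0, 1) in one of them, and the bottom row in neither.
m1 = np.array([[True, False], [False, False]])
m2 = np.array([[True, True], [False, False]])
M = movable_prior([m1, m2])   # the image template's prior probabilities
```

In this toy example `M` records a prior of 1.0 where both masks marked a movable object, 0.5 where only one did, and 0.0 for the background rows.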
在步骤120中,根据所述先验概率针对与所述图像模板等大的待处理图像执行丢弃部分像素点的操作,得到目标图像。In step 120, an operation of discarding some pixels is performed on an image to be processed that is as large as the image template according to the prior probability to obtain a target image.
待处理图像可以是可移动机器设备上设置的相机在该可移动机器设备移动过程中所采集到的至少一张图像。可移动机器设备可以按照与当前场景对应的图像模板上每个像素点对应的先验概率，对可移动机器设备上设置的相机所采集的与图像模板等大的至少一张图像，执行丢弃部分像素点的操作，从而得到目标图像。The to-be-processed image may be at least one image captured by a camera provided on movable machine equipment while the equipment is moving. The movable machine equipment can, according to the prior probability corresponding to each pixel of the image template corresponding to the current scene, perform the operation of discarding some pixels on at least one image, of the same size as the image template, captured by the camera provided on the movable machine equipment, thereby obtaining the target image.
在本公开实施例中，丢弃部分像素点的操作包括但不限于，对相机所采集的与图像模板等大的至少一张图像上先验概率的采样值大于预设值的像素点全部丢弃或随机部分丢弃。In the embodiments of the present disclosure, the operation of discarding some pixels includes, but is not limited to, discarding all, or randomly discarding some, of the pixels whose sampled prior-probability value is greater than a preset value on at least one image, of the same size as the image template, captured by the camera.
在步骤130中,根据所述目标图像确定采集所述待处理图像的相机在世界坐标系下的绝对位姿。In step 130, the absolute pose of the camera collecting the image to be processed in the world coordinate system is determined according to the target image.
例如，可移动机器设备可以根据目标图像，通过回归损失函数，确定相机在世界坐标系下的绝对位姿。其中，回归损失函数可以是均方误差损失函数（例如L2损失函数）、平均绝对误差损失函数（例如L1损失函数）、平滑平均绝对误差损失函数（例如Huber损失函数）、对数双曲余弦损失函数或分位数损失函数等。For example, the movable machine equipment can determine the absolute pose of the camera in the world coordinate system from the target image through a regression loss function. The regression loss function may be a mean square error loss function (e.g., an L2 loss function), a mean absolute error loss function (e.g., an L1 loss function), a smooth mean absolute error loss function (e.g., a Huber loss function), a log-hyperbolic-cosine loss function, a quantile loss function, or the like.
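As an illustration of one of the listed losses, a minimal NumPy sketch of the smooth mean absolute error (Huber) loss; the function name and the delta value are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def huber_loss(pred, target, delta=1.0):
    """Smooth mean absolute error (Huber) loss: quadratic for errors
    up to delta, linear beyond it."""
    err = np.abs(pred - target)
    quad = 0.5 * err ** 2                 # used where |error| <= delta
    lin = delta * (err - 0.5 * delta)     # used where |error| >  delta
    return float(np.where(err <= delta, quad, lin).mean())

small = huber_loss(np.array([0.5]), np.array([0.0]))   # quadratic region
large = huber_loss(np.array([2.0]), np.array([0.0]))   # linear region
```

The quadratic part keeps gradients smooth near zero error, while the linear part limits the influence of outliers, which is why such losses are common choices for pose regression.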
上述实施例中，可移动机器设备可以结合与当前场景对应的图像模板上多个像素点中每个像素点处出现可移动物体的先验概率，对当前场景下可移动机器设备上设置的相机所采集的至少一张图像进行部分像素点的丢弃以得到目标图像，并利用目标图像确定相机的绝对位姿，可有效降低当前场景中物体的移动对相机定位的负面影响，提升了相机定位的准确性和精度。In the above embodiment, the movable machine equipment can, in combination with the prior probability of a movable object appearing at each of the multiple pixels of the image template corresponding to the current scene, discard some pixels of at least one image captured in the current scene by the camera provided on the movable machine equipment to obtain a target image, and use the target image to determine the absolute pose of the camera. This can effectively reduce the negative impact of object movement in the current scene on camera positioning, and improves the accuracy and precision of camera positioning.
对于设置在可移动机器设备上的相机,其位姿可由于可移动机器设备的移动和/或相机的位置调整等因素而改变,从而需要对相机进行定位。本公开的发明人发现,如果在相机采集图像的视场中存在物体的移动,则该物体的移动会造成相机所采集图像的相应部分的成像质量不佳,例如出现图像模糊、抖动等,这些质量不佳的部分会影响所采集图像的整体特征的质量,进而影响基于图像整体特征进行相机定位的准确性和精度。然而,所采集图像中某些不动或固定物体对相机定位反而有用。For a camera set on a movable machine equipment, its pose may be changed due to factors such as the movement of the movable machine equipment and/or the position adjustment of the camera, so that the camera needs to be positioned. The inventor of the present disclosure found that if there is movement of an object in the field of view of the image captured by the camera, the movement of the object will cause poor imaging quality of the corresponding part of the image captured by the camera, such as image blur, jitter, etc. Poor quality parts will affect the quality of the overall features of the captured image, and further affect the accuracy and precision of camera positioning based on the overall features of the image. However, some immobile or fixed objects in the captured image are actually useful for camera positioning.
为此，本公开实施例通过结合先验知识确定所采集图像中各个像素点处出现可移动物体的概率（即先验概率），并基于所确定的概率对所采集图像执行部分像素点的丢弃，如丢弃部分出现可移动物体的先验概率较高的像素点，由此可减少这些像素点对图像整体质量的负面影响，从而有利于改善基于局部像素点丢弃后的图像的整体质量进行相机定位的精度。To this end, the embodiments of the present disclosure determine, using prior knowledge, the probability of a movable object appearing at each pixel of the captured image (i.e., the prior probability), and, based on the determined probability, discard some pixels of the captured image, such as some of the pixels with a higher prior probability of a movable object appearing. This reduces the negative impact of these pixels on the overall quality of the image, which helps improve the precision of camera positioning based on the overall quality of the image after local pixel discarding.
在一些可选实施例中,步骤110可以由电子设备执行,该电子设备可以是可移动机器设备,也可以是对神经网络进行训练的电子设备,例如云平台等,本公开对此不作限定。如图2所示,步骤110可以包括步骤111-113:In some optional embodiments, step 110 may be performed by an electronic device, which may be a mobile machine device, or an electronic device for training a neural network, such as a cloud platform, which is not limited in the present disclosure. As shown in Figure 2, step 110 may include steps 111-113:
在步骤111中,对与当前场景关联的预定图像集合中的每张图像进行像素级语义分割。In step 111, pixel-level semantic segmentation is performed on each image in a predetermined image set associated with the current scene.
本公开实施例中,与当前场景关联的预定图像集合包括在与当前场景相同或相似的场景中采集的多张图片。电子设备可以通过查找预定图像集合中每张图像上存在的内容,来获得每张图像的像素级语义分割结果。例如,假设当前场景为无人驾驶车辆在城市主要街道行驶,则与当前场景关联的预定图像集合可包括如图3所示的图像m1、m2……mN。In the embodiment of the present disclosure, the predetermined image set associated with the current scene includes multiple pictures collected in the same or similar scene as the current scene. The electronic device can obtain the pixel-level semantic segmentation result of each image by searching for the content existing on each image in the predetermined image set. For example, assuming that the current scene is an unmanned vehicle driving on a main street in a city, the predetermined image set associated with the current scene may include images m1, m2,...mN as shown in FIG. 3.
在步骤112中,根据像素级语义分割的结果确定所述每张图像中属于可移动物体的第一像素点和属于背景的第二像素点。In step 112, the first pixel belonging to the movable object and the second pixel belonging to the background in each image are determined according to the result of pixel-level semantic segmentation.
可选地,背景是图像中不可移动的物体,例如图像中除了确定为可移动物体之外的其他物体,例如天空、建筑物、树木、道路等。Optionally, the background is an immovable object in the image, for example, other objects in the image that are not determined to be movable objects, such as sky, buildings, trees, roads, etc.
在步骤113中，基于预定图像集合中每张图像上第一像素点和第二像素点的统计分布，确定与所述预定图像集合中的图像等大的图像模板包括的多个像素点中每个像素点处出现可移动物体的先验概率。In step 113, based on the statistical distribution of the first pixels and the second pixels in each image of the predetermined image set, the prior probability of a movable object appearing at each of the multiple pixels included in an image template of the same size as the images in the predetermined image set is determined.
在本公开实施例中，电子设备基于与当前场景关联的预定图像集合中每张图像上属于可移动物体的第一像素点以及属于背景的第二像素点的统计分布，得到与当前场景对应的图像模板，例如图3中的图像模板M，以记录在当前场景下采集的与图像模板等大的图像中每个像素点处出现可移动物体的先验概率。In the embodiments of the present disclosure, based on the statistical distribution of the first pixels belonging to movable objects and the second pixels belonging to the background in each image of the predetermined image set associated with the current scene, the electronic device obtains the image template corresponding to the current scene, such as the image template M in Fig. 3, which records the prior probability of a movable object appearing at each pixel of an image, of the same size as the image template, captured in the current scene.
本公开实施例中，图像模板上记录的每个像素点处出现可移动物体的先验概率是一个统计分布范围，而非一个固定值。在后续根据所述先验概率对与所述图像模板等大的待处理图像执行丢弃部分像素点的操作时，每次可以根据先验概率的统计分布范围丢弃不同的像素点，得到不同的目标图像。并且，根据多个不同的目标图像确定相机的绝对位姿，可得到更好的相机定位结果，尤其是在大规模城市交通场景中。In the embodiments of the present disclosure, the prior probability of a movable object appearing at each pixel recorded in the image template is a statistical distribution rather than a fixed value. When the operation of discarding some pixels is subsequently performed on a to-be-processed image of the same size as the image template according to the prior probability, different pixels can be discarded each time according to the statistical distribution of the prior probability, yielding different target images. Moreover, determining the absolute pose of the camera from multiple different target images can produce better camera positioning results, especially in large-scale urban traffic scenes.
可选地,图像模板包括的每个像素点处出现可移动物体的先验概率可以符合高斯分布,如公式1所示:Optionally, the prior probability that a movable object appears at each pixel included in the image template may conform to a Gaussian distribution, as shown in formula 1:
p(M(i,j)) ~ N(σ²(i,j), μ(i,j)),  公式1 Formula 1

其中，i表示图像模板上第i行的像素点，j表示图像模板上第j列的像素点，(i,j)对应像素点坐标，像素点(i,j)的数学期望为μ(i,j)：Here, i denotes a pixel in the i-th row of the image template and j a pixel in the j-th column, so (i,j) are the pixel coordinates; the mathematical expectation of pixel (i,j) is μ(i,j), given by:

[Formula image: PCTCN2020091768-appb-000001]

其中，N是像素点数目，像素点(i,j)的方差为σ²(i,j) = μ(i,j)(1-μ(i,j))，p(M(i,j))是像素点(i,j)的先验概率。Here, N is the number of pixels, the variance of pixel (i,j) is σ²(i,j) = μ(i,j)(1-μ(i,j)), and p(M(i,j)) is the prior probability of pixel (i,j).
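Reading Formula 1 as a normal distribution with mean μ(i,j) and variance σ²(i,j) = μ(i,j)(1-μ(i,j)), one sampling of the template's prior probabilities can be sketched as follows (an illustrative NumPy sketch; clipping to [0, 1] is an added assumption, since a Gaussian sample can fall outside the probability range):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior(mu):
    """Draw one sample of the per-pixel prior probability.

    mu: H x W array of per-pixel expectations mu(i,j) (the image
    template). Each pixel is sampled from N(mu, mu * (1 - mu)) and
    the result is clipped back into [0, 1].
    """
    sigma = np.sqrt(mu * (1.0 - mu))            # std dev per pixel
    return np.clip(rng.normal(mu, sigma), 0.0, 1.0)

mu = np.array([[0.9, 0.1], [0.5, 0.0]])         # toy template
p = sample_prior(mu)   # differs between calls, so repeated sampling
                       # can discard different pixels each time
```

A pixel with μ = 0 or μ = 1 has zero variance and always keeps its expectation, while pixels with intermediate μ vary from sampling to sampling, which is what makes the later multi-sampling step produce different target images.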
在一些可选实施例中,例如图4所示,步骤120可以包括:In some optional embodiments, such as shown in FIG. 4, step 120 may include:
在步骤121中,对所述待处理图像所包括的至少部分像素点对应的先验概率进行采样。In step 121, a priori probability corresponding to at least some pixels included in the image to be processed is sampled.
对于相机采集的至少一个待处理图像,每个待处理图像上每个像素点处出现可移动物体的先验概率的分布满足高斯分布。For at least one to-be-processed image collected by the camera, the distribution of the prior probability that a movable object appears at each pixel on each to-be-processed image satisfies a Gaussian distribution.
对于至少一个待处理图像中的每一个待处理图像，可移动机器设备可以对该待处理图像所包括的至少部分像素点对应的先验概率进行采样，得到本次采样后该待处理图像上的至少部分像素点对应的先验概率的采样值。For each of the at least one to-be-processed image, the movable machine equipment can sample the prior probabilities corresponding to at least some of the pixels included in the to-be-processed image, obtaining, for this sampling, the sampled prior-probability values corresponding to at least some of the pixels of the to-be-processed image.
在步骤122中,在所述待处理图像上去除先验概率的采样值大于预设阈值的像素点,得到与本次采样对应的目标图像。In step 122, pixel points whose a priori probability sampling value is greater than a preset threshold are removed from the image to be processed to obtain a target image corresponding to this sampling.
在本次采样结果中，如果待处理图像1上像素点1的先验概率的采样值大于预设阈值，那么认为像素点1属于可移动物体，可移动机器设备可以在待处理图像1上去除像素点1，从而得到与待处理图像1的本次采样对应的目标图像。In this sampling result, if the sampled prior-probability value of pixel 1 on to-be-processed image 1 is greater than the preset threshold, pixel 1 is considered to belong to a movable object, and the movable machine equipment can remove pixel 1 from to-be-processed image 1, thereby obtaining the target image corresponding to this sampling of to-be-processed image 1.
对于至少一个待处理图像中的每一个待处理图像，可移动机器设备可以对该待处理图像按照上述方式去除全部先验概率的采样值大于预设阈值的像素点，或者随机去除部分先验概率的采样值大于预设阈值的像素点，得到与该待处理图像的本次采样对应的目标图像。For each of the at least one to-be-processed image, the movable machine equipment can, in the above manner, remove all pixels whose sampled prior-probability value is greater than the preset threshold, or randomly remove some of those pixels, to obtain the target image corresponding to this sampling of the to-be-processed image.
在一些可选实施例中，可移动机器设备如果对待处理图像上像素点对应的先验概率进行多次采样，那么可以让同一个待处理图像上同一个像素点对应的先验概率的每次采样值不同，使得执行丢弃部分像素点的操作之后得到的多个目标图像两两之间存在至少一个不同的像素点。In some optional embodiments, if the movable machine equipment samples the prior probabilities corresponding to the pixels of a to-be-processed image multiple times, the sampled values of the prior probability corresponding to the same pixel of the same to-be-processed image can differ between samplings, so that among the multiple target images obtained after the pixel-discarding operation, any two differ in at least one pixel.
例如，在第一次采样时，待处理图像1上像素点1对应的先验概率的采样值为P1，第二次采样时，待处理图像1上像素点1对应的先验概率的采样值为P2，预设阈值为T。其中，P1<T<P2。则在第一次采样之后得到的目标图像保留像素点1，在第二次采样之后得到的目标图像需要去除像素点1。For example, in the first sampling, the sampled prior-probability value corresponding to pixel 1 on to-be-processed image 1 is P1; in the second sampling, it is P2; and the preset threshold is T, where P1 < T < P2. The target image obtained after the first sampling therefore retains pixel 1, while the target image obtained after the second sampling removes pixel 1.
通过上述过程，可以让可移动机器设备对同一个待处理图像上像素点对应的先验概率进行多次采样，并相应得到多个不同的目标图像用于进行相机定位，有利于保障最终得到的相机定位的准确性。Through the above process, the movable machine equipment can sample the prior probabilities corresponding to the pixels of the same to-be-processed image multiple times and accordingly obtain multiple different target images for camera positioning, which helps guarantee the accuracy of the final camera positioning.
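Steps 121-122 above can be sketched as follows (an illustrative NumPy sketch; zeroing out discarded pixels and the threshold value 0.5 are assumptions, since the patent does not fix how removed pixels are represented):

```python
import numpy as np

def drop_pixels(image, sampled_prior, threshold=0.5):
    """Remove (here: zero out) pixels whose sampled prior probability
    of belonging to a movable object exceeds the threshold (step 122).

    image: H x W (or H x W x C) array; sampled_prior: H x W array of
    sampled prior-probability values for this sampling.
    """
    keep = sampled_prior <= threshold            # pixels that survive
    if image.ndim == 3:
        keep = keep[..., None]                   # broadcast over channels
    return np.where(keep, image, 0)

img = np.arange(4.0).reshape(2, 2)               # toy to-be-processed image
prior = np.array([[0.9, 0.2], [0.6, 0.1]])       # one sampling of the prior
target = drop_pixels(img, prior)                 # left column is discarded
```

Calling `drop_pixels` with a fresh sampling of the prior each time yields a different `target` image per call, matching the multi-sampling behavior described above.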
在一些可选实施例中,步骤130可以包括:将所述待处理图像输入目标神经网络,得到所述相机在世界坐标系下的绝对位姿。In some optional embodiments, step 130 may include: inputting the to-be-processed image into a target neural network to obtain the absolute pose of the camera in the world coordinate system.
可移动机器设备可以将待处理图像输入目标神经网络,由目标神经网络直接输出采集该待处理图像的相机在世界坐标系下的绝对位姿。The mobile machine equipment can input the image to be processed into the target neural network, and the target neural network directly outputs the absolute pose of the camera that collects the image to be processed in the world coordinate system.
上述实施例中，可移动机器设备根据图像模板上每个像素点属于可移动物体的先验概率，丢弃了待处理图像上先验概率大于预设值的至少部分像素点，从而提升了相机定位的准确性。In the above embodiment, the movable machine equipment discards, according to the prior probability that each pixel of the image template belongs to a movable object, at least some pixels of the to-be-processed image whose prior probability is greater than a preset value, thereby improving the accuracy of camera positioning.
在一些可选实施例中,如果待处理图像包括所述相机采集的具有时间先后性、也即时序性的k帧图像(k是大于或等于2的整数),则如图5所示,所述方法还包括步骤140-150:In some optional embodiments, if the image to be processed includes k frames of images (k is an integer greater than or equal to 2) that are acquired by the camera in time sequence, that is, time sequence, as shown in FIG. 5, The method also includes steps 140-150:
在步骤140中,根据所述k帧图像确定所述相机在拍摄所述k帧图像时的相对位姿。In step 140, the relative pose of the camera when shooting the k frames of images is determined according to the k frames of images.
本公开实施例中，可移动机器设备可以通过视觉里程计方法，确定相机在采集第k帧图像时，相对于采集第k-1帧图像时的相对位姿。In the embodiments of the present disclosure, the movable machine equipment can determine, through a visual odometry method, the relative pose of the camera when capturing the k-th frame with respect to when capturing the (k-1)-th frame.
在步骤150中,根据所述相机的相对位姿和绝对位姿,确定所述相机的修正位姿。In step 150, the corrected pose of the camera is determined according to the relative pose and absolute pose of the camera.
本公开实施例中，可移动机器设备可以将相机在采集k帧图像中时序上最靠前的一帧图像（也被称为第一帧图像）时在世界坐标系中的绝对位姿作为参照，根据在采集与第一帧图像相邻的第二帧图像时相机的相对位姿以及绝对位姿，确定出相机的修正位姿。In the embodiments of the present disclosure, the movable machine equipment can take, as a reference, the absolute pose of the camera in the world coordinate system when capturing the temporally earliest of the k frames (also called the first frame), and determine the corrected pose of the camera according to the relative pose and the absolute pose of the camera when capturing the second frame adjacent to the first frame.
后续可移动机器设备可以根据修正位姿调整相机的位姿,从而降低场景中物体的移动对相机定位的影响,可有利于保障可移动机器设备执行各种任务的准确度。Subsequent movable machinery and equipment can adjust the pose of the camera according to the corrected pose, thereby reducing the impact of the movement of objects in the scene on the positioning of the camera, which can help ensure the accuracy of the movable machinery and equipment in performing various tasks.
在一些可选实施例中,如图6所示,步骤150可具体包括步骤151-153:In some optional embodiments, as shown in FIG. 6, step 150 may specifically include steps 151-153:
在步骤151中,确定所述绝对位姿的确定性概率。In step 151, the deterministic probability of the absolute pose is determined.
本公开实施例中,确定性概率是对所述绝对位姿的结果的准确程度评价。如果确定性概率越高,说明绝对位姿的结果越准确,否则说明绝对位姿的结果越不准确。In the embodiment of the present disclosure, the deterministic probability is an evaluation of the accuracy of the result of the absolute pose. If the probability of certainty is higher, the result of the absolute pose is more accurate, otherwise the result of the absolute pose is less accurate.
可移动机器设备可以采用随机采样的方法，例如蒙特卡洛法，对相机所采集的具有时序性的k帧图像对应的先验概率进行采样，得到多次采样的采样结果。k是大于或等于2的整数。The movable machine equipment can use a random sampling method, such as the Monte Carlo method, to sample the prior probabilities corresponding to the k sequential frames captured by the camera, obtaining the results of multiple samplings. Here, k is an integer greater than or equal to 2.
例如图7所示，可基于图像模板M包括的每个像素点的先验概率，对当前图像进行多次采样，并基于每次采样对应的目标图像分别确定该当前图像对应的多个绝对位姿。For example, as shown in Fig. 7, the current image can be sampled multiple times based on the prior probability of each pixel included in the image template M, and multiple absolute poses corresponding to the current image can be determined from the target image corresponding to each sampling.
根据当前图像对应的多个绝对位姿来确定当前图像对应的绝对位姿的确定性概率。例如，若当前图像对应的多个绝对位姿两两之间差异较大，则可以确定当前图像对应的绝对位姿的确定性概率较低，反之则确定当前图像对应的绝对位姿的确定性概率较高。The deterministic probability of the absolute pose corresponding to the current image is determined from the multiple absolute poses corresponding to the current image. For example, if the multiple absolute poses corresponding to the current image differ greatly from one another, it can be determined that the deterministic probability of the absolute pose corresponding to the current image is low; otherwise, the deterministic probability is high.
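One way to map the spread among the sampled absolute poses to a certainty score is sketched below; the exponential mapping, the scale parameter, and restricting poses to translations are illustrative assumptions, since the text only requires that a larger spread yield a lower deterministic probability:

```python
import numpy as np

def pose_certainty(poses, scale=1.0):
    """Map the spread among several absolute-pose estimates of the
    same image to a certainty score in (0, 1].

    poses: list of 1-D arrays (here, hypothetical [x, y, z]
    translations) obtained from different samplings of the same
    to-be-processed image.
    """
    spread = np.stack(poses).std(axis=0).mean()  # large spread -> low score
    return float(np.exp(-spread / scale))

# Four identical estimates: no spread, maximal certainty.
consistent = pose_certainty([np.array([1.0, 2.0, 0.5])] * 4)
# Two widely differing estimates: lower certainty.
scattered = pose_certainty([np.array([1.0, 2.0, 0.5]),
                            np.array([4.0, -1.0, 2.5])])
```

Identical estimates give a certainty of 1, and the score decays monotonically as the estimates disagree more.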
在步骤152中,根据所述绝对位姿的确定性概率确定所述相对位姿的第一权重和所述绝对位姿的第二权重。In step 152, the first weight of the relative pose and the second weight of the absolute pose are determined according to the deterministic probability of the absolute pose.
本公开实施例中，对于相机采集的具有时序性的k帧图像，可移动机器设备可以根据每帧图像对应的绝对位姿的确定性概率来确定每帧图像对应的相对位姿的第一权重以及每帧图像对应的绝对位姿的第二权重。In the embodiments of the present disclosure, for the k sequential frames captured by the camera, the movable machine equipment can determine, according to the deterministic probability of the absolute pose corresponding to each frame, the first weight of the relative pose corresponding to that frame and the second weight of the absolute pose corresponding to that frame.
例如，如果当前图像对应的绝对位姿的确定性概率较高，则可以提高该当前图像对应的绝对位姿的第二权重；如果当前图像对应的绝对位姿的确定性概率较低，可以提高该当前图像对应的相对位姿的第一权重。For example, if the deterministic probability of the absolute pose corresponding to the current image is high, the second weight of the absolute pose corresponding to the current image can be increased; if the deterministic probability is low, the first weight of the relative pose corresponding to the current image can be increased.
在步骤153中,根据所述相对位姿、所述第一权重、所述绝对位姿和所述第二权重,确定所述相机的修正位姿。In step 153, the corrected pose of the camera is determined according to the relative pose, the first weight, the absolute pose, and the second weight.
本公开实施例中，例如图8所示，以具有时序性的k帧图像中第一帧图像对应的绝对位姿为参考，采用滑动窗口的方式依次进行移动，根据第二帧图像对应的相对位姿、第一权重、绝对位姿和第二权重，确定出第二帧图像相对于第一帧图像的修正位姿。In the embodiments of the present disclosure, for example as shown in Fig. 8, taking the absolute pose corresponding to the first of the k sequential frames as a reference and moving a sliding window in sequence, the corrected pose of the second frame relative to the first frame is determined according to the relative pose, the first weight, the absolute pose, and the second weight corresponding to the second frame.
本公开实施例中,如果相对位姿较为准确,则可以提高相对位姿的权重,如果绝对位姿较为准确,可以提高绝对位姿的权重。这样,通过使相对位姿和绝对位姿各自具有不同权重来确定修正位姿,可使得修正位姿更加准确,也就使得相机定位更加准确。In the embodiments of the present disclosure, if the relative pose is more accurate, the weight of the relative pose can be increased, and if the absolute pose is more accurate, the weight of the absolute pose can be increased. In this way, by making the relative pose and the absolute pose each have different weights to determine the corrected pose, the corrected pose can be made more accurate, and the camera positioning can be more accurate.
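The weighted correction of steps 152-153 can be sketched for the translation component as follows (an illustrative sketch; deriving the two weights directly from the certainty score, and restricting to translation rather than full pose, are assumptions not fixed by the patent):

```python
import numpy as np

def corrected_pose(prev_abs, relative, absolute, certainty):
    """Fuse the VO prediction with the network's absolute estimate.

    prev_abs:  absolute position at the previous (reference) frame.
    relative:  VO displacement from the previous frame to this one.
    absolute:  network's absolute position estimate for this frame.
    certainty: deterministic probability of the absolute pose, in [0, 1].
    A more certain absolute pose gets a larger second weight, as the
    text requires; the exact weighting scheme here is hypothetical.
    """
    w_abs = certainty                    # second weight (absolute pose)
    w_rel = 1.0 - certainty              # first weight (relative pose)
    vo_prediction = prev_abs + relative  # pose implied by the relative pose
    return w_rel * vo_prediction + w_abs * absolute

prev = np.array([0.0, 0.0])              # reference absolute position
rel = np.array([1.0, 0.0])               # VO relative displacement
net = np.array([1.2, 0.2])               # network's absolute estimate
fused = corrected_pose(prev, rel, net, certainty=0.5)
```

With certainty 1 the correction reduces to the network's absolute estimate; with certainty 0 it reduces to the VO prediction; intermediate values blend the two, matching the weighting behavior described above.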
通过修正位姿，对最终确定的相机的位姿图进行优化，优化后的位姿图可如图9所示，图9中的三角形代表了相机采集每一帧图像时的绝对位姿，带箭头的线段代表相对位姿，圆圈代表滑动窗口。图9中修正后的绝对位姿和相对位姿按照箭头方向依次对应图8中由左上角到右下角的绝对位姿和相对位姿。The corrected poses are used to optimize the final pose graph of the camera. The optimized pose graph can be as shown in Fig. 9, where the triangles represent the absolute pose of the camera when capturing each frame, the line segments with arrows represent relative poses, and the circles represent sliding windows. The corrected absolute and relative poses in Fig. 9 correspond, in arrow order, to the absolute and relative poses from the upper left corner to the lower right corner in Fig. 8.
上述实施例中，可以采用VO（Visual Odometry，视觉里程计）方法确定的位姿作为图像对应的相对位姿。其中，VO方法是通过分析上述k帧图像来确定相机的位置和姿态。通过对k帧图像进行特征匹配等方法估计相机在相邻帧间的运动，从而可获得相机在采集后一帧图像时相对于采集前一帧图像时的相对位姿。In the above embodiments, the pose determined by a VO (Visual Odometry) method may be used as the relative pose corresponding to an image. The VO method determines the position and attitude of the camera by analyzing the above k frames. By estimating the camera's motion between adjacent frames through feature matching and similar methods on the k frames, the relative pose of the camera when capturing a frame with respect to when capturing the preceding frame can be obtained.
进一步地,在本公开实施例中,结合绝对位姿和相对位姿进行位姿修正,进一步提升了相机定位的精确度。Further, in the embodiments of the present disclosure, the absolute pose and relative pose are combined to perform pose correction, which further improves the accuracy of camera positioning.
在一实施例中,本公开提供的相机定位方法还可以应用于对神经网络进行训练的电子设备上,例如云平台、神经网络训练平台等。由电子设备采用该方法对神经网络进行训练,得到目标神经网络。后续将图像输入目标神经网络之后,可以得到采集该图像的相机在世界坐标系下的绝对位姿。In an embodiment, the camera positioning method provided in the present disclosure can also be applied to electronic devices that train neural networks, such as cloud platforms, neural network training platforms, and so on. The electronic device uses this method to train the neural network to obtain the target neural network. After the image is subsequently input to the target neural network, the absolute pose of the camera that collected the image in the world coordinate system can be obtained.
如图10所示,本公开实施例提供的相机定位方法可以包括以下步骤210-230:As shown in FIG. 10, the camera positioning method provided by the embodiment of the present disclosure may include the following steps 210-230:
在步骤210中,获取图像模板包括的多个像素点中每个像素点处出现可移动物体的先验概率。In step 210, the prior probability that a movable object appears at each of the multiple pixels included in the image template is obtained.
在预定图像集合中的每张图像上，已知属于可移动物体的像素点。电子设备可以根据上述每张图像，分析每张图像每个像素点处出现可移动物体的概率，并将这一概率作为与每张图像等大的图像模板上每个像素点处出现可移动物体的先验概率。For each image in the predetermined image set, the pixels belonging to movable objects are known. Based on each of these images, the electronic device can analyze the probability of a movable object appearing at each pixel of the image, and use this probability as the prior probability of a movable object appearing at each pixel of an image template of the same size as each image.
在步骤220中,根据所述先验概率针对与所述图像模板等大的待处理图像执行丢弃部分像素点的操作,得到目标图像。In step 220, according to the prior probability, an operation of discarding part of pixels is performed on an image to be processed that is as large as the image template to obtain a target image.
待处理图像可以是至少一张样本图像,电子设备可以按照图像模板上每个像素点对应的先验概率,对至少一张样本图像执行丢弃部分像素点的操作,从而得到目标图像。The image to be processed may be at least one sample image, and the electronic device may perform the operation of discarding some pixels on the at least one sample image according to the prior probability corresponding to each pixel on the image template, so as to obtain the target image.
在本公开实施例中,丢弃部分像素点的操作包括但不限于对至少一张样本图像上先验概率的采样值大于预设值的像素点进行全部丢弃或随机部分丢弃的操作。In the embodiment of the present disclosure, the operation of discarding some pixels includes but is not limited to the operation of discarding all pixels or randomly partially discarding pixels on at least one sample image whose a priori probability sampling value is greater than a preset value.
在步骤230中,根据所述目标图像确定采集所述待处理图像的相机在世界坐标系下的绝对位姿。In step 230, the absolute pose of the camera that collects the image to be processed in the world coordinate system is determined according to the target image.
电子设备可以根据得到的目标图像,通过回归损失函数,确定采集至少一个样本图像的相机在世界坐标系下的绝对位姿。The electronic device can determine the absolute pose of the camera that collects at least one sample image in the world coordinate system through the regression loss function according to the obtained target image.
其中,回归损失函数可以是均方误差损失函数(例如L2损失函数)、平均绝对误差(例如L1损失函数)、平滑平均绝对误差损失函数(例如Huber损失函数)、对数双曲余弦损失函数和分位数损失函数等。Among them, the regression loss function can be a mean square error loss function (such as L2 loss function), average absolute error (such as L1 loss function), smooth average absolute error loss function (such as Huber loss function), log hyperbolic cosine loss function, and Quantile loss function, etc.
在一些可选实施例中,步骤210可以由对神经网络进行训练的电子设备执行,执行过程与图2中步骤110的执行过程一致,在此不再赘述。In some optional embodiments, step 210 may be performed by an electronic device that trains a neural network, and the execution process is the same as the execution process of step 110 in FIG. 2, and will not be repeated here.
在一些可选实施例中,步骤220可以由对神经网络进行训练的电子设备执行,执行过程与图4中步骤120的执行过程一致,在此也不再赘述。In some optional embodiments, step 220 may be performed by an electronic device that trains a neural network, and the execution process is the same as that of step 120 in FIG. 4, and will not be repeated here.
在一些可选实施例中,步骤230可以由对神经网络进行训练的电子设备执行,例如图11所示,步骤230可以包括步骤231-233:In some optional embodiments, step 230 may be performed by an electronic device that trains a neural network. For example, as shown in FIG. 11, step 230 may include steps 231-233:
在步骤231中,经神经网络提取所述目标图像中的特征参数,得到特征提取图像。In step 231, the feature parameters in the target image are extracted through a neural network to obtain a feature extraction image.
神经网络可以从至少一个目标图像中提取出每个目标图像的特征参数,从而得到与每个目标图像对应的特征提取图像。The neural network can extract feature parameters of each target image from at least one target image, thereby obtaining a feature extraction image corresponding to each target image.
在步骤232中,在所述神经网络的预设空间维度和/或预设通道维度上,增加所述特征提取图像中属于背景的第二像素点所对应的权重值。In step 232, on the preset spatial dimension and/or preset channel dimension of the neural network, the weight value corresponding to the second pixel point belonging to the background in the feature extraction image is increased.
神经网络可以在预设空间维度和预设通道维度的至少一个维度上,通过自注意力机制增加特征提取图像中属于背景的第二像素点的权重值。The neural network can increase the weight value of the second pixel point belonging to the background in the feature extraction image in at least one of the preset space dimension and the preset channel dimension through a self-attention mechanism.
例如图12A所示，神经网络将H（高度）×W（宽度）×C（通道）的某个特征提取图像采用空间自注意力机制变换后，得到同一通道上的图像H×W×1。再例如图12B所示，神经网络将H×W×C的某个特征提取图像采用通道自注意力机制变换后，得到相同高度和宽度的图像1×1×C。For example, as shown in FIG. 12A, the neural network transforms a feature extraction image of size H (height) × W (width) × C (channels) with a spatial self-attention mechanism to obtain an H × W × 1 image on a single channel. As another example, as shown in FIG. 12B, the neural network transforms an H × W × C feature extraction image with a channel self-attention mechanism to obtain a 1 × 1 × C image of unit height and width.
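The two attention shapes in FIGS. 12A and 12B can be illustrated with a toy computation. This sketch only reproduces the output shapes (H × W × 1 for spatial attention, 1 × 1 × C for channel attention); the actual attention scores in the disclosed network are learned, whereas a simple mean-plus-softmax is assumed here as a stand-in.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention_map(feat):
    # H x W x C -> H x W x 1: one weight per spatial location.
    score = feat.mean(axis=-1, keepdims=True)      # H x W x 1
    flat = softmax(score.reshape(-1), axis=0)      # normalize over H*W
    return flat.reshape(score.shape)

def channel_attention_map(feat):
    # H x W x C -> 1 x 1 x C: one weight per channel.
    score = feat.mean(axis=(0, 1), keepdims=True)  # 1 x 1 x C
    return softmax(score, axis=-1)                 # normalize over C
```

Either map can then be multiplied back onto the H × W × C feature image to reweight background pixels.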
神经网络通过自注意力机制,尽可能忽略属于可移动物体的第一像素点的信息,更加关注属于背景的第二像素点的信息。Through the self-attention mechanism, the neural network ignores the information of the first pixel of the movable object as much as possible, and pays more attention to the information of the second pixel of the background.
在神经网络的预设空间维度和预设通道维度上，增加图13A所示的图像上用实线方框圈出的第二像素点的权重值后，得到图13B所示的图像。图13B所示图像中被方框圈出的像素点的灰度值高于图13B所示图像中其他部分的像素点的灰度值。In the preset spatial dimension and the preset channel dimension of the neural network, after the weight values of the second pixels circled by the solid-line box in the image shown in FIG. 13A are increased, the image shown in FIG. 13B is obtained. The gray values of the pixels circled by the box in the image shown in FIG. 13B are higher than those of the pixels in the other parts of that image.
本公开实施例中，在图13A所示图像中，用虚线方框圈出的像素点属于可移动物体汽车，可以通过之前的步骤210获取与图13A所示图像等大的图像模板中每个像素点处出现可移动物体的先验概率，再通过步骤220丢弃掉图13A所示图像中先验概率的采样值大于预设阈值的像素点的全部或部分。In the embodiments of the present disclosure, the pixels circled by the dashed box in the image shown in FIG. 13A belong to a movable object, a car. Through the foregoing step 210, the prior probability of a movable object appearing at each pixel of an image template the same size as the image in FIG. 13A can be obtained; then, through step 220, all or some of the pixels in the image of FIG. 13A whose sampled prior probability is greater than the preset threshold are discarded.
进一步地，通过步骤232在两个维度上增加属于不可移动物体的权重值，使得神经网络更关注交通标志、电线杆等这些不可移动或者移动概率较低的物体，降低了相机采集图像所在的场景中物体的移动对可移动机器设备上的相机进行定位的结果的影响，提升了神经网络对相机进行定位的准确性和精度，提升了定位检测结果的鲁棒性。Further, step 232 increases the weight values belonging to immovable objects in the two dimensions, so that the neural network pays more attention to objects that are immovable or unlikely to move, such as traffic signs and utility poles. This reduces the influence of object movement in the scene captured by the camera on the result of positioning the camera on the movable machine device, improves the accuracy and precision with which the neural network positions the camera, and improves the robustness of the positioning result.
在步骤233中,经神经网络对权重值调整后的特征提取图像进行分析,得到采集所述待处理图像的相机在世界坐标系下的所述绝对位姿。In step 233, the feature extraction image adjusted by the weight value is analyzed by the neural network to obtain the absolute pose of the camera that collects the image to be processed in the world coordinate system.
本公开实施例中，神经网络可以通过回归损失函数，例如均方误差函数、绝对值误差函数等，对权重值调整后的特征提取图像进行分析，得到采集至少一个样本图像的相机在世界坐标系下的绝对位姿。In the embodiments of the present disclosure, the neural network may analyze the weight-adjusted feature extraction image through a regression loss function, such as a mean square error function or an absolute error function, to obtain the absolute pose, in the world coordinate system, of the camera that captured the at least one sample image.
在一些可选实施例中,例如图14所示,在进行神经网络训练的过程中,上述相机定位方法还包括步骤240:In some optional embodiments, such as shown in FIG. 14, during the process of neural network training, the above-mentioned camera positioning method further includes step 240:
在步骤240中，根据所述绝对位姿和预先确定的所述待处理图像的所述相机的位姿真值的差异，调整神经网络的网络参数，训练得到目标神经网络。In step 240, the network parameters of the neural network are adjusted according to the difference between the absolute pose and the predetermined ground-truth pose of the camera for the image to be processed, and the target neural network is obtained through training.
本公开实施例中，本步骤可以由对神经网络进行训练的电子设备执行。相机在采集与图像模板等大的至少一张样本图像时的位姿真值已知，电子设备可以根据神经网络输出的采集至少一张样本图像的相机在世界坐标系中的绝对位姿和已知的位姿真值的差异，调整神经网络的网络参数，让该神经网络的损失函数最小，最终训练得到所需要的目标神经网络。In the embodiments of the present disclosure, this step may be performed by an electronic device that trains the neural network. The ground-truth pose of the camera when capturing the at least one sample image, which is the same size as the image template, is known. The electronic device can adjust the network parameters of the neural network according to the difference between this known ground-truth pose and the absolute pose, output by the neural network, of the camera that captured the at least one sample image in the world coordinate system, so as to minimize the loss function of the neural network and finally train the desired target neural network.
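The parameter adjustment in step 240 can be sketched with a toy linear regressor standing in for the network: predict a pose from features, compare it with the known ground-truth pose, and take a gradient step that reduces the L2 loss. The linear model, feature shapes, and learning rate here are illustrative assumptions, not the disclosed architecture.

```python
import numpy as np

def train_step(weights, features, pose_gt, lr=0.1):
    # One update: predict poses, measure the L2 difference against the
    # known ground-truth poses, and move the parameters downhill.
    pred = features @ weights
    grad = 2 * features.T @ (pred - pose_gt) / len(features)
    return weights - lr * grad
```

Iterating such steps drives the loss toward its minimum, which is the training objective described above.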
在一些可选实施例中，本公开实施例基于上述相机定位方法，还提供了一种目标神经网络的框架图，例如图15所示，包括Probabilistic Dropout Module（部分像素点丢弃模块）、Feature Extractor Module（特征提取模块）、Self-attention Module（自注意力模块）和Regressor Module（回归模块）。In some optional embodiments, based on the above camera positioning method, the embodiments of the present disclosure further provide a framework diagram of the target neural network, as shown for example in FIG. 15, including a Probabilistic Dropout Module, a Feature Extractor Module, a Self-attention Module, and a Regressor Module.
其中,在目标神经网络的训练过程中,可以将至少一个样本图像作为部分像素点丢弃模块的输入值,部分像素点丢弃模块可以由顺序连接的至少五个子网络组成。每个子网络可以采用卷积层、Relu层、池化层等按照预设顺序设置的网络单元单独实现。In the training process of the target neural network, at least one sample image may be used as the input value of the partial pixel discarding module, and the partial pixel discarding module may be composed of at least five sub-networks connected in sequence. Each sub-network can be implemented separately by using network units set in a preset order, such as a convolutional layer, a Relu layer, and a pooling layer.
第一子网络可以对至少一张样本图像中的每张图像分别进行像素级语义分割；第二子网络可以根据像素级语义分割的结果，确定每张样本图像中属于所述可移动物体的第一像素点和属于背景的第二像素点；第三子网络可以基于每张样本图像中所述第一像素点和所述第二像素点的统计分布，确定与样本图像等大的图像模板包括的多个像素点中每个像素点处出现所述可移动物体的先验概率；第四子网络可以对至少一张样本图像所包括的至少部分像素点对应的先验概率进行采样，得到本次采样的采样结果；第五子网络可以根据本次采样结果，在至少一张样本图像去除先验概率的采样值大于预设阈值T的像素点，得到所述目标图像。The first sub-network can perform pixel-level semantic segmentation on each of the at least one sample image; the second sub-network can determine, according to the result of the pixel-level semantic segmentation, the first pixels belonging to the movable object and the second pixels belonging to the background in each sample image; the third sub-network can determine, based on the statistical distribution of the first pixels and the second pixels in each sample image, the prior probability of the movable object appearing at each of the multiple pixels of an image template the same size as the sample images; the fourth sub-network can sample the prior probabilities corresponding to at least some of the pixels of the at least one sample image to obtain a sampling result; and the fifth sub-network can, according to the sampling result, remove from the at least one sample image the pixels whose sampled prior probability is greater than a preset threshold T, to obtain the target image.
特征提取模块可以采用卷积层、Relu层、池化层等按照预设顺序设置的网络单元按照预设的结构堆叠设计而得,提取Probabilistic Dropout Module得到的目标图像中的特征参数,得到特征提取图像。The feature extraction module can be designed by stacking network units set in a preset order such as convolutional layer, Relu layer, pooling layer, etc. according to the preset structure, and extract the feature parameters in the target image obtained by Probabilistic Dropout Module to obtain feature extraction image.
自注意力模块同样可以采用至少两个单独的第五子网络和第六子网络组成，每个子网络包括卷积层、Relu层、池化层等按照预设顺序设置的网络单元，其中第五子网络可以关注预设空间维度，第六子网络可以关注预设通道维度，经过上述两个子网络后可以调整特征提取图像中属于背景的第二像素点的权重值。本公开实施例不限定第五子网络和第六子网络的先后顺序。The self-attention module may likewise be composed of at least two separate sub-networks, a fifth sub-network and a sixth sub-network, each including network units such as convolutional layers, ReLU layers, and pooling layers arranged in a preset order. The fifth sub-network can attend to the preset spatial dimension, and the sixth sub-network can attend to the preset channel dimension; after passing through these two sub-networks, the weight values of the second pixels belonging to the background in the feature extraction image can be adjusted. The embodiments of the present disclosure do not limit the order of the fifth and sixth sub-networks.
回归模块可以包括第七子网络，第七子网络可以包括卷积层、Relu层、池化层等按照预设顺序设置的网络单元，第七子网络以自注意力模块输出的图像作为输入值，将已知的采集至少一张样本图像的相机的位姿作为输出值，第七子网络对应一回归损失函数。该回归损失函数可以包括均方误差损失函数（例如L2损失函数）、平均绝对误差损失函数（例如L1损失函数）、平滑平均绝对误差损失函数（例如Huber损失函数）、对数双曲余弦损失函数和分位数损失函数等。The regression module may include a seventh sub-network, which may include network units such as convolutional layers, ReLU layers, and pooling layers arranged in a preset order. The seventh sub-network takes the image output by the self-attention module as its input and the known pose of the camera that captured the at least one sample image as its output, and corresponds to a regression loss function. The regression loss function may include a mean square error loss function (e.g., the L2 loss), a mean absolute error loss function (e.g., the L1 loss), a smooth mean absolute error loss function (e.g., the Huber loss), a log-cosh loss function, a quantile loss function, and the like.
上述实施例中，最终得到的目标神经网络降低了对样本图像上可移动物体的关注，更多的关注样本图像上属于背景的像素点，即不动或固定物体的信息，通过减小可移动物体对应的像素点对图像整体的成像质量的影响，提升了目标神经网络的鲁棒性。In the above embodiments, the finally obtained target neural network pays less attention to the movable objects in the sample images and more attention to the pixels belonging to the background, i.e., to the information of stationary or fixed objects. By reducing the influence of the pixels corresponding to movable objects on the overall quality of the image, the robustness of the target neural network is improved.
与前述方法实施例相对应,本公开还提供了相机定位装置的实施例。Corresponding to the foregoing method embodiment, the present disclosure also provides an embodiment of a camera positioning device.
本公开实施例还提供了一种相机定位装置,可以应用于可移动机器设备,由于可移动电子设备会发生移动,从而会造成可移动机器设备上设置的相机的位姿随之发生改变。相机定位的高准确性可以提高可移动机器设备执行各种任务时的准确度。The embodiments of the present disclosure also provide a camera positioning device, which can be applied to movable machinery and equipment. Since the movable electronic equipment will move, the pose of the camera set on the movable machinery and equipment will change accordingly. The high accuracy of camera positioning can improve the accuracy of mobile machinery and equipment when performing various tasks.
如图16所示，图16是本公开根据一示例性实施例示出的一种相机定位装置框图，该装置包括：获取模块310，用于获取图像模板包括的多个像素点中每个像素点处出现可移动物体的先验概率；执行模块320，用于根据所述先验概率针对与所述图像模板等大的待处理图像执行丢弃部分像素点的操作，得到目标图像；定位模块330，用于根据所述目标图像确定采集所述待处理图像的相机在世界坐标系下的绝对位姿。As shown in FIG. 16, FIG. 16 is a block diagram of a camera positioning apparatus according to an exemplary embodiment of the present disclosure. The apparatus includes: an acquisition module 310, configured to acquire the prior probability of a movable object appearing at each of the multiple pixels included in an image template; an execution module 320, configured to perform, according to the prior probability, an operation of discarding some pixels on an image to be processed that is the same size as the image template, to obtain a target image; and a positioning module 330, configured to determine, according to the target image, the absolute pose, in the world coordinate system, of the camera that captured the image to be processed.
在一些实施例中，例如图17所示，所述获取模块310包括：分割子模块311，用于对预定图像集合中的每张图像进行像素级语义分割；第一确定子模块312，用于根据像素级语义分割的结果确定所述每张图像中属于可移动物体的第一像素点和属于背景的第二像素点；第二确定子模块313，用于基于所述每张图像中所述第一像素点和所述第二像素点的统计分布，确定与所述预定图像集合中的图像等大的图像模板包括的多个像素点中每个像素点处出现所述可移动物体的所述先验概率。In some embodiments, as shown for example in FIG. 17, the acquisition module 310 includes: a segmentation sub-module 311, configured to perform pixel-level semantic segmentation on each image in a predetermined image set; a first determination sub-module 312, configured to determine, according to the result of the pixel-level semantic segmentation, the first pixels belonging to the movable object and the second pixels belonging to the background in each image; and a second determination sub-module 313, configured to determine, based on the statistical distribution of the first pixels and the second pixels in each image, the prior probability of the movable object appearing at each of the multiple pixels of an image template the same size as the images in the predetermined image set.
在一些实施例中，例如图18所示，所述执行模块320包括：采样子模块321，用于对所述待处理图像所包括的至少部分像素点对应的所述先验概率进行采样；执行子模块322，用于在所述待处理图像上去除先验概率的采样值大于预设阈值的像素点，得到所述目标图像。In some embodiments, as shown for example in FIG. 18, the execution module 320 includes: a sampling sub-module 321, configured to sample the prior probabilities corresponding to at least some of the pixels included in the image to be processed; and an execution sub-module 322, configured to remove, from the image to be processed, the pixels whose sampled prior probability is greater than a preset threshold, to obtain the target image.
在一些实施例中,采样次数为多次时,执行丢弃部分像素点的操作之后得到的多个目标图像两两之间存在至少一个不同的像素点。In some embodiments, when the number of sampling times is multiple, there is at least one different pixel point between each of the multiple target images obtained after the operation of discarding some pixels.
在一些实施例中，如图19所示，所述定位模块330包括：第二定位子模块331，用于将所述待处理图像输入所述目标神经网络，得到所述待处理图像的相机在世界坐标系下的所述绝对位姿。In some embodiments, as shown in FIG. 19, the positioning module 330 includes a second positioning sub-module 331, configured to input the image to be processed into the target neural network to obtain the absolute pose, in the world coordinate system, of the camera of the image to be processed.
在一些实施例中，所述待处理图像包括所述相机采集的具有时序性的至少两帧图像；例如图20所示，所述装置还包括：第一确定模块340，用于根据所述至少两帧图像确定所述相机在拍摄所述至少两帧图像时的相对位姿；第二确定模块350，用于根据所述相机的相对位姿和所述绝对位姿，确定所述相机的修正位姿。In some embodiments, the image to be processed includes at least two sequential frames of images captured by the camera. As shown for example in FIG. 20, the apparatus further includes: a first determination module 340, configured to determine, according to the at least two frames of images, the relative pose of the camera when capturing the at least two frames of images; and a second determination module 350, configured to determine a corrected pose of the camera according to the relative pose and the absolute pose of the camera.
在一些实施例中，例如图21所示，所述第二确定模块350还包括：第三确定子模块351，用于确定所述绝对位姿的确定性概率；第四确定子模块352，用于根据所述确定性概率确定所述相对位姿的第一权重和所述绝对位姿的第二权重；第五确定子模块353，用于根据所述相对位姿、所述第一权重、所述绝对位姿和所述第二权重，确定所述相机的修正位姿。In some embodiments, as shown for example in FIG. 21, the second determination module 350 further includes: a third determination sub-module 351, configured to determine a certainty probability of the absolute pose; a fourth determination sub-module 352, configured to determine a first weight of the relative pose and a second weight of the absolute pose according to the certainty probability; and a fifth determination sub-module 353, configured to determine the corrected pose of the camera according to the relative pose, the first weight, the absolute pose, and the second weight.
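One plausible reading of the weighted fusion performed by sub-modules 351 to 353 is sketched below, with poses simplified to translation vectors and a linear blend assumed; the disclosure does not specify the fusion formula, so both simplifications are illustrative assumptions.

```python
import numpy as np

def fuse_pose(relative_pose, absolute_pose, certainty):
    # The certainty probability of the absolute pose sets the second
    # weight; its complement serves as the first weight for the
    # relative pose. Poses are simplified to translation vectors.
    w_abs = certainty
    w_rel = 1.0 - certainty
    return w_rel * relative_pose + w_abs * absolute_pose
```

With full certainty the corrected pose equals the absolute pose; with zero certainty it falls back to the relative pose.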
在一些可选实施例中,本公开还提供了一种相机定位装置,可以应用于电子设备,该电子设备可以对神经网络进行训练,得到目标神经网络。后续将图像输入目标神经网络之后,可以得到采集该图像的相机在世界坐标系下的绝对位姿。In some optional embodiments, the present disclosure also provides a camera positioning device, which can be applied to an electronic device, and the electronic device can train a neural network to obtain a target neural network. After the image is subsequently input to the target neural network, the absolute pose of the camera that collected the image in the world coordinate system can be obtained.
如图22所示，图22是本公开根据一示例性实施例示出的一种相机定位装置框图，该装置包括：获取模块410，用于获取图像模板包括的多个像素点中每个像素点处出现可移动物体的先验概率；执行模块420，用于根据所述先验概率针对与所述图像模板等大的待处理图像执行丢弃部分像素点的操作，得到目标图像；定位模块430，用于根据所述目标图像确定采集所述待处理图像的相机在世界坐标系下的绝对位姿。As shown in FIG. 22, FIG. 22 is a block diagram of a camera positioning apparatus according to an exemplary embodiment of the present disclosure. The apparatus includes: an acquisition module 410, configured to acquire the prior probability of a movable object appearing at each of the multiple pixels included in an image template; an execution module 420, configured to perform, according to the prior probability, an operation of discarding some pixels on an image to be processed that is the same size as the image template, to obtain a target image; and a positioning module 430, configured to determine, according to the target image, the absolute pose, in the world coordinate system, of the camera that captured the image to be processed.
在一些实施例中，例如图23所示，所述获取模块410包括：分割子模块411，用于对预定图像集合中的每张图像进行像素级语义分割；第一确定子模块412，用于根据像素级语义分割的结果确定所述每张图像中属于可移动物体的第一像素点和属于背景的第二像素点；第二确定子模块413，用于基于所述每张图像中所述第一像素点和所述第二像素点的统计分布，确定与所述预定图像集合中的图像等大的图像模板包括的多个像素点中每个像素点处出现所述可移动物体的所述先验概率。In some embodiments, as shown for example in FIG. 23, the acquisition module 410 includes: a segmentation sub-module 411, configured to perform pixel-level semantic segmentation on each image in a predetermined image set; a first determination sub-module 412, configured to determine, according to the result of the pixel-level semantic segmentation, the first pixels belonging to the movable object and the second pixels belonging to the background in each image; and a second determination sub-module 413, configured to determine, based on the statistical distribution of the first pixels and the second pixels in each image, the prior probability of the movable object appearing at each of the multiple pixels of an image template the same size as the images in the predetermined image set.
在一些实施例中，例如图24所示，所述执行模块420包括：采样子模块421，用于对所述待处理图像所包括的至少部分像素点对应的所述先验概率进行采样；执行子模块422，用于在所述待处理图像上去除先验概率的采样值大于预设阈值的像素点，得到所述目标图像。In some embodiments, as shown for example in FIG. 24, the execution module 420 includes: a sampling sub-module 421, configured to sample the prior probabilities corresponding to at least some of the pixels included in the image to be processed; and an execution sub-module 422, configured to remove, from the image to be processed, the pixels whose sampled prior probability is greater than a preset threshold, to obtain the target image.
在一些实施例中,采样次数为多次时,执行丢弃部分像素点的操作之后得到的多个目标图像两两之间存在至少一个不同的像素点。In some embodiments, when the number of sampling times is multiple, there is at least one different pixel point between each of the multiple target images obtained after the operation of discarding some pixels.
在一些实施例中，例如图25所示，所述定位模块430包括：第一处理子模块431，用于经神经网络提取所述目标图像中的特征参数，得到特征提取图像；第二处理子模块432，用于在所述神经网络的预设空间维度和/或预设通道维度上，增加所述特征提取图像中属于背景的第二像素点所对应的权重值；第一定位子模块433，用于经神经网络对权重值调整后的特征提取图像进行分析，得到采集所述待处理图像的相机在世界坐标系下的所述绝对位姿。In some embodiments, as shown for example in FIG. 25, the positioning module 430 includes: a first processing sub-module 431, configured to extract feature parameters in the target image via the neural network to obtain a feature extraction image; a second processing sub-module 432, configured to increase, in the preset spatial dimension and/or the preset channel dimension of the neural network, the weight values corresponding to the second pixels belonging to the background in the feature extraction image; and a first positioning sub-module 433, configured to analyze, via the neural network, the weight-adjusted feature extraction image to obtain the absolute pose, in the world coordinate system, of the camera that captured the image to be processed.
在一些实施例中，例如图26所示，所述装置还包括：训练模块440，用于根据所述绝对位姿和预先确定的采集所述待处理图像的所述相机的位姿真值的差异，调整神经网络的网络参数，训练得到目标神经网络。In some embodiments, as shown for example in FIG. 26, the apparatus further includes: a training module 440, configured to adjust the network parameters of the neural network according to the difference between the absolute pose and the predetermined ground-truth pose of the camera that captured the image to be processed, to train and obtain the target neural network.
对于装置实施例而言，由于其基本对应于方法实施例，所以相关之处参见方法实施例的部分说明即可。以上所描述的装置实施例仅仅是示意性的，其中作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本公开方案的目的。本领域普通技术人员在不付出创造性劳动的情况下，即可以理解并实施。Since the apparatus embodiments basically correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts. The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the present disclosure. Those of ordinary skill in the art can understand and implement them without creative effort.
本公开实施例还提供了一种计算机可读存储介质,存储介质存储有计算机程序,计算机程序用于执行上述任一的相机定位方法。The embodiment of the present disclosure also provides a computer-readable storage medium, the storage medium stores a computer program, and the computer program is used to execute any of the above-mentioned camera positioning methods.
本公开实施例还提供了一种相机定位装置，装置包括：处理器；用于存储处理器可执行指令的存储器；其中，处理器用于调用存储器中存储的可执行指令，实现上述任一的相机定位方法。The embodiments of the present disclosure further provide a camera positioning apparatus, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to call the executable instructions stored in the memory to implement any of the above camera positioning methods.
本公开实施例中提供的相机定位装置可以实现上述任一个实施例提供的方法。该相机定位装置，可以根据图像模板包括的多个像素点中每个像素点处出现可移动物体的先验概率，丢弃与图像模板等大的待处理图像中的部分像素点，再根据得到的目标图像去确定相机的绝对位姿，降低了相机采集图像所在的场景中物体的移动对可移动机器设备上的相机进行定位的结果的影响，提升了相机定位的准确性。The camera positioning apparatus provided in the embodiments of the present disclosure can implement the method provided in any of the above embodiments. The camera positioning apparatus can discard some pixels in the image to be processed, which is the same size as the image template, according to the prior probability of a movable object appearing at each of the multiple pixels included in the image template, and then determine the absolute pose of the camera according to the obtained target image. This reduces the influence of object movement in the scene captured by the camera on the result of positioning the camera on the movable machine device, and improves the accuracy of camera positioning.
本公开实施例提供的相机定位装置可以应用在可移动机器设备上，对可移动机器设备上设置的相机进行定位。由于可移动机器设备会发生移动，从而会造成设备上设置的相机的位姿随之发生改变。相机定位的准确性可以提高可移动机器设备执行各种任务时的准确度。例如，根据无人驾驶车辆上设置的相机所采集的车辆前向环境的图像，可确定相机当前的定位信息，并根据相机的定位信息来定位车辆当前的定位信息，进而可对该无人驾驶车辆进行路径规划、轨迹跟踪、碰撞预警等至少一种智能驾驶控制。The camera positioning apparatus provided by the embodiments of the present disclosure can be applied to a movable machine device to position the camera arranged on it. Since the movable machine device moves, the pose of the camera arranged on it changes accordingly. The accuracy of camera positioning can improve the accuracy with which the movable machine device performs various tasks. For example, the current positioning information of a camera arranged on an unmanned vehicle can be determined from images of the vehicle's forward environment captured by the camera, and the vehicle's current position can in turn be determined from the camera's positioning information, so that at least one type of intelligent driving control, such as path planning, trajectory tracking, or collision warning, can be performed on the unmanned vehicle.
本公开提供的相机定位装置还可以用于对神经网络进行训练的电子设备上,例如云平台、神经网络训练平台等。由电子设备采用该方法对神经网络进行训练,得到目标神经网络。后续将图像输入目标神经网络之后,可以得到采集该图像的相机在世界坐标系下的绝对位姿。The camera positioning device provided by the present disclosure can also be used on electronic devices for training neural networks, such as cloud platforms, neural network training platforms, and the like. The electronic device uses this method to train the neural network to obtain the target neural network. After the image is subsequently input to the target neural network, the absolute pose of the camera that collected the image in the world coordinate system can be obtained.
如图27所示,图27是根据一示例性实施例示出的电子设备2700的结构示意图。该电子设备2700包括可移动机器设备和对神经网络进行训练的云平台。As shown in Fig. 27, Fig. 27 is a schematic structural diagram of an electronic device 2700 according to an exemplary embodiment. The electronic device 2700 includes movable machinery and a cloud platform for training neural networks.
参照图27,电子设备2700包括处理组件2722,其进一步包括一个或多个处理器,以及由存储器2732所代表的存储器资源,用于存储可由处理组件2722的执行的指令,例如应用程序。存储器2732中存储的应用程序可以包括至少一个模块,各个模块对应于一组指令。此外,处理组件2722用于执行指令,以执行上述任一的相机定位方法。27, the electronic device 2700 includes a processing component 2722, which further includes one or more processors, and a memory resource represented by a memory 2732 for storing instructions executable by the processing component 2722, such as application programs. The application program stored in the memory 2732 may include at least one module, and each module corresponds to a set of instructions. In addition, the processing component 2722 is used to execute instructions to execute any of the aforementioned camera positioning methods.
电子设备2700还可以包括电源组件2726用于执行电子设备2700的电源管理，有线或无线网络接口2750用于将电子设备2700连接到网络，和输入输出（I/O）接口2758。电子设备2700可以操作基于存储在存储器2732的操作系统，例如Windows ServerTM、Mac OS XTM、UnixTM、LinuxTM、FreeBSDTM或类似。当电子设备2700为可移动机器设备时，电子设备2700还包括用于采集图像的相机。当电子设备2700为对神经网络进行训练的云平台时，电子设备可以通过该输入输出接口2758与一可移动机器设备通信。The electronic device 2700 may further include a power component 2726 for performing power management of the electronic device 2700, a wired or wireless network interface 2750 for connecting the electronic device 2700 to a network, and an input/output (I/O) interface 2758. The electronic device 2700 can operate based on an operating system stored in the memory 2732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like. When the electronic device 2700 is a movable machine device, the electronic device 2700 further includes a camera for capturing images. When the electronic device 2700 is a cloud platform for training a neural network, the electronic device can communicate with a movable machine device through the input/output interface 2758.
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或者惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由下面的权利要求指出。Those skilled in the art will easily think of other embodiments of the present disclosure after considering the specification and practicing the invention disclosed herein. The present disclosure is intended to cover any variations, uses, or adaptive changes of the present disclosure. These variations, uses, or adaptive changes follow the general principles of the present disclosure and include common knowledge or conventional technical means in the technical field not disclosed in the present disclosure. . The description and the embodiments are to be regarded as exemplary only, and the true scope and spirit of the present disclosure are pointed out by the following claims.
以上所述仅为本公开的较佳实施例而已,并不用以限制本公开,凡在本公开的精神和原则之内,所做的任何修改、等同替换、改进等,均应包含在本公开保护的范围之内。The above descriptions are only preferred embodiments of the present disclosure, and are not intended to limit the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure shall be included in the present disclosure Within the scope of protection.

Claims (20)

  1. 一种相机定位方法,包括:A camera positioning method includes:
    获取图像模板包括的多个像素点中每个像素点处出现可移动物体的先验概率;Obtain the prior probability of a movable object appearing at each of the multiple pixels included in the image template;
    根据所述先验概率针对与所述图像模板等大的待处理图像执行丢弃部分像素点的操作,得到目标图像;Performing an operation of discarding some pixels for an image to be processed that is as large as the image template according to the prior probability to obtain a target image;
    根据所述目标图像确定采集所述待处理图像的相机在世界坐标系下的绝对位姿。The absolute pose of the camera that collects the image to be processed in the world coordinate system is determined according to the target image.
  2. 根据权利要求1所述的方法,其特征在于,获取所述图像模板包括的多个像素点中每个像素点处出现可移动物体的先验概率,包括:The method according to claim 1, wherein obtaining the prior probability of a movable object at each of the plurality of pixels included in the image template comprises:
    对预定图像集合中的每张图像进行像素级语义分割;Perform pixel-level semantic segmentation on each image in the predetermined image set;
    根据像素级语义分割的结果确定所述每张图像中属于可移动物体的第一像素点和属于背景的第二像素点;Determining, according to the result of pixel-level semantic segmentation, the first pixel that belongs to the movable object and the second pixel that belongs to the background in each image;
    基于所述每张图像中所述第一像素点和所述第二像素点的统计分布,确定与所述预定图像集合中的图像等大的图像模板包括的多个像素点中每个像素点处出现所述可移动物体的所述先验概率。Based on the statistical distribution of the first pixel point and the second pixel point in each image, determine each pixel point in a plurality of pixels included in an image template that is as large as an image in the predetermined image set The prior probability that the movable object appears at.
  3. 根据权利要求1或2所述的方法,其特征在于,根据所述先验概率对所述待处理图像执行丢弃部分像素点的操作,得到目标图像,包括:The method according to claim 1 or 2, wherein the step of discarding some pixels of the image to be processed according to the prior probability to obtain the target image comprises:
    对所述待处理图像所包括的至少部分像素点对应的先验概率进行采样;Sampling the prior probabilities corresponding to at least some of the pixels included in the image to be processed;
    在所述待处理图像上去除先验概率的采样值大于预设阈值的像素点,得到所述目标图像。Remove the pixel points with a priori probability sampling value greater than a preset threshold from the image to be processed to obtain the target image.
  4. 根据权利要求3所述的方法,其特征在于,采样次数为多次时,执行丢弃部分像素点的操作之后得到的多个目标图像两两之间存在至少一个不同的像素点。The method according to claim 3, wherein when the number of sampling times is multiple times, there is at least one different pixel point between the multiple target images obtained after the operation of discarding some pixels.
  5. The method according to any one of claims 1-4, wherein determining, according to the target image, the absolute pose, in the world coordinate system, of the camera that captured the image to be processed comprises:
    extracting feature parameters from the target image through a neural network to obtain a feature extraction image;
    increasing, in a preset spatial dimension and/or a preset channel dimension of the neural network, the weight values corresponding to the second pixels belonging to the background in the feature extraction image; and
    analyzing, by the neural network, the weight-adjusted feature extraction image to obtain the absolute pose, in the world coordinate system, of the camera that captured the image to be processed.
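The background reweighting above resembles a spatial attention mask over the feature extraction image. A minimal numpy sketch of the spatial-dimension case, assuming the weights are applied multiplicatively and with an invented `gain` hyperparameter (the patent does not specify how the weights are computed):

```python
import numpy as np

def reweight_background(features, background_mask, gain=2.0):
    """Boost feature responses at background ("second") pixels so a
    downstream pose regressor relies on static scene content rather
    than movable objects.

    features:        (C, H, W) feature extraction image
    background_mask: (H, W) boolean, True where the pixel is background
    gain:            assumed multiplier for background locations
    """
    w = np.where(background_mask, gain, 1.0)  # (H, W) spatial weight map
    return features * w[None, :, :]           # broadcast over all C channels
```

A channel-dimension variant would instead scale whole feature channels by a per-channel weight vector, in the spirit of channel attention.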
  6. The method according to claim 5, wherein after the weight-adjusted feature extraction image is analyzed by the neural network to obtain the absolute pose, in the world coordinate system, of the camera that captured the image to be processed, the method further comprises:
    adjusting network parameters of the neural network according to the difference between the absolute pose and a predetermined ground-truth pose of the camera that captured the image to be processed, to obtain a target neural network through training.
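The supervision signal here, the difference between the predicted absolute pose and the ground-truth pose, is commonly expressed in pose-regression networks as a translation term plus a weighted orientation term. The patent does not specify the loss; the PoseNet-style form below (including the `beta` weighting factor) is an assumption:

```python
import numpy as np

def pose_loss(pred_t, pred_q, true_t, true_q, beta=100.0):
    """One common absolute-pose training loss (assumed, not from the
    patent): Euclidean translation error plus a beta-weighted distance
    between unit quaternions representing orientation.
    """
    t_err = np.linalg.norm(pred_t - true_t)
    q_err = np.linalg.norm(pred_q / np.linalg.norm(pred_q)
                           - true_q / np.linalg.norm(true_q))
    return t_err + beta * q_err
```

The network parameters would then be adjusted by backpropagating this loss; a perfect prediction yields zero loss.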
  7. The method according to claim 6, wherein determining, according to the target image, the absolute pose, in the world coordinate system, of the camera that captured the image to be processed comprises:
    inputting the image to be processed into the target neural network to obtain the absolute pose, in the world coordinate system, of the camera that captured the image to be processed.
  8. The method according to any one of claims 1-7, wherein the image to be processed comprises at least two temporally ordered frames captured by the camera; and
    after determining, according to the target image, the absolute pose, in the world coordinate system, of the camera that captured the image to be processed, the method further comprises:
    determining, according to the at least two frames, the relative pose of the camera when capturing the at least two frames; and
    determining a corrected pose of the camera according to the relative pose and the absolute pose of the camera.
  9. The method according to claim 8, wherein determining the corrected pose of the camera according to the relative pose and the absolute pose of the camera comprises:
    determining a certainty probability of the absolute pose;
    determining a first weight for the relative pose and a second weight for the absolute pose according to the certainty probability; and
    determining the corrected pose of the camera according to the relative pose, the first weight, the absolute pose, and the second weight.
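The weighted correction above can be sketched for the translation component alone. Representing each pose as a simple vector and deriving both weights from a single certainty value in [0, 1] are simplifying assumptions; the patent leaves the weighting function unspecified:

```python
import numpy as np

def fuse_poses(relative_pose, absolute_pose, certainty):
    """Blend a relative-pose estimate with an absolute-pose estimate.
    The more certain the absolute estimate, the more it dominates:
    the second weight is the certainty itself and the first weight is
    its complement (an assumed scheme).
    """
    w_abs = certainty          # second weight, for the absolute pose
    w_rel = 1.0 - certainty    # first weight, for the relative pose
    return w_rel * np.asarray(relative_pose) + w_abs * np.asarray(absolute_pose)
```

With certainty 1.0 the corrected pose equals the absolute pose; with certainty 0.5 it is the midpoint of the two estimates. A full implementation would blend rotations on SO(3) (e.g. via quaternion slerp) rather than averaging vectors.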
  10. A camera positioning apparatus, comprising:
    an acquisition module, configured to acquire the prior probability that a movable object appears at each of a plurality of pixels included in an image template;
    an execution module, configured to perform, according to the prior probability, an operation of discarding some pixels of an image to be processed having the same size as the image template, to obtain a target image; and
    a positioning module, configured to determine, according to the target image, the absolute pose, in the world coordinate system, of the camera that captured the image to be processed.
  11. The apparatus according to claim 10, wherein the acquisition module comprises:
    a segmentation submodule, configured to perform pixel-level semantic segmentation on each image in a predetermined image set;
    a first determining submodule, configured to determine, according to the result of the pixel-level semantic segmentation, first pixels belonging to a movable object and second pixels belonging to the background in each image; and
    a second determining submodule, configured to determine, based on the statistical distribution of the first pixels and the second pixels in each image, the prior probability that the movable object appears at each of a plurality of pixels of an image template having the same size as the images in the predetermined image set.
  12. The apparatus according to claim 10 or 11, wherein the execution module comprises:
    a sampling submodule, configured to sample the prior probabilities corresponding to at least some of the pixels included in the image to be processed; and
    an execution submodule, configured to remove, from the image to be processed, pixels whose sampled prior probability is greater than a preset threshold, to obtain the target image.
  13. The apparatus according to claim 12, wherein, when sampling is performed multiple times, any two of the multiple target images obtained after the operation of discarding some pixels differ from each other in at least one pixel.
  14. The apparatus according to any one of claims 10-13, wherein the positioning module comprises:
    a first processing submodule, configured to extract feature parameters from the target image through a neural network to obtain a feature extraction image;
    a second processing submodule, configured to increase, in a preset spatial dimension and/or a preset channel dimension of the neural network, the weight values corresponding to the second pixels belonging to the background in the feature extraction image; and
    a first positioning submodule, configured to analyze, by the neural network, the weight-adjusted feature extraction image to obtain the absolute pose, in the world coordinate system, of the camera that captured the image to be processed.
  15. The apparatus according to claim 14, further comprising:
    a training module, configured to adjust network parameters of the neural network according to the difference between the absolute pose and a predetermined ground-truth pose of the camera that captured the image to be processed, to obtain a target neural network through training.
  16. The apparatus according to claim 15, wherein the positioning module comprises:
    a second positioning submodule, configured to input the image to be processed into the target neural network to obtain the absolute pose, in the world coordinate system, of the camera that captured the image to be processed.
  17. The apparatus according to any one of claims 10-16, wherein the image to be processed comprises at least two temporally ordered frames captured by the camera; and
    the apparatus further comprises:
    a first determining module, configured to determine, according to the at least two frames, the relative pose of the camera when capturing the at least two frames; and
    a second determining module, configured to determine a corrected pose of the camera according to the relative pose and the absolute pose of the camera.
  18. The apparatus according to claim 17, wherein the second determining module further comprises:
    a third determining submodule, configured to determine a certainty probability of the absolute pose;
    a fourth determining submodule, configured to determine a first weight for the relative pose and a second weight for the absolute pose according to the certainty probability; and
    a fifth determining submodule, configured to determine the corrected pose of the camera according to the relative pose, the first weight, the absolute pose, and the second weight.
  19. A computer-readable storage medium storing a computer program, wherein the computer program is used to execute the camera positioning method according to any one of claims 1-9.
  20. An electronic device, comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to call the executable instructions stored in the memory to implement the camera positioning method according to any one of claims 1-9.
PCT/CN2020/091768 2019-05-27 2020-05-22 Camera positioning WO2020238790A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021534170A JP2022513868A (en) 2019-05-27 2020-05-22 Camera positioning
KR1020217019918A KR20210095925A (en) 2019-05-27 2020-05-22 camera positioning

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910447759.7A CN112001968B (en) 2019-05-27 2019-05-27 Camera positioning method and device and storage medium
CN201910447759.7 2019-05-27

Publications (1)

Publication Number Publication Date
WO2020238790A1

Family

ID=73461260

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/091768 WO2020238790A1 (en) 2019-05-27 2020-05-22 Camera positioning

Country Status (4)

Country Link
JP (1) JP2022513868A (en)
KR (1) KR20210095925A (en)
CN (1) CN112001968B (en)
WO (1) WO2020238790A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885134A (en) * 2021-01-24 2021-06-01 成都智慧赋能科技有限公司 Big data-based smart city traffic management method
CN114118367A (en) * 2021-11-16 2022-03-01 上海脉衍人工智能科技有限公司 Method and device for constructing an incremental neural radiance field
CN114693776A (en) * 2022-03-25 2022-07-01 广东电网有限责任公司 Cable position information determining method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104978722A (en) * 2015-07-06 2015-10-14 天津大学 Multi-exposure image fusion ghosting removing method based on background modeling
CN105931275A (en) * 2016-05-23 2016-09-07 北京暴风魔镜科技有限公司 Monocular and IMU fused stable motion tracking method and device based on mobile terminal
CN108257177A (en) * 2018-01-15 2018-07-06 天津锋时互动科技有限公司深圳分公司 Alignment system and method based on space identification
CN109387204A (en) * 2018-09-26 2019-02-26 东北大学 Simultaneous localization and mapping method for mobile robots in indoor dynamic environments

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5333860B2 (en) * 2010-03-31 2013-11-06 アイシン・エィ・ダブリュ株式会社 Vehicle position detection system using landscape image recognition
JP5743849B2 (en) * 2011-10-27 2015-07-01 株式会社日立製作所 Video analysis apparatus and system
JP2016177388A (en) * 2015-03-18 2016-10-06 株式会社リコー Mobile object position/attitude measuring apparatus
JP6985897B2 (en) * 2017-01-06 2021-12-22 キヤノン株式会社 Information processing equipment and its control method, program
US11348274B2 (en) * 2017-01-23 2022-05-31 Oxford University Innovation Limited Determining the location of a mobile device
US10467756B2 (en) * 2017-05-14 2019-11-05 International Business Machines Corporation Systems and methods for determining a camera pose of an image
JP7043755B2 (en) * 2017-08-29 2022-03-30 ソニーグループ株式会社 Information processing equipment, information processing methods, programs, and mobiles
CN107833236B (en) * 2017-10-31 2020-06-26 中国科学院电子学研究所 Visual positioning system and method combining semantics under dynamic environment


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112885134A (en) * 2021-01-24 2021-06-01 成都智慧赋能科技有限公司 Big data-based smart city traffic management method
CN112885134B (en) * 2021-01-24 2023-05-16 陕西合友网络科技有限公司 Smart city traffic management method based on big data
CN114118367A (en) * 2021-11-16 2022-03-01 上海脉衍人工智能科技有限公司 Method and device for constructing an incremental neural radiance field
CN114118367B (en) * 2021-11-16 2024-03-29 上海脉衍人工智能科技有限公司 Method and device for constructing an incremental neural radiance field
CN114693776A (en) * 2022-03-25 2022-07-01 广东电网有限责任公司 Cable position information determining method, device, equipment and storage medium

Also Published As

Publication number Publication date
KR20210095925A (en) 2021-08-03
JP2022513868A (en) 2022-02-09
CN112001968B (en) 2022-07-15
CN112001968A (en) 2020-11-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20812552

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021534170

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20217019918

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20812552

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 27.06.2022)
