CN109947886B - Image processing method, image processing device, electronic equipment and storage medium - Google Patents

Image processing method, image processing device, electronic equipment and storage medium

Info

Publication number
CN109947886B
Authority
CN
China
Prior art keywords
image frame
image
frames
frame
key
Prior art date
Legal status
Active
Application number
CN201910209109.9A
Other languages
Chinese (zh)
Other versions
CN109947886A (en)
Inventor
张润泽
贾佳亚
戴宇荣
沈小勇
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910209109.9A priority Critical patent/CN109947886B/en
Publication of CN109947886A publication Critical patent/CN109947886A/en
Application granted granted Critical
Publication of CN109947886B publication Critical patent/CN109947886B/en

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an image processing method and device, an electronic device and a storage medium, and belongs to the field of computer technologies. The method comprises the following steps: acquiring a plurality of image frames; for any image frame, acquiring multiple items of position difference information corresponding to the image frame, the multiple items of position difference information being used to indicate the position difference of the matching feature points of the image frame and the previous key frame of the image frame; when any one of the multiple items of position difference information meets the accuracy requirement of scene creation, acquiring the image frame as a key frame; and creating a target virtual scene based on the acquired plurality of key frames. In this method, multiple items of position difference information are considered when key frames are acquired, and an image frame is acquired as a key frame whenever any one item meets the accuracy requirement of scene creation, so that enough key frames are obtained, the problem of an inaccurate virtual scene caused by positioning failure is avoided, and the accuracy of the virtual scene obtained by the image processing method is good.

Description

Image processing method, image processing device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
With the development of computer technology, the Simultaneous Localization And Mapping (SLAM) technology is widely applied in fields such as unmanned driving, robotics, virtual reality and augmented reality. For example, in the field of automatic driving, SLAM technology can process a plurality of image frames captured on a vehicle to create a scene.
Currently, image processing methods generally adopt an indirect SLAM technique based on ORB (Oriented FAST and Rotated BRIEF) features, i.e., the ORB-SLAM technique, in which, for a plurality of image frames, a plurality of key frames are selected from the plurality of image frames by an algorithm according to certain information about the feature points that each image frame matches against the previous key frame before that image frame, and a virtual scene corresponding to the plurality of image frames is created according to the plurality of key frames.
In this image processing method, the algorithm used for acquiring the key frames is rough, and the number of selected key frames is often insufficient, so that positioning of the device that acquires the image frames or of some landmark points fails, and the accuracy of the created virtual scene is poor.
Disclosure of Invention
The embodiment of the invention provides an image processing method, an image processing device, electronic equipment and a storage medium, and can solve the problem of poor accuracy of a virtual scene in the related art. The technical scheme is as follows:
in one aspect, an image processing method is provided, and the method includes:
acquiring a plurality of image frames;
for any image frame, acquiring a plurality of items of position difference information corresponding to the image frame, wherein the plurality of items of position difference information are used for representing the position difference of matched feature points of the image frame and a previous key frame of the image frame;
when any one of the plurality of items of position difference information meets the accuracy requirement of scene creation, acquiring any one image frame as a key frame;
and creating a target virtual scene based on the acquired key frames, wherein the target virtual scene is used for representing a scene corresponding to the image frames.
In one possible implementation, the method further includes:
for any feature point of any image frame, acquiring the spatial position of a matched feature point in a camera coordinate system of a second historical image frame according to the depth of the matched feature point of the any feature point in the second historical image frame, wherein the second historical image frame is the first image frame in a plurality of image frames matched with the any feature point;
acquiring a predicted relative direction of the any feature point relative to the camera position of the any image frame according to the spatial position of the matched feature point in a camera coordinate system of a second historical image frame, the camera pose of the second historical image frame and the relative camera pose of the any image frame relative to the second historical image frame;
acquiring a relative direction of the any feature point with respect to a camera position of the any image frame;
and taking the difference between the projected positions of the predicted relative direction and the relative direction in a plane perpendicular to the relative direction as the error of any one feature point.
In one aspect, an image processing apparatus is provided, the apparatus including:
the image acquisition module is used for acquiring a plurality of image frames;
the information acquisition module is used for acquiring a plurality of items of position difference information corresponding to any image frame, wherein the plurality of items of position difference information are used for representing the position difference of matched feature points of the image frame and a previous key frame of the image frame;
the image acquisition module is further used for acquiring any image frame as a key frame when any of the plurality of items of position difference information meets the accuracy requirement of scene creation;
the scene creating module is used for creating a target virtual scene based on the acquired key frames, and the target virtual scene is used for representing a scene corresponding to the image frames.
In one aspect, an electronic device is provided that includes one or more processors and one or more memories having at least one instruction stored therein, the instruction being loaded and executed by the one or more processors to implement operations performed by the image processing method.
In one aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, the instruction being loaded and executed by a processor to implement the operations performed by the image processing method.
In the embodiment of the invention, when key frames are acquired from a plurality of image frames, multiple items of position difference information between the feature points of any image frame and the matching feature points of the previous key frame before that image frame are considered when judging whether to acquire the image frame as a key frame. When any one of those items meets the accuracy requirement of scene creation, the image frame is acquired as a key frame, so that enough key frames are acquired, and the problem that the created virtual scene is inaccurate due to failure in positioning the device that acquires the image frames or the landmark points can be avoided; therefore, the accuracy of the target virtual scene obtained by the image processing method provided by the embodiment of the invention is good.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is an implementation environment of an image processing method according to an embodiment of the present invention;
FIG. 2 is a flow chart of an image processing method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an image frame according to an embodiment of the present invention;
fig. 4 is a schematic diagram of feature extraction of an image frame according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a feature matching process provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a motion trajectory of a vehicle and a sparse point cloud map according to an embodiment of the present invention;
FIG. 7 is a diagram illustrating an image processing flow according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an application scenario of an image processing method according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of a terminal according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is an implementation environment of an image processing method provided by an embodiment of the present invention, where the image processing method may include two implementation environments, and in one possible implementation, referring to fig. 1, the implementation environment may include an electronic device 101 and an electronic device 102. The electronic device 101 and the electronic device 102 may be connected through a data line or a wireless network for data interaction. The electronic device 101 has an image capturing function for capturing image frames, and the electronic device 102 has an image processing function for processing the image frames captured by the electronic device 101 to create a target virtual scene corresponding to the image frames.
The target virtual scene may include one or more types of data, and the target virtual scene may be set by a related technician as needed. In one possible embodiment, the electronic device 102 may create a virtual map of an object corresponding to an acquired image frame based on the image frame. In another possible embodiment, the electronic device 102 may also position the electronic device 101 based on the acquired image frames, so as to obtain a target motion trajectory of the electronic device 101. In yet another possible embodiment, the electronic device 102 may both create a virtual map of the target and obtain a motion trajectory of the target.
In a specific embodiment, the electronic device 101 may also be installed on a positioning target, for example, when a vehicle is desired to be positioned, the electronic device 101 may be installed on the vehicle, so that the position of the electronic device 101 is the position of the positioning target, and thus, the electronic device 102 may position the electronic device 101 to achieve positioning of the positioning target. For example, in the field of automatic driving, the electronic device 101 mounted on the vehicle may capture image frames and transmit the captured image frames to the electronic device 102, and the electronic device 102 may process the captured image frames to create a virtual map around a place where the vehicle travels, locate the vehicle, or both locate and create a virtual map around the place.
In another possible implementation manner, referring to fig. 1, the implementation environment may include only the electronic device 102, that is, the electronic device 101 and the electronic device 102 are the same electronic device. The electronic device 102 has both an image capture function and an image processing function. The electronic device 102 may collect and process image frames to locate itself, to create a virtual map of the surroundings, or to do both. For example, in one particular example, the electronic device 102 may be a robot that captures image frames while moving and processes the captured image frames to create a virtual map of the surroundings, to position itself, or both.
It should be noted that the electronic device may be a terminal or a server, which is not limited in this embodiment of the present invention.
Fig. 2 is a flowchart of an image processing method provided in an embodiment of the present invention, where the method is applied to an electronic device, which may be the electronic device 102 shown in fig. 1, and referring to fig. 2, the method may include the following steps:
201. The electronic device acquires a plurality of image frames.
In the embodiment of the present invention, the electronic device may have an image processing function, and the electronic device may process a plurality of image frames acquired by the electronic device to create a target virtual scene corresponding to the plurality of image frames. Specifically, the electronic device may position a device that acquires the multiple image frames to obtain a target motion trajectory of the device, or may create a target virtual map corresponding to the multiple image frames, or may create the target virtual map and obtain the target motion trajectory, where the target virtual scene may be set by a related technician as needed, which is not limited in the embodiment of the present invention.
In one possible implementation, the plurality of image frames may be transmitted to the electronic device by a device that acquired the plurality of image frames. That is, the image capturing device may capture a plurality of image frames and send the captured image frames to the electronic device, and of course, the image capturing device may also capture a video and send the captured video to the electronic device, so that the electronic device may process the received video to obtain the image frames. Specifically, the electronic device may perform frame-capturing processing on the video, capture the plurality of image frames, or directly extract the plurality of image frames included in the video, which is not limited in the embodiment of the present invention.
In another possible implementation manner, the electronic device may have an image capturing function, and the electronic device may capture a plurality of image frames or capture a video, so as to process the captured video to obtain a plurality of image frames.
In the above two implementation manners, the plurality of image frames may be image frames acquired in real time, image frames acquired and stored in advance, video captured in real time may be processed, and video recorded in advance may also be processed, which is not limited in the embodiment of the present invention. Of course, the plurality of image frames may also be obtained in other manners, for example, may be obtained from an image database, and the obtaining manner of the plurality of image frames is not limited in the embodiment of the present invention.
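For illustration of the frame-capturing option mentioned above, the following is a minimal sketch that extracts image frames from a captured or pre-recorded video using OpenCV; the sampling step and the function name are assumptions for illustration, not part of the patent text:

    import cv2

    def extract_frames(video_path, step=5):
        # Read the video and keep every `step`-th frame as an image frame.
        capture = cv2.VideoCapture(video_path)
        frames = []
        index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if index % step == 0:
                frames.append(frame)
            index += 1
        capture.release()
        return frames

Using step=1 corresponds to directly extracting all image frames contained in the video.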
In a specific example, the image processing method may be applied to the field of automatic driving, and taking as an example that the plurality of image frames are collected by the image collecting device in real time and sent to the electronic device, both the image collecting device and the electronic device may be installed on a vehicle, or the image collecting device may be installed on a vehicle, and the electronic device is any device other than the vehicle, and the device may communicate with an on-vehicle device to control driving of the vehicle. For example, each time the image capturing device captures one image frame, the image frame may be sent to the electronic device, the electronic device processes the image frame, and based on the image frame, the vehicle is positioned or a virtual map around the driving path is created, or both steps are performed, so that path planning and the like may be performed subsequently based on the virtual map and the positioning.
202. The electronic equipment extracts the features of the image frames to obtain feature points of each image frame.
After acquiring the plurality of image frames, the electronic device may process the plurality of image frames. For each image frame, the electronic device may perform feature extraction on each image frame to obtain feature points of each image frame, so that the subsequent electronic device may analyze spatial positions of the feature points, analyze landmark point distribution of surrounding scenes, or analyze positions of devices acquiring the image frames based on the feature points.
In one possible implementation, the electronic device may extract ORB (Oriented FAST and Rotated BRIEF) features for each image frame. An ORB feature consists of two parts: its key point is an "Oriented FAST" point, which is an improved FAST corner, and its descriptor is a "Rotated BRIEF" descriptor, which is an improved BRIEF descriptor. ORB is an efficient visual feature descriptor that contains orientation and rotation information. ORB features have scale invariance, rotation invariance and the like, perform well under translation, rotation and scaling, and are fast to extract and match, so the speed, accuracy and efficiency of processing the image frames can be effectively improved, the accuracy of the obtained target virtual scene is improved, and the efficiency is also high.
Specifically, the step 202 can be implemented by the following steps one and two:
step one, the electronic equipment extracts the corner points of each image frame.
A corner is a point where the local pixel values change significantly. For any pixel of any image frame, when the difference between the pixel value of that pixel and the pixel values in a target area around it is larger than a threshold, the electronic device takes the pixel as an original corner point. The electronic device may obtain the Harris response value of each original corner point and retain the target number of original corner points with the largest responses as the corner points of the image frame. After the corner points are preliminarily extracted, an image pyramid can be established according to a scale factor and the number of pyramid layers, and corner detection is performed on the scaled image of each pyramid layer to obtain the corner points of each image frame, the scaled images of the different pyramid layers having different scales. After obtaining the corner points, the electronic device may obtain direction information of each corner point. Specifically, the direction information may be obtained by the moment method: for any corner point, the electronic device may take, as the direction information of that corner point, the vector that starts at the corner point and ends at the centroid of the image block whose geometric center is the corner point.
The target area may be preset by a relevant technician, for example, the target area may be a circle with the pixel point as a center and a radius of a preset radius, which is not limited in the embodiment of the present invention. The above is only an exemplary illustration, and the process of corner extraction may be implemented by any corner detection algorithm, which is not limited in the embodiment of the present invention.
And step two, the electronic equipment describes the target image area of the extracted corner points.
For any corner point, the electronic device may acquire a target image region of the corner point on an image of a scale corresponding to the corner point, select a pixel point pair from the target image region, rotate the selected point pair according to direction information of the corner point to obtain a rotated pixel point pair, and acquire description information of the corner point according to a pixel value size relationship between two pixel points in the pixel point pair.
For example, as shown in fig. 3, after acquiring an image frame shown in fig. 3, the electronic device may perform feature extraction on the image frame to obtain feature points shown in fig. 4, where as shown in fig. 4, each circle represents a feature point, a center of the circle is a position of the feature point in the image frame, a size of the circle represents a scale of the feature point, and a line segment in the circle is used to represent direction information of the feature.
The first step and the second step are processes of extracting ORB features, and are merely an exemplary illustration of the step 202, and in a possible implementation manner, the electronic device may also obtain other features of the image frame, which is not limited by the embodiment of the present invention.
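For illustration only, the following sketch approximates the two-step extraction above with OpenCV's built-in ORB implementation; the feature count, pyramid scale factor and number of pyramid levels are assumed values, not parameters from the patent:

    import cv2

    def extract_orb_features(image_bgr, n_features=2000, scale_factor=1.2, n_levels=8):
        # Step one: detect FAST-based corner points over an image pyramid and
        # assign each corner a direction by the intensity-centroid (moment) method.
        # Step two: describe the image region around each corner with a rotated
        # BRIEF descriptor. Both steps are handled internally by cv2.ORB.
        gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
        orb = cv2.ORB_create(nfeatures=n_features,
                             scaleFactor=scale_factor,
                             nlevels=n_levels)
        keypoints, descriptors = orb.detectAndCompute(gray, None)
        # Each keypoint carries a position, an octave (scale) and an angle (direction).
        return keypoints, descriptors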
In one possible implementation, when the types of cameras acquiring the plurality of image frames are different, the processing of the plurality of image frames by the electronic device may be different. For example, the type of camera may include a monocular camera, a binocular camera, an RGBD (Red Green Blue Depth) camera, etc., and of course, the type of camera may also include other types, and the embodiments of the present invention are not enumerated herein. After the step 202, when the camera acquiring the image frames is a binocular camera, the electronic device may further match, through a separate thread, feature points of two image frames respectively acquired by a first camera and a second camera in the binocular camera. The depth of the feature point can be obtained by the image frames obtained by two cameras in the binocular camera, and the spatial position of the feature point can be determined based on the depth.
The step of extracting features in step 202 may be implemented by a first thread, and the procedure of binocular matching may be implemented based on a second thread, where the first thread is different from the second thread. Therefore, different steps are carried out through different threads, and the electronic equipment can execute different steps in parallel, so that the image processing speed is increased, and the image processing efficiency is improved.
For example, the binocular camera may be a camera whose internal parameters have been calibrated, and may include a first camera and a second camera. Taking the first camera as the left camera and the second camera as the right camera as an example, the binocular matching process may be: for each feature point of the first image frame acquired by the first camera, acquire the corresponding baseline in the second image frame acquired by the second camera, and match the feature points sequentially along the baseline to obtain the depth of each matched feature point.
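The sketch below illustrates this kind of binocular matching for a calibrated, rectified stereo pair, where the search line for a left-image feature is (approximately) the same image row in the right image and depth follows from the disparity as depth = fx * baseline / disparity; the function and parameter names are assumptions for illustration, not the patent's implementation:

    import cv2

    def stereo_feature_depth(kps_left, desc_left, kps_right, desc_right,
                             fx, baseline, max_row_diff=1.0):
        # Match left-camera descriptors to right-camera descriptors, keep matches
        # lying on (almost) the same rectified row, and recover depth from disparity.
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        depths = {}
        for match in matcher.match(desc_left, desc_right):
            u_left, v_left = kps_left[match.queryIdx].pt
            u_right, v_right = kps_right[match.trainIdx].pt
            if abs(v_left - v_right) > max_row_diff:   # not on the same search line
                continue
            disparity = u_left - u_right
            if disparity > 0:
                depths[match.queryIdx] = fx * baseline / disparity
        return depths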
203. The electronic device matches the feature points of each image frame with feature points of a previous image frame of the each image frame.
In the embodiment of the present invention, the plurality of image frames may be consecutive multi-frame image frames, the pose of the camera changes, and the content in the captured image frame changes, so that the positions of the feature points of different image frames may change. After the electronic equipment acquires the feature points of each image frame, the change condition of the pose of the camera can be determined according to the position change of the feature points by matching the feature points of the adjacent image frames, so that the function of positioning the equipment for acquiring the plurality of image frames can be realized. In addition, by matching the feature points of the adjacent image frames, the spatial position of the feature point can also be determined according to the positions of the feature point in different image frames.
In one possible implementation, the camera pose may be represented by a six-dimensional vector, or by a translation matrix and a rotation matrix, where the translation matrix represents the camera position in the original coordinate system and the rotation matrix represents the rotation angle of the current camera pose in the camera coordinate system.
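For illustration, such a camera pose can be held as a 4x4 homogeneous transform assembled from the rotation matrix and the translation; this is a generic sketch, not a structure taken from the patent:

    import numpy as np

    def make_pose(rotation_3x3, translation_3):
        # Assemble the camera pose from the rotation matrix (orientation) and the
        # translation vector (camera position) as a single homogeneous transform.
        pose = np.eye(4)
        pose[:3, :3] = rotation_3x3
        pose[:3, 3] = translation_3
        return pose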
In a possible implementation manner, when the electronic device performs a matching process on feature points of adjacent image frames, a camera pose of a current image frame may be predicted according to a historical image frame, so that the predicted camera pose is used to guide the current image frame to be matched with feature points of a previous image frame of the current image frame, and a feature matching speed may be increased. Specifically, the matching process can be realized by the following steps one and two:
step one, for each image frame, the electronic equipment predicts the camera pose of each image frame according to the camera pose of a first historical image frame before each image frame.
And step two, the electronic equipment matches the feature point of each image frame with the feature point of the previous image frame of each image frame according to the camera pose of each image frame to obtain the spatial position of the feature point of each image frame.
In the first and second steps, the first history image frame may be one or more image frames prior to each image frame. In a possible embodiment, after the second step, the electronic device may repeatedly execute the first step and the second step according to the matching result, optimize the camera pose according to the spatial position of the feature point obtained in the second step, and re-guide the matching process based on the optimized camera pose, so as to obtain an accurate camera pose and the spatial position of the feature point until a certain condition is met. In the related art, the camera pose of each image frame is usually guessed directly from the previous image frame and the next image frame of each image frame, and the accuracy is poor. In the embodiment of the invention, the camera pose is predicted, so that the feature matching process is guided according to the camera pose, the matching speed can be increased, and the efficiency of the whole image processing flow is improved.
In one possible implementation, this step one may be implemented by a kalman filter. Specifically, the electronic device may process the camera pose of a first historical image frame before each image frame through a kalman filter, so as to obtain the camera pose of each image frame. The electronic device may predict the spatial position of the feature point of each image frame based on the camera pose of each image frame, the camera pose of the previous image frame, and the spatial position of the feature point of the previous image frame, so as to perform matching based on the predicted spatial position and the feature point of each image frame, and the matching process may be to determine the similarity between the feature points, so as to determine whether to match, which is not repeated herein in this embodiment of the present invention.
For example, as shown in fig. 5, for any image frame (the current image frame), the electronic device may process a first historical image frame before the current image frame through a Kalman filter to obtain a predicted value and a prediction covariance, guide feature matching with the predicted value, and optimize the camera pose of the current image frame based on the matching result; the parameter value and prediction covariance of the Kalman filter may then be updated based on the optimized camera pose, and these steps are repeated until the prediction covariance is minimal or the prediction covariance converges.
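A minimal constant-velocity Kalman sketch of the predict/update loop in fig. 5 is given below; for brevity it tracks only the camera position rather than the full pose, and the time step and noise parameters are assumed illustrative values:

    import numpy as np

    class PosePredictor:
        # Constant-velocity Kalman filter: predict the camera position of the next
        # image frame, then correct it with the pose obtained after feature matching.
        def __init__(self, dt=1.0, process_var=1e-2, measure_var=1e-1):
            self.x = np.zeros(6)                      # state: [position, velocity]
            self.P = np.eye(6)                        # prediction covariance
            self.F = np.eye(6)
            self.F[:3, 3:] = dt * np.eye(3)           # position += velocity * dt
            self.H = np.hstack([np.eye(3), np.zeros((3, 3))])
            self.Q = process_var * np.eye(6)          # process noise
            self.R = measure_var * np.eye(3)          # measurement noise

        def predict(self):
            self.x = self.F @ self.x
            self.P = self.F @ self.P @ self.F.T + self.Q
            return self.x[:3]                         # predicted camera position

        def update(self, measured_position):
            y = measured_position - self.H @ self.x
            S = self.H @ self.P @ self.H.T + self.R
            K = self.P @ self.H.T @ np.linalg.inv(S)
            self.x = self.x + K @ y
            self.P = (np.eye(6) - K @ self.H) @ self.P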
In a specific embodiment, in the steps 202 and 203, the electronic device may use different threads to perform feature extraction on the image frames and match the feature points extracted from the image frames, respectively. That is, the electronic device may use different threads to perform the step 202 and the step 203, respectively, so that the step 202 and the step 203 may be performed in parallel during the process of processing a plurality of image frames, so as to save the image processing time and improve the image processing speed and efficiency.
For example, the electronic device can perform step 202 based on a first thread and perform step 203 based on a third thread, the first thread and the third thread being different. In one possible implementation manner of the step 203, the step of binocular matching may be implemented based on a single thread (for example, a second thread), it should be noted that the first thread, the second thread, and the third thread are all different, that is, the steps 202, 203, and the step of binocular matching may all be implemented by using different threads, so that the three steps may all be implemented in parallel.
In the related art, the three steps are usually implemented in the same thread; executing them sequentially takes a long time and directly slows down the whole image processing flow. Tests show that, because the three steps are independent, executing them in parallel can more than double the overall speed of the image processing flow.
The independence of the three steps also facilitates extending the method with other steps; when the image processing method is applied to different scenes, it may include other steps, which can likewise be processed in parallel with these three. For example, the image processing method may be applied to the field of Virtual Reality (VR) and the field of Augmented Reality (AR), and may implement different functions together with an Inertial Measurement Unit (IMU), such as positioning the camera pose in real time, performing three-dimensional reconstruction of images according to the camera pose positioned in real time, rapid modeling for the game and movie industries, providing low-cost three-dimensional models for consumers and for three-dimensional (3D) printing, and the like. In such application scenarios, the integral acquisition step of the IMU may be executed in parallel with the above three steps, thereby increasing the speed of the entire processing flow.
It should be noted that the above steps 202 to 203 are the process in which the electronic device processes the plurality of image frames to obtain the camera pose and the spatial positions of the feature points of each image frame. Because the camera pose of each image frame is obtained by prediction from the camera pose of a historical image frame, it may be inaccurate, so after step 203 the electronic device may further optimize the predicted camera pose of each image frame to obtain a more accurate camera pose.
Specifically, the optimization process may be: the electronic equipment obtains the sum of errors of all the feature points of each image frame, wherein the error of each feature point is used for indicating the difference between the spatial position of each feature point and the spatial position predicted based on the second historical image frame, and the spatial position of each feature point is determined based on the depth of each feature point. And the electronic equipment adjusts the camera pose of each image frame based on the sum of the errors of all the feature points of each image frame, and stops adjusting until the sum of the errors meets the target condition to obtain the optimized camera pose of each image frame.
Wherein the second history image frame may be an image frame before each image frame, and in one possible implementation, the second history image frame may be a first image frame in a plurality of image frames with feature points matching with any feature point. That is, when the electronic device processes a certain image frame after processing a plurality of image frames and acquires an error of a certain feature point in the certain image frame, the image frame from which the electronic device first extracted the feature point may be used as the second history image frame.
Specifically, the electronic device may first obtain the error of each feature point of each image frame, then sum them to obtain the sum of the errors of all the feature points of that image frame, and optimize the camera pose of the image frame with the sum of the errors as the optimization target. The target condition may be set by a relevant technician as required; for example, the sum of the errors reaches a minimum value, or the sum of the errors converges, and the like, which is not limited in the embodiment of the present invention.
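A sketch of this adjustment loop is shown below, using SciPy's least-squares solver as a stand-in (it minimizes the sum of squared per-feature errors until convergence); `feature_error` is a hypothetical callback returning the error of one feature point as defined in steps one to four below:

    import numpy as np
    from scipy.optimize import least_squares

    def optimize_camera_pose(initial_pose, features, feature_error):
        # Adjust the 6-DoF camera pose (rotation + translation parameters) until
        # the sum of the per-feature errors stops decreasing (target condition).
        def residuals(pose):
            return np.concatenate([feature_error(pose, f) for f in features])

        result = least_squares(residuals, initial_pose)
        return result.x                               # optimized camera pose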
For the error of each feature point of each image frame, the electronic device may acquire the error through the following steps one to four:
step one, for any feature point of any image frame, the electronic equipment acquires the spatial position of the matched feature point in a camera coordinate system of a second historical image frame according to the depth of the matched feature point of the any feature point in the second historical image frame.
In the first step, the depth of the feature point may be used to represent the spatial position of the feature point, and for any feature point, the electronic device may obtain the spatial position of the matching feature point in the camera coordinate system of the second history image frame according to the camera parameters and the depth of the matching feature point of the feature point in the second history image frame. So that the spatial position of the any feature point in the current image frame (i.e. the any image frame) can be predicted subsequently according to the relative relationship between the two image frames.
For example, suppose that any image frame is image frame j, the feature point r is a feature point in image frame j, the second historical image frame corresponding to the feature point r is image frame i, and the matching feature point of the feature point r in the second historical image frame is feature point q. For the matching feature point q, the normalized projection of the camera model can be written as π(·); for a pinhole camera model, π(X) = K·X / Z, where K is the camera internal reference matrix and Z is the depth component of the point X. If the depth of the matching feature point q is d, with d greater than or equal to 0, the spatial position of the matching feature point q in the camera coordinate system of the second historical image frame can be written as X_q = d·π⁻¹(q), where q here denotes the image coordinates of the matching feature point.
In the related art, the reprojection error is generally adopted as the optimization objective function in the optimization process. The reprojection error is mainly applicable to a pinhole camera model, and the spatial position of a feature point generally needs to be represented by three parameters, that is, by three-dimensional coordinates (X, Y and Z), whereas in the embodiment of the invention the spatial position is represented by the depth alone.
And secondly, the electronic equipment acquires the predicted relative direction of any feature point relative to the camera position of any image frame according to the spatial position of the matched feature point in the camera coordinate system of the second historical image frame, the camera pose of the second historical image frame and the relative camera pose of any image frame relative to the second historical image frame.
When the predicted quantity is compared with the real (measured) quantity, the relative direction of the feature point with respect to the camera position can be used to measure the error of the spatial position of the feature point. In the second step, the electronic device may convert the above matching feature point from the second historical image frame to the any image frame according to the relative relationship between the any image frame and the second historical image frame, so as to obtain the predicted quantity of the any feature point in the any image frame.
Specifically, the electronic device may acquire the predicted spatial position of the any feature point in the camera coordinate system of the any image frame according to the spatial position of the matched feature point in the camera coordinate system of the second history image frame, the camera pose of the second history image frame, and the relative camera pose of the any image frame with respect to the second history image frame. The electronic device may acquire a predicted relative direction of the any feature point with respect to the camera position of the any image frame based on the predicted spatial position of the any feature point in the camera coordinate system of the any image frame.
For example, it is assumed that any image frame is image frame j, the feature point r is a feature point in image frame j, the second historical image frame corresponding to the feature point r is image frame i, and the matching feature point of the feature point r in the second historical image frame is feature point q. Let the camera pose of image frame j be P_j and the relative camera pose of image frame i with respect to image frame j be P_i. The spatial position X_q of the matching feature point q has been obtained in the first step, so the coordinate of the matching feature point q in the camera coordinate system of image frame j, denoted X_j, is obtained by transforming X_q from the camera coordinate system of image frame i into that of image frame j according to P_j and P_i; that is, X_j is the predicted spatial position of the feature point r in the camera coordinate system of image frame j. The electronic device may determine the predicted relative direction of the feature point r with respect to the camera position according to this coordinate; specifically, X_j may be normalized to obtain the normalized direction p = X_j / ||X_j||. Here the camera position is the origin of the camera coordinate system of image frame j, and the direction p is the radial direction from the camera center of image frame j to the feature point r, i.e., the predicted relative direction of the feature point r with respect to the camera position of image frame j.
And thirdly, the electronic equipment acquires the relative direction of the any characteristic point relative to the camera position of the any image frame.
Through the first step and the second step, the electronic device obtains the predicted relative direction of any feature point in any image frame through the spatial position of the matched feature point of any feature point and the relative relation of the two image frames, and then the electronic device can also obtain the real relative direction of any feature point in any image frame, so that whether the currently determined spatial position of the feature point is accurate or not and whether the current spatial position of the feature point is consistent with the camera pose change condition between the two image frames or not can be compared.
Through the above feature matching process, the electronic device can obtain the spatial position of any feature point after matching any image frame, and the electronic device can obtain the relative direction of any feature point with respect to the camera position of any image frame according to the spatial position of any feature point.
For example, it is assumed that any image frame is image frame j, the feature point r is a feature point in image frame j, the second historical image frame corresponding to the feature point r is image frame i, and the matching feature point of the feature point r in the second historical image frame is feature point q. Normalizing the coordinates of the feature point r observed in image frame j gives the relative direction of the feature point r with respect to the camera center of image frame j, denoted p̄.
It should be noted that, the first step and the second step are processes of obtaining predicted relative directions, the third step is a process of obtaining relative directions, the two processes may be performed simultaneously, or a predicted relative direction may be obtained first and then a relative direction may be obtained, or a relative direction may be obtained first and then a predicted relative direction may be obtained, that is, the first step and the second step are integrated to be a comprehensive step, and the electronic device may perform the comprehensive step and the third step simultaneously, or may perform the comprehensive step first and then the third step, or may perform the third step first and then the comprehensive step.
And step four, the electronic device takes the difference between the projection positions of the predicted relative direction and the relative direction in a plane perpendicular to the relative direction as the error of the any feature point.
After the predicted relative direction and the relative direction are obtained, the electronic device may obtain the error of the any feature point by comparing the predicted quantity with the actual quantity. Specifically, when acquiring the error, the two directions may be projected onto the same plane, and the difference between the two projection positions is taken as the error. In one possible implementation, the projection plane may be a plane perpendicular to the relative direction; this plane can be regarded as a tangent plane of a sphere, and thus the error may be referred to as a spherical error.
For example, the electronic device may project the predicted relative direction p obtained in the second step and the relative direction p̄ obtained in the third step onto the plane perpendicular to p̄, and take the difference between the two projection positions as the error of the feature point r.
The first step to the fourth step provide a method for acquiring the errors of the feature points in which the depth is used to represent the spatial position; this reduces memory consumption, speeds up the optimization, and is applicable to various types of cameras, improving the adaptability of the optimization process (a sketch of this error computation follows below). Of course, the error of the feature point may also be a reprojection error or another error, which is not limited in this embodiment of the present invention.
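The sketch below puts steps one to four together for a pinhole camera model under stated assumptions: K is the camera internal reference matrix, T_ji is a 4x4 transform taking coordinates from the camera coordinate system of image frame i to that of image frame j (computed elsewhere from the camera poses), and pixel coordinates are given as (u, v); it is an illustration of the tangent-plane error, not the patent's exact implementation:

    import numpy as np

    def direction_error(q_pixel, r_pixel, depth, K, T_ji):
        # Step one: spatial position of the matching feature point q in the
        # camera coordinate system of the second historical image frame i.
        K_inv = np.linalg.inv(K)
        X_i = depth * (K_inv @ np.array([q_pixel[0], q_pixel[1], 1.0]))

        # Step two: transfer to the camera coordinate system of image frame j and
        # normalize to obtain the predicted relative direction p.
        X_j = (T_ji @ np.append(X_i, 1.0))[:3]
        p = X_j / np.linalg.norm(X_j)

        # Step three: observed relative direction of feature point r in frame j.
        x_r = K_inv @ np.array([r_pixel[0], r_pixel[1], 1.0])
        p_bar = x_r / np.linalg.norm(x_r)

        # Step four: project both directions onto the plane perpendicular to p_bar
        # and take the difference of the projection positions as the error.
        tangent_projector = np.eye(3) - np.outer(p_bar, p_bar)
        return tangent_projector @ p - tangent_projector @ p_bar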
204. For any image frame, the electronic device acquires a plurality of pieces of position difference information corresponding to the image frame, wherein the plurality of pieces of position difference information are used for representing the position difference of matching feature points of the image frame and a key frame before the image frame.
Through the above steps 201 to 203, the electronic device acquires a plurality of image frames and processes them to obtain image data of each image frame, for example, data such as the spatial positions of the feature points and the camera pose of the image frame, so that when a target virtual scene is to be created subsequently, this image data can be referred to in order to obtain the target virtual scene corresponding to the plurality of image frames.
The electronic device may obtain a plurality of key frames from the plurality of image frames, the key frames being representative and critical in the plurality of image frames, and extract the key frames from the plurality of image frames to obtain sufficient image data for further processing to create the target virtual scene.
For the plurality of image frames, the electronic device may use a first image frame as a key frame, and for any subsequent image frame (current image frame), the electronic device may obtain a plurality of pieces of position difference information of matching feature points of the current image frame and a previous key frame, so as to determine whether to use the image frame as a key frame.
The position difference information of the feature point matched with the previous key frame of any image frame may include multiple types, that is, multiple influencing factors may be considered when the key image frame is acquired. Three position difference information items are provided below, and the plurality of items of position difference information in step 204 may include at least two items of position difference information. The three kinds of position difference information are explained below.
Position difference information one: a first number of matching feature points for the any image frame and the previous keyframe.
For two adjacent key frames, a sufficient number of matching feature points is required so that the camera positions corresponding to the two key frames do not change too much; in that case, when the two key frames are processed, the position difference of the matching feature points is small, and the change of the camera position and the change of the same landmark point between the two key frames can be obtained accurately. Conversely, if the number of matching feature points of two adjacent key frames is small and the position difference of the matching feature points is large, positioning through the two key frames may fail and the change of the camera position cannot be obtained, that is, a target virtual scene with high accuracy cannot be obtained. Therefore, when determining whether any image frame is to be acquired as a key frame, the first position difference information may be considered, and it may subsequently be determined whether the first position difference information meets the accuracy requirement of scene creation.
Then, in step 204, for any image frame, the electronic device may obtain a first number of matching feature points of the any image frame and the previous key frame, and use the first number as an index for measuring the accuracy of scene creation.
Position difference information two: the ratio of the first number of matched feature points of any image frame and the previous key frame to the second number of matched feature points of the previous key frame, wherein the second number is the number of the feature points of the previous key frame.
The position difference information may also be a relative attenuation rate of the matched feature points, where the relative attenuation rate is a ratio of a first number of the matched feature points of any image frame and a previous key frame to a second number of the feature points of the previous key frame. When the ratio is small, the position difference of the matched feature points is small, which indicates that the pose changes of the two image frame cameras are not very large, and the spatial position of the matched feature points does not change too much, otherwise, when the ratio is large, the position difference of the matched feature points is large, which indicates that the pose changes greatly of the two image frame cameras and the spatial position of the matched feature points changes greatly.
Similarly to the above position difference information, if the ratio of two adjacent key frames is too large, the position difference of the matching feature points of the two adjacent key frames may be large, and the positioning may fail through the two key frames, so that the change condition of the camera position cannot be obtained, that is, the target virtual scene with high accuracy cannot be obtained. Therefore, when judging whether any image frame needs to be acquired as a key frame, the position difference information II can be considered, and whether the position difference information II meets the accuracy requirement of scene creation can be judged subsequently. Then, in step 204, for any image frame, the electronic device may obtain a ratio of a first number of matched feature points of the any image frame and the previous key frame to a second number of matched feature points of the previous key frame.
Position difference information three: a score of the any image frame, the score being determined based on the baseline length between the any image frame and the previous key frame, and the resolutions of the matching feature points of the any image frame and the previous key frame in the two image frames, respectively.
For any image frame, the electronic device can acquire a score of the image frame to represent whether the image frame is suitable as a key frame, wherein the larger the score is, the smaller the possible position difference of the matched feature points is, and the more suitable the image frame is as the key frame. Then for any image frame, the electronic device may obtain a score for that image frame in this step 204.
The score of the image frame takes into account the baseline length between the image frame and the previous key frame and the resolution of the matching feature points. It can be understood that the larger the baseline length, the more the camera positions of the two image frames may have changed, and the larger the position difference of the matching feature points may be; the smaller the resolution of a matching feature point, the less representative that matching feature point is, and the less meaningful it is to determine the position difference using it.
In one possible implementation, the score of each image frame may be determined as follows: for any matching feature point among the at least one matching feature point of the any image frame and the previous key frame, the electronic device obtains the normal distribution value of the baseline length corresponding to that matching feature point; the electronic device obtains the product of this normal distribution value and the minimum of the resolutions of the matching feature point in the two image frames; and the electronic device performs weighted summation of the products over the at least one matching feature point to obtain the score of the any image frame.
Specifically, the baseline length and resolution may be calculated values, or may be characterized by other parameters. In one possible embodiment, the baseline length corresponding to any one of the matching feature points is characterized by an included angle formed by a projection position of the any one of the matching feature points in any one of the image frames and a first connecting line direction and a second connecting line direction of the any one of the matching feature points, and the second connecting line direction is the connecting line direction of the projection position of the any one of the matching feature points in the previous keyframe and the any one of the matching feature points; the resolution of any matching feature point is characterized by the distance between the any matching feature point and the camera position of any image frame or the distance between the any matching feature point and the camera position of the previous key frame.
For example, the determination may be implemented using the following formulas:

f(C_i) = Σ_p g(∠C_{i-1} p C_i) · min( d(p, C_{i-1}), d(p, C_i) )

g(x) = (1 / (σ·√(2π))) · e^( -(x - μ)² / (2σ²) )

wherein f(C_i) is the score of the current image frame; p is the spatial position of a matching feature point p of the current image frame and the previous key frame; ∠C_{i-1} p C_i is the included angle at the matching feature point p between its connecting lines to the camera positions of the previous key frame and of the current image frame, which is greater than or equal to 0 and is used to characterize the baseline length; d(p, C_{i-1}) is the distance between the matching feature point p and the camera position C_{i-1} of the previous key frame; d(p, C_i) is the distance between the matching feature point p and the camera position C_i of the current image frame, which is greater than or equal to 0 and is used to characterize the resolution; g(x) is a normal distribution function, e is a natural constant, and x is the independent variable, which in the formula for f(C_i) is ∠C_{i-1} p C_i; μ is the mean of the independent variable, σ is the standard deviation, and Σ denotes summation over the matching feature points.
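A sketch of this score computation is given below; the mean and standard deviation of the normal distribution are assumed tuning parameters, not values from the patent:

    import numpy as np

    def keyframe_score(points, cam_prev, cam_cur,
                       mu=np.deg2rad(10.0), sigma=np.deg2rad(5.0)):
        # f(C_i): for each matched point p, multiply the normal distribution value of
        # the angle C_{i-1} p C_i by the smaller of the two point-to-camera distances.
        def g(x):
            return np.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * np.sqrt(2.0 * np.pi))

        score = 0.0
        for p in points:
            v_prev = np.asarray(cam_prev) - np.asarray(p)
            v_cur = np.asarray(cam_cur) - np.asarray(p)
            d_prev = np.linalg.norm(v_prev)
            d_cur = np.linalg.norm(v_cur)
            cos_angle = np.clip(v_prev @ v_cur / (d_prev * d_cur), -1.0, 1.0)
            score += g(np.arccos(cos_angle)) * min(d_prev, d_cur)
        return score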
As described above, the electronic device can obtain at least two of the three kinds of position difference information as the multiple items of position difference information, so as to determine whether the accuracy requirement of scene creation is met. It should be noted that only three kinds of position difference information are provided here; the multiple items of position difference information may further include other position difference information, and the score of a key frame may also be determined in other manners. For example, the multiple items of position difference information may further include the depths obtained in the two image frames for the feature points matched between any image frame and the previous key frame, which is not limited by the embodiment of the present invention.
205. When any one of the plurality of items of position difference information meets the accuracy requirement of scene creation, the electronic equipment acquires any image frame as a key frame.
After the electronic device acquires the plurality of pieces of position difference information, whether each piece of position difference information meets the accuracy requirement of scene creation or not can be judged, and when any piece of position difference information meets the accuracy requirement, the electronic device can acquire any image frame as a key frame.
In a possible implementation manner, the three pieces of position difference information may be considered in step 204, that is, the multiple pieces of position difference information may include the three pieces of position difference information, and accordingly, in step 205, for any image frame, the electronic device may determine the three pieces of position difference information, and when any image frame meets the accuracy requirement of scene creation, the electronic device may acquire the image frame as a key frame. Of course, any two items of the three pieces of position difference information may also be considered in step 204, and accordingly, in step 205, when any image frame is determined, it may also be determined whether the two items meet the accuracy requirement of scene creation, and when any item meets the accuracy requirement of scene creation, the electronic device may acquire the any image frame as a key frame. The embodiment of the invention does not limit the specific adopted position difference information, and does not limit the accuracy requirement of the scene creation.
In a specific embodiment, each of the three kinds of position difference information may correspond to its own accuracy requirement of scene creation. The accuracy requirement corresponding to each kind of position difference information is described below.
For the first position difference information, a number threshold may be set for the first number of matching feature points of the image frame and the previous key frame, and the accuracy requirement of scene creation met by the first position difference information may be: the first number of matching feature points of the image frame and the previous key frame is less than the number threshold. That is, step 205 may be: when the first number of matching feature points of the image frame and the previous key frame is smaller than the number threshold, the electronic device acquires the image frame as a key frame. The number threshold may be set by a person skilled in the relevant art according to requirements, and is not limited by the embodiment of the present invention. When the first number is smaller than the number threshold, it indicates that a key frame needs to be added; otherwise, the next image frame continues to be judged. If no key frame were added, the first number of matching feature points between the next image frame and the previous key frame would be even smaller and the position difference of the matching feature points even larger, which may cause positioning failure and loss of the target.
For the second position difference information, a ratio threshold may be set for the ratio, and the accuracy requirement of scene creation met by the second position difference information may be: the ratio of the first number of matching feature points of the image frame and the previous key frame to the second number is larger than the ratio threshold. That is, step 205 may be: when the ratio of the first number of matching feature points of the image frame and the previous key frame to the second number is larger than the ratio threshold, the electronic device acquires the image frame as a key frame. The ratio threshold may be set by a person skilled in the relevant art as needed, and is not limited by the embodiment of the present invention. If the ratio obtained from the current image frame and the previous key frame is larger than the ratio threshold, it indicates that a key frame needs to be added; otherwise, the next image frame continues to be judged. If no key frame were added, the relative attenuation rate (the ratio) corresponding to the next image frame would be even larger and the position difference of the matching feature points even larger, which may cause positioning failure and loss of the target.
For the third position difference information, a score threshold may be set for the score of the image frame, and the accuracy requirement of scene creation met by the third position difference information may be: the score of the image frame is greater than the score threshold. That is, step 205 may be: when the score of the image frame is larger than the score threshold, the electronic device acquires the image frame as a key frame. The score threshold may be set by a person skilled in the relevant art as needed, which is not limited by the embodiment of the present invention.
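For illustration, a minimal sketch of the key frame decision in step 205 is given below, combining the three criteria with an "any one is sufficient" rule; the threshold values and function names are placeholders, not values given by the embodiment.

```python
def is_keyframe(num_matches, num_prev_kf_features, frame_score,
                count_thresh=50, ratio_thresh=0.5, score_thresh=1.0):
    """Accept the current image frame as a key frame when any single
    criterion meets the accuracy requirement of scene creation."""
    if num_matches < count_thresh:                          # first position difference information
        return True
    if num_matches / num_prev_kf_features > ratio_thresh:   # second position difference information
        return True
    if frame_score > score_thresh:                          # third position difference information
        return True
    return False
```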
It should be noted that the above describes, for each of the three kinds of position difference information, the corresponding accuracy requirement of scene creation; if the electronic device acquires other position difference information in step 204, step 205 may likewise involve setting an accuracy requirement of scene creation corresponding to that other position difference information, which is not limited by the embodiment of the present invention.
206. For any one of the acquired multiple key frames, the electronic device creates a first local virtual scene corresponding to the any key frame based on the feature point of the any key frame and other key frames including the feature point of the any key frame.
After the electronic device acquires the multiple key frames, the image data of the multiple key frames can be used as a data basis for creating the target virtual scene. The electronic device may create a local virtual scene first, and then synthesize a plurality of local virtual scenes to obtain an overall target virtual scene. The step 206 may be executed each time a key frame is acquired to create a first local virtual scene corresponding to the key frame acquired this time, and the step 206 may also be executed after all key frames are acquired, which is not limited in the embodiment of the present invention.
When creating the first local virtual scene corresponding to each key frame, for any key frame, the electronic device may use the feature points of the key frame as the basis for creating the first local virtual scene, and use other key frames that include those feature points as key frames associated with the key frame. The fact that the other key frames include the feature points indicates that the camera poses of the key frame and the other key frames do not differ greatly, and the landmark points in the other key frames can be considered to be located in the vicinity of the landmark points in the key frame. Thus, the key frame and the other key frames can be used to create the first local virtual scene corresponding to the key frame.
Specifically, the first local virtual scene may include at least one of a first local virtual map and a first local motion trajectory. The electronic device may obtain the spatial position of each landmark point in the first local virtual map based on the spatial positions of the feature points of the key frame and the other key frames, so as to obtain the first local virtual map corresponding to the key frame; the spatial position of each landmark point in the first local virtual scene corresponds to the spatial position of a feature point. The electronic device may acquire a first local motion trajectory of the device that acquired the plurality of image frames based on the camera poses of the key frame and the other key frames. Which kind of first local virtual scene is obtained in step 206 may be set as required, which is not limited by the embodiment of the present invention.
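A minimal sketch of assembling the first local virtual scene is given below; the key frame data structure (a set of feature identifiers, a landmark dictionary, and a camera pose) is an assumption made for illustration.

```python
def build_local_scene(key_frame, all_key_frames):
    """Collect the key frames that share feature points with `key_frame`
    (its associated key frames), then gather landmark positions and camera
    poses to form the first local virtual map and first local motion trajectory."""
    covisible = [kf for kf in all_key_frames
                 if kf is not key_frame and kf.feature_ids & key_frame.feature_ids]
    local_frames = [key_frame] + covisible
    local_map = {fid: pos                              # landmark id -> 3D spatial position
                 for kf in local_frames
                 for fid, pos in kf.landmarks.items()}
    local_trajectory = [kf.camera_pose for kf in local_frames]
    return local_map, local_trajectory
```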
207. The electronic device optimizes the first local virtual scene of any key frame and other key frames to obtain a second local virtual scene corresponding to any key frame.
After the first local virtual scene is obtained, the electronic device may further optimize the first local virtual scene to obtain a more accurate second local virtual scene. Similarly, the second local virtual scene may include at least one of a second local virtual map and a second local motion trajectory. The electronic device may optimize at least one of the first local virtual map and the first local motion trajectory to obtain at least one of a second local virtual map and a second local motion trajectory corresponding to the any one of the keyframes.
When the first local virtual scene is optimized, the optimization may be performed in the same manner as the optimization of a single image frame in step 203, except that step 203 optimizes a single image frame in order to optimize the camera pose of the current image frame, whereas in step 207 the optimization target includes multiple key frames, namely the key frame and the other key frames, so that the spatial positions of the feature points and the camera poses of the key frame and the other key frames can both be optimized, thereby improving the accuracy of the first local virtual scene.
Specifically, the optimization process may be: the electronic equipment acquires the sum of errors of all feature points of any key frame and other key frames, wherein the error of each feature point is used for indicating the difference between the spatial position of each feature point and the spatial position predicted based on the second historical image frame, and the spatial position predicted based on the second historical image frame is determined based on the depth of the matched feature point of each feature point. And the electronic equipment adjusts the camera poses of any key frame and other key frames and the depths of all the feature points based on the sum of the errors of all the feature points of any key frame and other key frames, and stops adjusting until the sum of the errors meets a target condition to obtain a second local virtual scene corresponding to any key frame.
The process of obtaining the error of any feature point is the same as the error obtaining process shown in the first to fourth steps in step 203, and is not repeated here in the embodiment of the present invention. Steps 206 and 207 are processes for obtaining local information, which may be implemented by a separate thread; for example, steps 206 and 207 may be executed based on a fourth thread.
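For illustration, the joint adjustment of camera poses and feature depths described in step 207 can be sketched as a generic nonlinear least-squares problem; the use of SciPy and the flat parameter layout are assumptions, and the per-feature error function is left to the caller because it follows the error definition of step 203.

```python
import numpy as np
from scipy.optimize import least_squares

def optimize_local_scene(poses0, depths0, residual_fn, tol=1e-6):
    """Refine the camera poses of the key frame and its associated key frames,
    together with the depths of all feature points, by minimizing the sum of
    per-feature errors. `residual_fn(params)` must return the stacked errors
    for a flat parameter vector [poses..., depths...]."""
    x0 = np.concatenate([np.ravel(poses0), np.ravel(depths0)])
    result = least_squares(residual_fn, x0, ftol=tol)   # stops adjusting when the error sum converges
    n_pose = np.size(poses0)
    poses = result.x[:n_pose].reshape(np.shape(poses0))
    depths = result.x[n_pose:].reshape(np.shape(depths0))
    return poses, depths
```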
In one possible implementation, after step 207, the electronic device may further perform relocation detection, also referred to as closed-loop detection, to detect whether the device acquiring the plurality of image frames has returned to a location it passed historically. When it is determined from a plurality of consecutive key frames that the distance between the location of the device acquiring the plurality of image frames and the location of the device in the historical image frames is less than a distance threshold, the electronic device may fuse the feature points of the plurality of consecutive key frames with the feature points of the historical image frames. Specifically, the fusion process may be: the electronic device averages the spatial positions of the matching feature points of the plurality of consecutive key frames and the historical image frames to obtain the spatial positions of the fused matching feature points. In one possible embodiment, the electronic device may perform the relocation detection and fusion steps based on a fifth thread, and the fifth thread may be different from the foregoing four threads; that is, the relocation detection and fusion process may be implemented by a separate thread.
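The fusion by averaging mentioned above can be sketched as follows; representing the matched points as dictionaries keyed by a point identifier is an assumption for illustration.

```python
import numpy as np

def fuse_relocated_points(current_points, historical_points):
    """After closed-loop (relocation) detection, fuse each matching feature point
    by averaging its spatial position in the consecutive key frames with its
    position in the historical image frame."""
    fused = {}
    for pid, pos in current_points.items():
        if pid in historical_points:
            fused[pid] = (np.asarray(pos, dtype=float)
                          + np.asarray(historical_points[pid], dtype=float)) / 2.0
        else:
            fused[pid] = np.asarray(pos, dtype=float)   # point not observed historically, keep as is
    return fused
```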
In a possible embodiment, the relocation detection process may be further implemented by a visual vocabulary tree, wherein the visual vocabulary tree may be established based on the image data obtained in the feature extraction step and the binocular matching step in step 202, and in a possible implementation manner, the feature matching in step 203 may be further implemented based on the visual vocabulary tree, so that the feature matching speed may be further increased.
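As an illustration of vocabulary-tree-based relocation detection, one common approach represents each frame as a bag-of-visual-words histogram and scores similarity between frames; the L1-based score and the threshold below are assumptions, not the specific scoring used in the embodiment.

```python
import numpy as np

def bow_similarity(hist_a, hist_b):
    """Similarity between two bag-of-visual-words histograms, each obtained by
    quantizing a frame's feature descriptors with the visual vocabulary tree."""
    a = np.asarray(hist_a, dtype=float); a /= (a.sum() + 1e-12)
    b = np.asarray(hist_b, dtype=float); b /= (b.sum() + 1e-12)
    return 1.0 - 0.5 * np.abs(a - b).sum()

def relocation_candidates(current_hist, keyframe_hists, threshold=0.3):
    """Return the indices of historical key frames similar enough to the current
    frame to be treated as relocation (closed-loop) candidates."""
    return [i for i, h in enumerate(keyframe_hists)
            if bow_similarity(current_hist, h) > threshold]
```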
In a possible implementation manner, after the repositioning fusion is performed, the spatial positions of the matching feature points of the plurality of consecutive key frames and the historical image frame are adjusted. According to the adjusted matching feature points, the camera poses of the plurality of key frames may no longer be consistent with the relative relationship between preceding and following image frames, so the camera poses of the plurality of key frames can be corrected accordingly, making the camera poses of the currently acquired key frames more accurate.
208. The electronic equipment creates an initial virtual scene corresponding to the plurality of key frames based on the second local virtual scene corresponding to the plurality of key frames.
After obtaining the accurate second local virtual scene, the electronic device may synthesize the second local virtual scenes corresponding to the multiple key frames, and create an initial virtual scene corresponding to the multiple key frames, where the initial virtual scene is a global virtual scene corresponding to the multiple image frames acquired in step 201. Similarly, the initial virtual scene may include at least one of an initial virtual map and an initial motion trajectory.
Specifically, the data on which the plurality of second local virtual scenes corresponding to the plurality of key frames are based may include the spatial position of the same feature point or the camera pose of the same key frame, and these values may be the same or different across different second local virtual scenes. Therefore, when the initial virtual scene is created, the second local virtual scenes may be further combined. The combination may adopt an averaging process or other processes, which is not limited by the embodiment of the present invention.
209. The electronic equipment optimizes the initial virtual scenes of the plurality of key frames to obtain a target virtual scene.
Similarly, after the initial virtual scene is obtained, the electronic device needs to integrate all the image data for overall optimization, so as to obtain a complete and accurate target virtual scene. The target virtual scene may include at least one of a target virtual map and a target motion trajectory.
Step 209 performs the same optimization process as steps 207 and 203, except that step 209 optimizes the initial virtual scene corresponding to all the key frames, step 207 optimizes the local virtual scene corresponding to part of the key frames, and step 203 optimizes the image data corresponding to a single image frame; steps 209 and 207 adjust both the spatial positions of the feature points and the camera poses, whereas step 203 optimizes only the camera pose of a single image frame.
Specifically, the optimization process may be: the electronic device obtains the sum of the errors of all the feature points of the plurality of key frames. And the electronic equipment adjusts the camera poses of the plurality of key frames and the depths of all the feature points based on the sum of the errors of all the feature points of the plurality of key frames, and stops adjusting until the sum of the errors meets a target condition to obtain a target virtual scene corresponding to the plurality of image frames.
The process of obtaining the error of any feature point is the same as the error obtaining process shown in the first to fourth steps in step 203, and the details of the embodiment of the present invention are not repeated herein.
Steps 208 and 209 are processes for acquiring the target virtual scene. In one possible implementation, these processes may be implemented by a separate thread; for example, steps 208 and 209 may be implemented based on a sixth thread, which is different from the foregoing five threads.
It should be noted that, in the above step 206 to step 209, a process of creating a target virtual scene based on the obtained multiple key frames is performed, and in this process, the electronic device may perform at least one of the following steps one and two:
In the first step, the electronic device obtains the spatial position of each landmark point in the target virtual scene based on the acquired spatial positions of the feature points of the plurality of key frames, so as to obtain the target virtual map.
In the second step, the electronic device acquires the target motion trajectory of the device that acquired the plurality of image frames based on the acquired camera poses of the plurality of key frames. Specifically, the electronic device may acquire, based on the camera poses of the acquired key frames, the position of the device at the time each key frame was acquired, so that the target motion trajectory of the device can be obtained from the plurality of key frames.
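A minimal sketch of the second step is given below; it assumes each camera pose is given as a 4x4 world-from-camera matrix, so the device position at the time of each key frame is the translation part of the pose.

```python
import numpy as np

def trajectory_from_poses(camera_poses):
    """Stack the device position at each key frame (the translation column of a
    4x4 world-from-camera pose matrix) to obtain the target motion trajectory."""
    return np.stack([np.asarray(T)[:3, 3] for T in camera_poses])
```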
For example, the electronic device acquires a plurality of image frames, processes the image frames, extracts a plurality of key frames, and creates a target virtual scene based on the plurality of key frames. Taking the application of the image processing method in the field of automatic driving as an example, the plurality of image frames can be acquired by an image acquisition device installed on a vehicle, the image acquisition device sends the acquired plurality of image frames to an electronic device with an image processing function, and the electronic device can perform the processing process on the plurality of image frames to create a target virtual map, can also position the vehicle to obtain a target motion track, and can also create the target virtual map and obtain the target motion track. Taking as an example that the target motion trajectory of the vehicle is obtained, and a target virtual map around the driving path of the vehicle is created, as shown in fig. 6, the motion trajectory of the vehicle and a sparse point cloud map may be obtained in step 208, and four specific motion trajectories and sparse point cloud maps are shown in fig. 6. Of course, the step 208 may output only the motion trajectory of the vehicle, or may output only the sparse point cloud map. The motion track of the vehicle is a target motion track, and the sparse point cloud map is a target virtual map. The motion track and the sparse point cloud map are the target virtual scene.
Next, a specific flow of the image processing method is described with a specific example. As shown in fig. 7, the electronic device may implement the image processing method with six separate threads. When a new image frame is acquired, the electronic device may perform feature point extraction on the image frame using thread one, corresponding to steps 201 and 202; if the camera is a binocular camera, the electronic device may perform binocular matching using thread two, corresponding to the binocular matching process in step 201.
After the above processes of feature extraction and binocular matching, the electronic device may perform a pose initial estimation step, that is, predict the camera pose, and then perform feature point matching under the guidance of the predicted camera pose, corresponding to step 203. After matching, the electronic device may perform single-frame pose optimization, that is, optimize the camera pose of a single image frame, and determine whether the image frame is a key frame based on the optimized data, which is the key frame selection process, corresponding to steps 204 and 205. The pose initial estimation, guided feature point matching, single-frame pose optimization, and key frame selection processes may be implemented based on thread three.
The electronic device may create a local scene based on thread four and optimize the created local scene. It should be noted that, compared with the sliding window commonly used in real-time localization and mapping systems, the local scene can adaptively adjust the size of the local problem, taking both efficiency and accuracy into account.
After the local scene optimization step, the electronic device may perform a repositioning detection and repositioning fusion step based on the visual vocabulary tree, that is, the content shown in step 207, the repositioning detection and fusion process may be implemented based on thread five, and the electronic device may optimize all the key frames based on thread six, that is, optimize the global scene, to obtain a final target virtual scene.
In the image processing method, the camera pose is estimated before feature matching and the key frame selection algorithm is more accurate, so the robustness of the image processing method can be improved. The module design is more fine-grained: the feature extraction and binocular matching processes are independent, and the use of separate threads improves the speed and efficiency of image processing, can effectively improve its real-time performance, and also makes it convenient to extend to other sensors, so the applicability of the image processing method is better.
As shown in fig. 8, in the image processing method, a video or a continuous image may be acquired by a front-end sensor and sent to a back end, and the back end may perform the image processing procedure on the video or the continuous image to obtain at least one of a camera pose (motion track) and a map, and may be used for display by a front-end browser, or for automatic driving control at the back end, or for creating a virtual scene in the game and movie industry, and the like.
The target virtual scene obtained in step 209 is data in a target coordinate system, where the target coordinate system takes as its origin the position of the device acquiring the image frames at the time of the first image frame. If the target virtual scene in the world coordinate system, or in some other specific coordinate system, needs to be obtained, the target virtual scene may be converted based on the relationship between that coordinate system and the target coordinate system. The conversion process may be set by a person skilled in the relevant art according to requirements, and is not limited by the embodiment of the present invention.
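For illustration, converting the landmark positions of the target virtual scene from the target coordinate system into the world coordinate system (or any specific coordinate system) can be sketched as applying a known rigid transform; the 4x4 homogeneous-matrix representation is an assumption.

```python
import numpy as np

def convert_scene_to_world(points_target, T_world_from_target):
    """Convert landmark positions from the target coordinate system (origin at the
    device position at the first image frame) into the world coordinate system."""
    pts = np.asarray(points_target, dtype=float)           # shape (N, 3)
    homo = np.hstack([pts, np.ones((pts.shape[0], 1))])    # shape (N, 4)
    return (homo @ np.asarray(T_world_from_target).T)[:, :3]
```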
In the embodiment of the present invention, when key frames are acquired from a plurality of image frames and whether any image frame is acquired as a key frame is judged, multiple pieces of position difference information of the feature points matched between the image frame and the previous key frame before it are considered; when any one of them meets the accuracy requirement of scene creation, the image frame is acquired as a key frame, so that enough key frames are acquired. This can avoid the problem that the created virtual scene is inaccurate due to failure in positioning the device that acquires the image frames or the landmark points, so the accuracy of the target virtual scene obtained by the image processing method provided by the embodiment of the present invention is good.
All the above optional technical solutions may be combined arbitrarily to form optional embodiments of the present invention, and are not described in detail herein.
Fig. 9 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present invention, and referring to fig. 9, the apparatus includes:
an image obtaining module 901, configured to obtain a plurality of image frames;
an information obtaining module 902, configured to obtain, for any image frame, multiple pieces of position difference information corresponding to the image frame, where the multiple pieces of position difference information are used to indicate position differences of matching feature points of the image frame and a previous key frame of the image frame;
the image obtaining module 901 is further configured to obtain any image frame as a key frame when any of the multiple items of position difference information meets the accuracy requirement of scene creation;
a scene creating module 903, configured to create a target virtual scene based on the acquired multiple key frames, where the target virtual scene is used to represent a scene corresponding to the multiple image frames.
In one possible implementation, the information obtaining module 902 is configured to perform at least two of the following:
for any image frame, acquiring a first number of matched feature points of the image frame and the previous key frame;
for any image frame, acquiring the ratio of a first number to a second number of matched feature points of the any image frame and the previous key frame, wherein the second number is the number of the feature points of the previous key frame;
for any image frame, acquiring a score of the any image frame, wherein the score is determined based on the baseline lengths of the any image frame and the previous key frame and the resolutions of the matched feature points of the any image frame and the previous key frame in the two image frames respectively.
In one possible implementation, the apparatus further includes a determining module configured to: for any matching feature point among at least one matching feature point of the image frame and the previous key frame, acquire a normal distribution value of the baseline length corresponding to the matching feature point; obtain the product of the normal distribution value and the minimum of the resolutions of the matching feature point in the two image frames; and perform weighted summation of the products of the at least one matching feature point to obtain the score of the image frame.
In one possible implementation, the image obtaining module 901 is further configured to perform any one of the following:
when the first number of matching feature points of any image frame and the previous key frame is smaller than a number threshold, acquiring the image frame as a key frame;
when the ratio of the first number and the second number of the matched feature points of any image frame and the previous key frame is larger than a ratio threshold value, acquiring any image frame as a key frame;
and when the score of any image frame is larger than a score threshold value, acquiring the any image frame as a key frame.
In one possible implementation, the apparatus further includes:
the characteristic extraction module is used for extracting the characteristics of the plurality of image frames to obtain the characteristic point of each image frame;
and the characteristic matching module is used for matching the characteristic point of each image frame with the characteristic point of the previous image frame of each image frame.
In one possible implementation manner, the feature extraction module and the feature matching module are further configured to perform feature extraction on the image frames and match feature points extracted from the image frames respectively by using different threads.
In a possible implementation manner, the apparatus further includes a binocular matching module, where the binocular matching module is configured to, when the camera that acquires the plurality of image frames is a binocular camera, match feature points of two image frames that are acquired by a first camera and a second camera in the binocular camera respectively through a separate thread.
In one possible implementation, the feature matching module is configured to: for each image frame, predicting the camera pose of each image frame according to the camera pose of a first historical image frame before each image frame; and matching the feature point of each image frame with the feature point of the previous image frame of each image frame according to the camera pose of each image frame to obtain the spatial position of the feature point of each image frame.
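As an illustration of the pose prediction used to guide feature matching, a constant-velocity model is a common choice; this specific model, and the 4x4 world-from-camera pose representation, are assumptions rather than details given by the embodiment.

```python
import numpy as np

def predict_pose(T_prev, T_prev_prev):
    """Extrapolate the motion between the two most recent frames one step forward
    to predict the camera pose of the current image frame."""
    T_prev = np.asarray(T_prev)
    T_prev_prev = np.asarray(T_prev_prev)
    last_motion = np.linalg.inv(T_prev_prev) @ T_prev   # relative motion between the previous two frames
    return T_prev @ last_motion                          # assume the same motion continues
```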
In one possible implementation, the apparatus further includes an optimization module configured to: acquiring the sum of errors of all feature points of each image frame, wherein the error of each feature point is used for indicating the difference between the spatial position of each feature point and the spatial position predicted based on a second historical image frame, and the spatial position predicted based on the second historical image frame is determined based on the depth of the matched feature point of each feature point; and adjusting the camera pose of each image frame based on the sum of the errors of all the feature points of each image frame, and stopping adjusting until the sum of the errors meets a target condition to obtain the optimized camera pose of each image frame.
In one possible implementation, the scene creation module 903 is configured to: for any key frame in the acquired multiple key frames, creating a first local virtual scene corresponding to the key frame based on the feature point of the key frame and other key frames comprising the feature point of the key frame; optimizing the first local virtual scene of any key frame and other key frames to obtain a second local virtual scene corresponding to any key frame; based on second local virtual scenes corresponding to the plurality of key frames, creating initial virtual scenes corresponding to the plurality of key frames; and optimizing the initial virtual scenes of the plurality of key frames to obtain target virtual scenes corresponding to the plurality of image frames.
In one possible implementation, the scenario creation module 903 is configured to: acquiring the sum of errors of all feature points of any key frame and other key frames, wherein the error of each feature point is used for indicating the difference between the spatial position of each feature point and the spatial position predicted based on a second historical image frame, and the spatial position predicted based on the second historical image frame is determined based on the depth of the matched feature point of each feature point; adjusting the camera poses of any key frame and other key frames and the depths of all feature points based on the sum of the errors of all feature points of any key frame and other key frames, and stopping adjusting until the sum of the errors meets a target condition to obtain a second local virtual scene corresponding to any key frame;
the scene creation module 903 is configured to: acquiring the sum of errors of all the characteristic points of the plurality of key frames; and adjusting the camera poses of the plurality of key frames and the depths of all the feature points based on the sum of the errors of all the feature points of the plurality of key frames, and stopping adjustment until the sum of the errors meets a target condition to obtain a target virtual scene corresponding to the plurality of image frames.
In one possible implementation, the scene creation module 903 is configured to: for any feature point of any image frame, acquiring the spatial position of the matched feature point in a camera coordinate system of a second historical image frame according to the depth of the matched feature point of the any feature point in the second historical image frame, wherein the second historical image frame is a first image frame in a plurality of image frames matched with the any feature point; acquiring a predicted relative direction of any feature point relative to the camera position of any image frame according to the spatial position of the matched feature point in a camera coordinate system of the second historical image frame, the camera pose of the second historical image frame and the relative camera pose of the any image frame relative to the second historical image frame; acquiring the relative direction of the any characteristic point relative to the camera position of the any image frame; and taking the difference between the predicted relative direction and the projection position of the relative direction in a plane perpendicular to the relative direction as the error of any one feature point.
In one possible implementation, the scenario creation module 903 is configured to perform at least one of:
acquiring the spatial position of each landmark point in a target virtual scene based on the acquired spatial positions of the feature points of the plurality of key frames to obtain a target virtual map;
and acquiring a target motion track of equipment for acquiring the plurality of image frames based on the acquired camera poses of the plurality of key frames.
According to the apparatus provided by the embodiment of the present invention, when key frames are acquired from a plurality of image frames and whether any image frame is acquired as a key frame is judged, multiple pieces of position difference information of the feature points matched between the image frame and the previous key frame before it are considered; when any one of them meets the accuracy requirement of scene creation, the image frame is acquired as a key frame, so that enough key frames are acquired. This can avoid the problem that the created virtual scene is inaccurate due to failure in positioning the device that acquires the image frames or the landmark points, so the accuracy of the target virtual scene obtained by the image processing method provided by the embodiment of the present invention is good.
It should be noted that: in the image processing apparatus provided in the above embodiment, when processing an image, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the electronic device may be divided into different functional modules to complete all or part of the functions described above. In addition, the image processing apparatus and the image processing method provided in the above embodiments belong to the same concept, and specific implementation processes thereof are described in the method embodiments in detail and are not described herein again.
The electronic device may be provided as a terminal shown in fig. 10 described below, or may be provided as a server shown in fig. 11 described below, which is not limited in this embodiment of the present invention.
Fig. 10 is a schematic structural diagram of a terminal according to an embodiment of the present invention. The terminal 1000 can be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. Terminal 1000 can also be referred to by other names such as user equipment, portable terminal, laptop terminal, or desktop terminal.
In general, terminal 1000 can include: a processor 1001 and a memory 1002.
Processor 1001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so forth. The processor 1001 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1001 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in a wake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1001 may be integrated with a GPU (Graphics Processing Unit) that is responsible for rendering and drawing content that needs to be displayed on the display screen. In some embodiments, the processor 1001 may further include an AI (Artificial Intelligence) processor for processing a calculation operation related to machine learning.
Memory 1002 may include one or more computer-readable storage media, which may be non-transitory. The memory 1002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1002 is used to store at least one instruction for execution by the processor 1001 to implement the image processing method provided by the method embodiments of the present invention.
In some embodiments, terminal 1000 can also optionally include: a peripheral interface 1003 and at least one peripheral. The processor 1001, memory 1002, and peripheral interface 1003 may be connected by buses or signal lines. Various peripheral devices may be connected to peripheral interface 1003 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1004, touch screen display 1005, camera 1006, audio circuitry 1007, positioning components 1008, and power supply 1009.
Peripheral interface 1003 may be used to connect at least one peripheral associated with I/O (Input/Output) to processor 1001 and memory 1002. In some embodiments, processor 1001, memory 1002, and peripheral interface 1003 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1001, the memory 1002, and the peripheral interface 1003 may be implemented on separate chips or circuit boards, which are not limited by this embodiment.
The Radio Frequency circuit 1004 is used to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuitry 1004 communicates with communication networks and other communication devices via electromagnetic signals. The radio frequency circuit 1004 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 1004 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1004 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1004 may further include NFC (Near Field Communication) related circuits, which are not limited by the present invention.
The display screen 1005 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1005 is a touch display screen, the display screen 1005 also has the ability to capture touch signals on or over the surface of the display screen 1005. The touch signal may be input to the processor 1001 as a control signal for processing. At this point, the display screen 1005 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, display screen 1005 can be one, providing a front panel of terminal 1000; in other embodiments, display 1005 can be at least two, respectively disposed on different surfaces of terminal 1000 or in a folded design; in still other embodiments, display 1005 can be a flexible display disposed on a curved surface or on a folded surface of terminal 1000. Even more, the display screen 1005 may be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The Display screen 1005 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 1006 is used to capture images or video. Optionally, the camera assembly 1006 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1006 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp and can be used for light compensation under different color temperatures.
The audio circuit 1007 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals into the processor 1001 for processing or inputting the electric signals into the radio frequency circuit 1004 for realizing voice communication. For stereo sound collection or noise reduction purposes, multiple microphones can be provided, each at a different location of terminal 1000. The microphone may also be an array microphone or an omni-directional acquisition microphone. The speaker is used to convert electrical signals from the processor 1001 or the radio frequency circuit 1004 into sound waves. The loudspeaker can be a traditional film loudspeaker and can also be a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuit 1007 may also include a headphone jack.
A Location component 1008 is employed to locate a current geographic Location of terminal 1000 for purposes of navigation or LBS (Location Based Service).
Power supply 1009 is used to supply power to various components in terminal 1000. The power source 1009 may be alternating current, direct current, disposable battery, or rechargeable battery. When the power source 1009 includes a rechargeable battery, the rechargeable battery may support wired charging or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, terminal 1000 can also include one or more sensors 1010. The one or more sensors 1010 include, but are not limited to: acceleration sensor 1011, gyro sensor 1012, pressure sensor 1013, optical sensor 1015, and proximity sensor 1016.
Acceleration sensor 1011 can detect acceleration magnitudes on three coordinate axes of a coordinate system established with terminal 1000. For example, the acceleration sensor 1011 can be used to detect the components of the gravitational acceleration on three coordinate axes. The processor 1001 may control the touch display screen 1005 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1011. The acceleration sensor 1011 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1012 may detect a body direction and a rotation angle of the terminal 1000, and the gyro sensor 1012 and the acceleration sensor 1011 may cooperate to acquire a 3D motion of the user on the terminal 1000. The processor 1001 may implement the following functions according to the data collected by the gyro sensor 1012: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization while shooting, game control, and inertial navigation.
Pressure sensor 1013 may be disposed on a side frame of terminal 1000 and/or on a lower layer of touch display 1005. When the pressure sensor 1013 is disposed on a side frame of the terminal 1000, a user's grip signal of the terminal 1000 can be detected, and left-right hand recognition or shortcut operation can be performed by the processor 1001 according to the grip signal collected by the pressure sensor 1013. When the pressure sensor 1013 is disposed at a lower layer of the touch display screen 1005, the processor 1001 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 1005. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
Optical sensor 1015 is used to collect ambient light intensity. In one embodiment, the processor 1001 may control the display brightness of the touch display screen 1005 according to the intensity of the ambient light collected by the optical sensor 1015. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 1005 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 1005 is turned down. In another embodiment, the processor 1001 may also dynamically adjust the shooting parameters of the camera assembly 1006 according to the intensity of the ambient light collected by the optical sensor 1015.
Proximity sensor 1016, also known as a distance sensor, is typically disposed on a front panel of terminal 1000. Proximity sensor 1016 is used to gather the distance between the user and the front face of terminal 1000. In one embodiment, when proximity sensor 1016 detects that the distance between the user and the front surface of terminal 1000 is gradually reduced, touch display screen 1005 is controlled by processor 1001 to switch from a bright screen state to a dark screen state; when proximity sensor 1016 detects that the distance between the user and the front of terminal 1000 is gradually increased, touch display screen 1005 is controlled by processor 1001 to switch from a breath-screen state to a bright-screen state.
Those skilled in the art will appreciate that the configuration shown in FIG. 10 is not intended to be limiting and that terminal 1000 can include more or fewer components than shown, or some components can be combined, or a different arrangement of components can be employed.
Fig. 11 is a schematic structural diagram of a server according to an embodiment of the present invention, where the server 1100 may generate relatively large differences due to different configurations or performances, and may include one or more processors (CPUs) 1101 and one or more memories 1102, where the memory 1102 stores at least one instruction, and the at least one instruction is loaded and executed by the processors 1101 to implement the image processing methods provided by the foregoing method embodiments. Certainly, the server may further have a wired or wireless network interface, a keyboard, an input/output interface, and other components to facilitate input and output, and the server may further include other components for implementing functions of the device, which are not described herein again.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory, comprising instructions executable by a processor to perform the image processing method in the above-described embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by hardware related to instructions of a program, and the program may be stored in a computer readable storage medium, where the above mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk. The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims (20)

1. An image processing method, characterized in that the method comprises:
acquiring a plurality of image frames;
for any matching feature point in any image frame and at least one matching feature point of a previous key frame of the image frame, acquiring a normal distribution value of a base length corresponding to the any matching feature point; acquiring the product of the normal distribution value and the minimum value of the resolution of each of the matched feature points of any image frame and the previous key frame in the two image frames; performing weighted summation on the product of the at least one matched feature point to obtain the fraction of any image frame;
when the fraction of any image frame is larger than a fraction threshold value, acquiring the any image frame as a key frame;
and creating a target virtual scene based on the acquired key frames, wherein the target virtual scene is used for representing a scene corresponding to the image frames.
2. The method of claim 1, wherein after the acquiring the plurality of image frames, the method further comprises:
extracting the features of the image frames to obtain feature points of each image frame;
and matching the characteristic point of each image frame with the characteristic point of the previous image frame of each image frame.
3. The method of claim 2, further comprising:
and respectively extracting the features of the image frames and matching the extracted feature points of the image frames by adopting different threads.
4. The method of claim 3, wherein after the feature extraction of the plurality of image frames to obtain the feature point of each image frame, the method further comprises:
and when the camera for collecting the image frames is a binocular camera, matching the characteristic points of the two image frames respectively collected by the first camera and the second camera in the binocular camera through a single thread.
5. The method of claim 2, wherein the matching the feature points of each image frame with the feature points of the previous image frame of each image frame comprises:
for each image frame, predicting a camera pose of each image frame according to a camera pose of a first historical image frame before each image frame;
and matching the feature point of each image frame with the feature point of the previous image frame of each image frame according to the camera pose of each image frame to obtain the spatial position of the feature point of each image frame.
6. The method according to claim 5, wherein after the matching of the feature point of each image frame with the feature point of the previous image frame of each image frame according to the camera pose of each image frame to obtain the spatial position of the feature point of each image frame, the method further comprises:
acquiring the sum of errors of all feature points of each image frame, wherein the error of each feature point is used for indicating the difference between the spatial position of each feature point and the spatial position predicted based on a second historical image frame, and the spatial position predicted based on the second historical image frame is determined based on the depth of the matched feature point of each feature point;
and adjusting the camera pose of each image frame based on the sum of the errors of all the feature points of each image frame, and stopping adjusting until the sum of the errors meets a target condition to obtain the optimized camera pose of each image frame.
7. The method of claim 1, wherein creating a target virtual scene based on the obtained plurality of key frames comprises:
for any key frame in the obtained multiple key frames, creating a first local virtual scene corresponding to the key frame based on the feature point of the key frame and other key frames comprising the feature point of the key frame;
optimizing the first local virtual scene of any key frame and other key frames to obtain a second local virtual scene corresponding to any key frame;
based on the second local virtual scenes corresponding to the plurality of key frames, creating initial virtual scenes corresponding to the plurality of key frames;
and optimizing the initial virtual scenes of the plurality of key frames to obtain target virtual scenes corresponding to the plurality of image frames.
8. The method according to claim 7, wherein the optimizing the first local virtual scene of the any key frame and the other key frames to obtain a second local virtual scene corresponding to the any key frame includes:
acquiring the sum of errors of all feature points of any one key frame and other key frames, wherein the error of each feature point is used for indicating the difference between the spatial position of each feature point and the spatial position predicted based on a second historical image frame, and the spatial position predicted based on the second historical image frame is determined based on the depth of the matched feature point of each feature point;
adjusting the camera poses of any key frame and other key frames and the depths of all feature points based on the sum of errors of all feature points of any key frame and other key frames, and stopping adjusting until the sum of errors meets a target condition to obtain a second local virtual scene corresponding to any key frame;
the optimizing the initial virtual scenes of the plurality of key frames to obtain the target virtual scenes corresponding to the plurality of image frames includes:
acquiring the sum of errors of all feature points of the plurality of key frames;
and adjusting the camera poses of the plurality of key frames and the depths of all the feature points based on the sum of the errors of all the feature points of the plurality of key frames, and stopping adjustment until the sum of the errors meets a target condition to obtain a target virtual scene corresponding to the plurality of image frames.
9. The method of claim 1, wherein creating a target virtual scene based on the obtained plurality of key frames comprises at least one of:
acquiring the spatial position of each landmark point in a target virtual scene based on the acquired spatial positions of the feature points of the plurality of key frames to obtain a target virtual map;
and acquiring a target motion track of equipment for acquiring the plurality of image frames based on the acquired camera poses of the plurality of key frames.
10. An image processing apparatus, characterized in that the apparatus comprises:
the image acquisition module is used for acquiring a plurality of image frames;
the information acquisition module is used for acquiring a normal distribution value of a baseline length corresponding to any matching feature point in any image frame and at least one matching feature point of a previous key frame of the image frame; obtaining the product of the normal distribution value and the minimum value of the resolution of each of the matched feature points of any image frame and the previous key frame in the two image frames; performing weighted summation on the product of the at least one matched feature point to obtain the fraction of any image frame;
the image acquisition module is further used for acquiring any image frame as a key frame when the score of the image frame is greater than a score threshold value;
and the scene creating module is used for creating a target virtual scene based on the acquired key frames, wherein the target virtual scene is used for representing scenes corresponding to the image frames.
11. The apparatus of claim 10, further comprising:
the characteristic extraction module is used for extracting the characteristics of the plurality of image frames to obtain characteristic points of each image frame;
the characteristic matching module is used for matching the characteristic point of each image frame with the characteristic point of the previous image frame of each image frame.
12. The apparatus of claim 11, wherein the feature extraction module and the feature matching module are further configured to respectively perform feature extraction on image frames and match feature points extracted from image frames by using different threads.
13. The apparatus of claim 12, further comprising a binocular matching module to:
when the cameras for collecting the image frames are binocular cameras, matching feature points of the two image frames respectively collected by a first camera and a second camera in the binocular cameras through independent threads.
14. The apparatus of claim 11, wherein the feature matching module is configured to:
for each image frame, predicting a camera pose of each image frame according to a camera pose of a first historical image frame before each image frame;
and matching the feature point of each image frame with the feature point of the previous image frame of each image frame according to the camera pose of each image frame to obtain the spatial position of the feature point of each image frame.
15. The apparatus of claim 14, further comprising an optimization module to:
acquiring the sum of errors of all feature points of each image frame, wherein the error of each feature point is used for indicating the difference between the spatial position of each feature point and the spatial position predicted based on a second historical image frame, and the spatial position predicted based on the second historical image frame is determined based on the depth of the matched feature point of each feature point;
and adjusting the camera pose of each image frame based on the sum of the errors of all the feature points of each image frame, and stopping adjusting until the sum of the errors meets a target condition to obtain the optimized camera pose of each image frame.
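The pose adjustment of claim 15 minimises a summed per-feature error until a target condition is met. The sketch below shows the loop structure with a deliberately simplified error model (a translation-only pose and a plain gradient step); the optimiser, learning rate and stopping tolerance are illustrative choices, not taken from the patent.

    import numpy as np

    def total_error(pose_t, points_3d, observed):
        """Sum over feature points of the distance between the observed spatial
        position and the position predicted under the (simplified) pose."""
        return np.sum(np.linalg.norm(points_3d + pose_t - observed, axis=1))

    def optimize_pose(pose_t, points_3d, observed, lr=0.01, tol=1e-6, max_iter=500):
        # Adjust the pose until the error sum stops improving (target condition).
        prev = total_error(pose_t, points_3d, observed)
        for _ in range(max_iter):
            diff = points_3d + pose_t - observed
            # Gradient of the summed Euclidean error with respect to the translation.
            grad = np.sum(diff / (np.linalg.norm(diff, axis=1, keepdims=True) + 1e-12), axis=0)
            pose_t = pose_t - lr * grad
            cur = total_error(pose_t, points_3d, observed)
            if abs(prev - cur) < tol:
                break
            prev = cur
        return pose_t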
16. The apparatus of claim 10, wherein the scene creation module is configured to:
for any key frame in the acquired plurality of key frames, creating a first local virtual scene corresponding to the key frame based on the feature points of the key frame and other key frames comprising the feature points of the key frame;
optimizing the first local virtual scene of the key frame and the other key frames to obtain a second local virtual scene corresponding to the key frame;
based on second local virtual scenes corresponding to the plurality of key frames, creating initial virtual scenes corresponding to the plurality of key frames;
and optimizing the initial virtual scenes of the plurality of key frames to obtain target virtual scenes corresponding to the plurality of image frames.
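The two-level flow of claim 16 (a local scene per key frame, then a global scene over all key frames) can be sketched structurally as below; covisible, optimize_local and optimize_global are assumed callables standing in for the actual selection of co-visible key frames and the local and global optimisation steps.

    def create_target_scene(key_frames, covisible, optimize_local, optimize_global):
        """Illustrative scaffolding only: build and refine local scenes per key
        frame, then merge and refine them into the target virtual scene."""
        second_local_scenes = []
        for kf in key_frames:
            neighbours = covisible(kf)                   # key frames sharing kf's feature points
            first_local = {"frames": [kf] + neighbours}  # first local virtual scene
            second_local_scenes.append(optimize_local(first_local))
        initial_scene = {"locals": second_local_scenes}  # initial virtual scene
        return optimize_global(initial_scene)            # target virtual scene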
17. The apparatus of claim 16, wherein the scene creation module is configured to:
acquiring the sum of errors of all feature points of the key frame and the other key frames, wherein the error of each feature point is used for indicating the difference between the spatial position of each feature point and the spatial position predicted based on a second historical image frame, and the spatial position predicted based on the second historical image frame is determined based on the depth of the matched feature point of each feature point;
adjusting the camera poses of the key frame and the other key frames and the depths of all the feature points based on the sum of errors of all the feature points of the key frame and the other key frames, and stopping adjustment until the sum of errors meets a target condition to obtain a second local virtual scene corresponding to the key frame;
the optimizing the initial virtual scenes of the plurality of key frames to obtain the target virtual scenes corresponding to the plurality of image frames includes:
acquiring the sum of errors of all feature points of the plurality of key frames;
and adjusting the camera poses of the plurality of key frames and the depths of all the feature points based on the sum of the errors of all the feature points of the plurality of key frames, and stopping adjustment until the sum of the errors meets a target condition to obtain a target virtual scene corresponding to the plurality of image frames.
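Claim 17 jointly adjusts the camera poses of all key frames and the depths of their feature points until the summed error converges, which is the classic shape of a bundle-adjustment problem. The Python sketch below shows one possible formulation using scipy.optimize.least_squares with a heavily simplified observation model (translation-only poses and landmarks parameterised as ray times depth); it is an illustration under those assumptions, not the patent's solver.

    import numpy as np
    from scipy.optimize import least_squares

    def pack(poses_t, depths):
        # Flatten N translation-only poses (N x 3) and M depths into one vector.
        return np.concatenate([poses_t.ravel(), depths])

    def unpack(x, n_frames):
        return x[:n_frames * 3].reshape(n_frames, 3), x[n_frames * 3:]

    def residuals(x, rays, observations, n_frames):
        """Per-feature errors: a landmark is ray * depth in a reference view,
        a pose is a translation, prediction is landmark + pose (simplified)."""
        poses_t, depths = unpack(x, n_frames)
        res = []
        for frame_idx, point_idx, obs in observations:
            predicted = rays[point_idx] * depths[point_idx] + poses_t[frame_idx]
            res.extend(predicted - obs)
        return np.asarray(res)

    def global_adjust(poses_t, depths, rays, observations):
        # Jointly refine all key-frame poses and feature depths; the solver's
        # convergence criteria play the role of the target condition.
        n_frames = len(poses_t)
        result = least_squares(residuals, pack(poses_t, depths),
                               args=(rays, observations, n_frames))
        return unpack(result.x, n_frames)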
18. The apparatus of claim 10, wherein the scene creation module is configured to perform at least one of:
acquiring the spatial position of each landmark point in a target virtual scene based on the acquired spatial positions of the feature points of the plurality of key frames to obtain a target virtual map;
and acquiring a target motion track of equipment for acquiring the plurality of image frames based on the acquired camera poses of the plurality of key frames.
19. An electronic device, comprising one or more processors and one or more memories having stored therein at least one instruction that is loaded and executed by the one or more processors to perform operations performed by the image processing method of any of claims 1 to 9.
20. A computer-readable storage medium having stored therein at least one instruction, which is loaded and executed by a processor to perform operations performed by the image processing method of any one of claims 1 to 9.
CN201910209109.9A 2019-03-19 2019-03-19 Image processing method, image processing device, electronic equipment and storage medium Active CN109947886B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910209109.9A CN109947886B (en) 2019-03-19 2019-03-19 Image processing method, image processing device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910209109.9A CN109947886B (en) 2019-03-19 2019-03-19 Image processing method, image processing device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109947886A CN109947886A (en) 2019-06-28
CN109947886B true CN109947886B (en) 2023-01-10

Family

ID=67010251

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910209109.9A Active CN109947886B (en) 2019-03-19 2019-03-19 Image processing method, image processing device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109947886B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110561416B (en) * 2019-08-01 2021-03-02 深圳市银星智能科技股份有限公司 Laser radar repositioning method and robot
CN110648397B (en) * 2019-09-18 2023-05-16 Oppo广东移动通信有限公司 Scene map generation method and device, storage medium and electronic equipment
CN111104927B (en) * 2019-12-31 2024-03-22 维沃移动通信有限公司 Information acquisition method of target person and electronic equipment
CN111325842B (en) * 2020-03-04 2023-07-28 Oppo广东移动通信有限公司 Map construction method, repositioning method and device, storage medium and electronic equipment
CN113436309A (en) * 2020-03-23 2021-09-24 南京科沃斯机器人技术有限公司 Scene reconstruction method, system and device and sweeping robot
CN112771577A (en) * 2020-05-28 2021-05-07 深圳市大疆创新科技有限公司 Camera parameter determination method, device and readable storage medium
CN111739127A (en) * 2020-06-09 2020-10-02 广联达科技股份有限公司 Method and device for simulating associated motion in mechanical linkage process
CN111491180B (en) * 2020-06-24 2021-07-09 腾讯科技(深圳)有限公司 Method and device for determining key frame
CN112132940A (en) * 2020-09-16 2020-12-25 北京市商汤科技开发有限公司 Display method, display device and storage medium
CN112288816B (en) * 2020-11-16 2024-05-17 Oppo广东移动通信有限公司 Pose optimization method, pose optimization device, storage medium and electronic equipment
CN112492230B (en) * 2020-11-26 2023-03-24 北京字跳网络技术有限公司 Video processing method and device, readable medium and electronic equipment
CN113190120B (en) * 2021-05-11 2022-06-24 浙江商汤科技开发有限公司 Pose acquisition method and device, electronic equipment and storage medium
CN116258769B (en) * 2023-05-06 2023-07-25 亿咖通(湖北)技术有限公司 Positioning verification method and device, electronic equipment and storage medium

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867288A (en) * 2011-07-07 2013-01-09 三星电子株式会社 Depth image conversion apparatus and method
CN105940674A (en) * 2014-12-31 2016-09-14 深圳市大疆创新科技有限公司 System and method for adjusting baseline of imaging system with microlens array
CN106062849A (en) * 2014-02-24 2016-10-26 日产自动车株式会社 Local location computation device and local location computation method
CN107004275A (en) * 2014-11-21 2017-08-01 Metaio有限公司 For determining that at least one of 3D in absolute space ratio of material object reconstructs the method and system of the space coordinate of part
CN107084710A (en) * 2014-05-05 2017-08-22 赫克斯冈技术中心 Camera model and measurement subsystem
CN107301402A (en) * 2017-06-30 2017-10-27 锐捷网络股份有限公司 A kind of determination method, device, medium and the equipment of reality scene key frame
CN107369183A (en) * 2017-07-17 2017-11-21 广东工业大学 Towards the MAR Tracing Registration method and system based on figure optimization SLAM
CN107527366A (en) * 2017-08-23 2017-12-29 上海视智电子科技有限公司 A kind of camera tracking towards depth camera
CN108615247A (en) * 2018-04-27 2018-10-02 深圳市腾讯计算机系统有限公司 Method for relocating, device, equipment and the storage medium of camera posture tracing process
CN108648270A (en) * 2018-05-12 2018-10-12 西北工业大学 Unmanned plane real-time three-dimensional scene reconstruction method based on EG-SLAM
CN108780577A (en) * 2017-11-30 2018-11-09 深圳市大疆创新科技有限公司 Image processing method and equipment
CN108898630A (en) * 2018-06-27 2018-11-27 清华-伯克利深圳学院筹备办公室 A kind of three-dimensional rebuilding method, device, equipment and storage medium
CN109211277A (en) * 2018-10-31 2019-01-15 北京旷视科技有限公司 The state of vision inertia odometer determines method, apparatus and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090010507A1 (en) * 2007-07-02 2009-01-08 Zheng Jason Geng System and method for generating a 3d model of anatomical structure using a plurality of 2d images

Also Published As

Publication number Publication date
CN109947886A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN109947886B (en) Image processing method, image processing device, electronic equipment and storage medium
US11205282B2 (en) Relocalization method and apparatus in camera pose tracking process and storage medium
US11158083B2 (en) Position and attitude determining method and apparatus, smart device, and storage medium
CN108615248B (en) Method, device and equipment for relocating camera attitude tracking process and storage medium
WO2019205851A1 (en) Pose determination method and device, intelligent apparatus, and storage medium
CN110807361B (en) Human body identification method, device, computer equipment and storage medium
EP3813014B1 (en) Camera localization method, terminal and storage medium
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN110647865A (en) Face gesture recognition method, device, equipment and storage medium
CN110570460B (en) Target tracking method, device, computer equipment and computer readable storage medium
CN110544272A (en) face tracking method and device, computer equipment and storage medium
CN111127509B (en) Target tracking method, apparatus and computer readable storage medium
CN112749613B (en) Video data processing method, device, computer equipment and storage medium
CN112150560B (en) Method, device and computer storage medium for determining vanishing point
CN111027490A (en) Face attribute recognition method and device and storage medium
CN111928861B (en) Map construction method and device
CN113378705B (en) Lane line detection method, device, equipment and storage medium
CN111982293B (en) Body temperature measuring method and device, electronic equipment and storage medium
CN111179628B (en) Positioning method and device for automatic driving vehicle, electronic equipment and storage medium
CN113298040A (en) Key point detection method and device, electronic equipment and computer-readable storage medium
CN111310526B (en) Parameter determination method and device for target tracking model and storage medium
CN110163192B (en) Character recognition method, device and readable medium
US20220345621A1 (en) Scene lock mode for capturing camera images
CN114093020A (en) Motion capture method, motion capture device, electronic device and storage medium
CN112990424A (en) Method and device for training neural network model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant