WO2020134818A1 - Image processing method and related product - Google Patents

Image processing method and related product

Info

Publication number
WO2020134818A1
WO2020134818A1 · PCT/CN2019/121345 · CN2019121345W
Authority
WO
WIPO (PCT)
Prior art keywords
image
feature
value
target
video images
Prior art date
Application number
PCT/CN2019/121345
Other languages
French (fr)
Chinese (zh)
Inventor
赵培骁
虞勇波
黄轩
王孝宇
Original Assignee
深圳云天励飞技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳云天励飞技术有限公司
Publication of WO2020134818A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects

Definitions

  • This application relates to the technical field of image processing, and in particular to an image processing method and related products.
  • 3D reconstruction technology has been widely used in many cutting-edge technologies. It is a common scientific problem and core technology in the fields of computer vision, medical image processing, scientific computing and virtual reality, and digital media creation.
  • traditionally, three-dimensional reconstruction has mostly been based on scene point cloud data, which is typically acquired through multiple cameras, laser cameras, and the like; after acquisition, multiple further steps such as three-dimensional matching are required. This results in high system cost, high demands on the system's computing power, and the inability to achieve miniaturization.
  • the embodiments of the present application provide an image processing method and related products, which can reduce the implementation cost of three-dimensional reconstruction.
  • a first aspect of an embodiment of the present application provides an image processing method, including:
  • the depth map is processed according to point cloud data processing technology to obtain a 3D image.
  • the performing deep feature extraction according to the multiple video images to obtain a feature set includes:
  • a maximum value is selected from the plurality of image quality evaluation values, and the preprocessed video image corresponding to the maximum value is input to a preset convolutional neural network to obtain a feature set.
  • each video image in the plurality of video images includes a human face
  • the image quality evaluation is performed on each of the pre-processed multiple video images to obtain multiple image quality evaluation values, including:
  • the image quality evaluation value corresponding to the target angle value is determined.
  • the acquiring two weight values corresponding to the two-dimensional angle value includes:
  • each mapping relationship includes a first mapping relationship between the angle value in the x direction and the first weight value;
  • the target second weight value is determined according to the target first weight value.
  • a second aspect of an embodiment of the present application provides an image processing apparatus, including:
  • the acquisition unit is used to acquire a video stream in a specified area through a single camera
  • a sampling unit configured to sample the video stream to obtain multiple video images
  • a preprocessing unit configured to preprocess the multiple video images to obtain the preprocessed multiple video images
  • An extraction unit configured to perform depth feature extraction based on the pre-processed multiple video images to obtain a feature set
  • a generating unit configured to generate a depth map according to the feature set
  • the processing unit is configured to process the depth map according to point cloud data processing technology to obtain a 3D image.
  • an embodiment of the present application provides an electronic device, including a processor, a memory, and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the processor,
  • the above program includes instructions for performing the steps in the first aspect of the embodiments of the present application.
  • an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application.
  • an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application.
  • the computer program product may be a software installation package.
  • in the embodiments of the present application, a video stream of a specified area is obtained through a single camera, the video stream is sampled to obtain multiple video images, and the multiple video images are preprocessed to obtain the preprocessed multiple video images.
  • depth feature extraction is performed according to the preprocessed multiple video images to obtain a feature set, a depth map is generated according to the feature set, and the depth map is processed according to point cloud data processing technology to obtain a 3D image.
  • in this way, a single camera can be used to collect video images; after sampling, preprocessing, and feature extraction, a feature set is obtained, converted into a depth map, and realized as a 3D scene graph through point cloud data processing technology, which reduces the cost of three-dimensional reproduction.
  • FIG. 1A is a schematic flowchart of an embodiment of an image processing method provided by an embodiment of the present application.
  • FIG. 1B is a schematic structural diagram of a preset convolutional neural network provided by an embodiment of the present application.
  • FIG. 1C is a demonstration effect diagram of a video image provided by an embodiment of the present application.
  • FIG. 1D is a depth map of the video image in FIG. 1C provided by an embodiment of the present application.
  • FIG. 1E is a simple schematic diagram of a point cloud data processing technology provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of another embodiment of an image processing method according to an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of an embodiment of an image processing apparatus provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of an embodiment of an image processing apparatus provided by an embodiment of the present application.
  • the electronic device in the embodiments of the present application can be connected to multiple cameras; each camera can be used to capture video images, and each camera can have a corresponding position mark or a corresponding number.
  • cameras can be installed in public places, such as schools, museums, intersections, pedestrian streets, office buildings, garages, airports, hospitals, subway stations, stations, bus platforms, supermarkets, hotels, entertainment venues, and so on. After the camera captures the video image, the video image can be saved to the memory of the system where the electronic device is located. Multiple image libraries can be stored in the memory, and each image library can contain different video images of the same person. Of course, each image library can also be used to store video images of an area or video images taken by a specified camera.
  • each frame of video image captured by the camera corresponds to one piece of attribute information
  • the attribute information is at least one of the following: the shooting time of the video image, the position of the video image, the attribute parameters of the video image (format, size, resolution, etc.), the number of the video image, and the character attributes of the video image.
  • the character attributes in the video image may include, but are not limited to: the number of characters in the video image, character positions, character angle values, ages, image quality, and so on.
  • the embodiments of the present application place very low requirements on the device: only a single camera capable of capturing RGB images or video is needed to complete data collection and point cloud generation; the point cloud data and the original RGB images are then fed into the subsequent packaged pipeline, in which three-dimensional reconstruction of the scene is achieved.
  • the scene 3D reconstruction technology based on single-camera depth-of-field prediction can be divided into six modules: video stream acquisition, image preprocessing, depth feature extraction and scene depth map generation, depth-map-based point cloud data generation, matching and fusion of RGB images with point cloud data, and 3D object surface generation.
  • since video stream acquisition, the subsequent matching and fusion of RGB images with point cloud data, and 3D object surface generation are relatively mature, this application optimizes the method of generating point cloud data from the scene, greatly reducing the requirements on equipment and computing power.
  • FIG. 1A is a schematic flowchart of an embodiment of an image processing method according to an embodiment of the present application.
  • the image processing method described in this embodiment includes the following steps:
  • the electronic device may include a single camera, and the single camera may be a visible light camera.
  • the specified area can be set by the user or the system default.
  • the electronic device can shoot a specified area at a preset time interval through a single camera to obtain a video stream, and the preset time interval can be set by the user or the system default.
  • the electronic device can capture the video stream collected by the camera after the camera is turned on, and perform frame extraction processing on the acquired video stream, that is, the video stream is sampled according to a preset sampling frequency to obtain multiple video images.
  • the sampling frequency can be set by the user or the system default.
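  • As an illustration of this sampling step, the following minimal Python sketch (assuming OpenCV; the sampling frequency and frame cap are illustrative stand-ins for the user-set or default values) extracts frames from a video stream at a preset frequency:

```python
import cv2

def sample_video_stream(source, sample_rate_hz=2.0, max_frames=32):
    """Extract frames from a video stream at a preset sampling frequency."""
    cap = cv2.VideoCapture(source)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0          # fall back if the stream reports no FPS
    step = max(int(round(fps / sample_rate_hz)), 1)  # keep one frame every `step` frames
    frames, idx = [], 0
    while len(frames) < max_frames:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(frame)
        idx += 1
    cap.release()
    return frames
```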
  • the above preprocessing may include at least one of the following: scaling processing, noise reduction processing, image enhancement processing, etc., which is not limited herein.
  • for example, the preprocessing may scale the image: each extracted frame is scaled or expanded to a height of 224 pixels and a width of 320 pixels before being fed into the feature extraction network for feature extraction.
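  • A minimal sketch of this scaling step, again assuming OpenCV (note that cv2.resize takes the target size as (width, height)):

```python
import cv2

def preprocess(frame, height=224, width=320):
    """Scale an extracted frame to the fixed 224x320 input size of the feature network."""
    return cv2.resize(frame, (width, height), interpolation=cv2.INTER_LINEAR)
```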
  • the electronic device can perform depth feature extraction on the pre-processed multiple video images.
  • multiple pre-processed video images can be input to a preset convolutional neural network to perform deep feature extraction to obtain a feature set.
  • performing depth feature extraction based on the multiple video images to obtain a feature set may include the following steps:
  • the preset convolutional neural network may include operations such as convolution, pooling, and normalization.
  • the purpose of these operations is to extract image features, remove redundant image information, and speed up the network.
  • the extracted features include the outline, texture, and surface information of each object in the image, the edge information where objects meet, and the position information of each object in the entire scene.
  • finally, a feature image containing the information of the entire image is generated.
  • the image quality evaluation can be performed on each of the pre-processed multiple video images to obtain multiple image quality evaluation values.
  • the maximum value among the image quality evaluation values can then be selected, and the preprocessed video image corresponding to the maximum value is input into the preset convolutional neural network to obtain the feature set.
  • image quality evaluation is performed on each of the pre-processed multiple video images to obtain multiple image quality evaluation values, which can be implemented as follows:
  • At least one image quality evaluation index may be used to perform image quality evaluation on each of the pre-processed multiple video images to obtain multiple image quality evaluation values.
  • the image quality evaluation indicators may include, but are not limited to: average grayscale, mean square deviation, entropy, edge retention, signal-to-noise ratio, and so on. It can be defined that the larger the obtained image quality evaluation value, the better the image quality.
  • when the required evaluation accuracy is low, a single image quality evaluation index can be used; for example, with entropy as the index, a larger entropy indicates better face image quality, and a smaller entropy indicates worse face image quality.
  • when higher accuracy is required, multiple image quality evaluation indexes can be used, with a weight set for each index; multiple per-index evaluation values are obtained, and the final image quality evaluation value is derived from these values and their corresponding weights. For example, for three indexes A, B, and C with weights a1, a2, and a3 and per-index evaluation values b1, b2, and b3, the final image quality evaluation value = a1*b1 + a2*b2 + a3*b3.
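  • The weighted evaluation above amounts to a dot product of per-index values and their weights, as in this sketch (the index names, values, and weights are illustrative):

```python
def image_quality_score(values, weights):
    """Final evaluation value = a1*b1 + a2*b2 + a3*b3 for indexes A, B, C."""
    return sum(w * v for w, v in zip(weights, values))

# Example: three indexes (say entropy, average grayscale, SNR, each normalized to [0, 1])
# with weights a1, a2, a3 and per-index scores b1, b2, b3.
score = image_quality_score(values=[0.72, 0.58, 0.91], weights=[0.5, 0.3, 0.2])
```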
  • the preset convolutional neural network includes N downsampling layers, N upsampling layers, and a convolutional layer, where N is an integer greater than 1; in step 42 above, inputting the preprocessed video image corresponding to the maximum value into the preset convolutional neural network to obtain the feature set may include the following steps:
  • perform N downsamplings on the preprocessed video image corresponding to the maximum value through the N downsampling layers to obtain a downsampled video image, where at least one of the N downsamplings includes at least one of the following operations: a convolution operation, a pooling operation, and a normalization operation;
  • perform N upsamplings on the downsampled video image through the N upsampling layers to obtain an upsampled video image;
  • perform a convolution operation on the upsampled video image through the convolutional layer to obtain the feature set.
  • the preset convolutional neural network may include N down-sampling layers, N up-sampling layers, and convolution layers, where N is an integer greater than 1.
  • the foregoing preset convolutional neural network can be understood as an encoding-decoding network.
  • the above-mentioned N down-sampling layers can be understood as an encoding process
  • the above-mentioned N up-sampling layers and convolution layers can be understood as a decoding process.
  • as shown in FIG. 1B, the encoding process (in the dashed frame on the left) performs feature extraction, obtaining the feature image through four downsamplings.
  • each downsampling includes operations such as convolution, pooling, and normalization; the number of downsamplings was determined through experiments that weighed the speed and the accuracy of the algorithm. In theory, more samplings increase accuracy but decrease overall speed, so four are used to balance speed and accuracy. Downsampling also reduces the image size.
  • for example, if the input image is 224*320, each downsampling halves the length and width of the image, so that after four downsamplings the image is only 7*10; the size of the image therefore needs to be restored through the decoding (upsampling) network on the right, which at the same time completes the matching from the extracted feature image to the depth image.
  • the number of upsamplings is the same as the number of downsamplings, again balancing accuracy and speed, and is finally taken as four.
  • the straight line connecting the down-sampling and the up-sampling represents a "skip-connection", which can improve the accuracy of the algorithm.
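  • The following PyTorch sketch illustrates such an encoding-decoding network with four downsamplings, four upsamplings, skip connections, and a final convolution layer; the channel widths, the addition-style skips, and the output head are assumptions, since the source does not specify them. Note that four halvings of a 224×320 input give 14×20 in this sketch, while the text quotes 7×10, which would imply one further reduction not detailed in the source.

```python
import torch
import torch.nn as nn

def conv_bn(cin, cout):
    # convolution + normalization, as named among the downsampling operations
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class DepthNet(nn.Module):
    """Encoder-decoder sketch: N = 4 downsamplings and upsamplings with skip connections."""

    def __init__(self):
        super().__init__()
        chans = [16, 32, 64, 128, 256]
        self.stem = conv_bn(3, chans[0])
        self.downs = nn.ModuleList(conv_bn(chans[i], chans[i + 1]) for i in range(4))
        self.pool = nn.MaxPool2d(2)          # pooling halves the height and width
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(chans[i + 1], chans[i], 2, stride=2)
            for i in reversed(range(4))      # upsampling restores the size stage by stage
        )
        self.head = nn.Conv2d(chans[0], 1, 3, padding=1)  # final convolution layer

    def forward(self, x):
        x = self.stem(x)
        skips = []
        for down in self.downs:
            skips.append(x)                  # saved for the skip connection
            x = down(self.pool(x))
        for up, skip in zip(self.ups, reversed(skips)):
            x = up(x) + skip                 # "skip-connection" between matching stages
        return self.head(x)

out = DepthNet()(torch.randn(1, 3, 224, 320))  # -> shape (1, 1, 224, 320)
```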
  • step 104 depth feature extraction is performed according to the pre-processed multiple video images to obtain a feature set, which may be implemented as follows:
  • the multiple video images are input to a preset convolutional neural network to obtain a feature set.
  • the preset convolutional neural network may include operations such as convolution, pooling, and normalization.
  • the purpose of these operations is to extract image features, remove redundant image information, and speed up the network.
  • the extracted features include the outline, texture, and surface information of each object in the image, the edge information where objects meet, and the position information of each object in the entire scene. Finally, a feature image containing the information of the entire image is generated.
  • each video image in the plurality of video images includes a human face
  • image quality evaluation is performed on each of the pre-processed multiple video images to obtain multiple image quality evaluation values, including:
  • the electronic device can perform image segmentation on any video image to obtain a face image.
  • the above two-dimensional angle value can be understood as the two-dimensional angle between the face and the camera. Each of the two-dimensional angle values may correspond to a weight value.
  • the two weight values corresponding to the two-dimensional angle value may be preset or the system defaults.
  • the target first weight value corresponds to the x angle value and the target second weight value corresponds to the y angle value, where target first weight value + target second weight value = 1;
  • target angle value = x angle value * target first weight value + y angle value * target second weight value; in this way, a two-dimensional angle value is converted into a one-dimensional angle value, allowing the angle to be expressed accurately.
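  • In code form, the conversion is a simple weighted sum (a sketch; the two weights come from the brightness-dependent mapping described below):

```python
def target_angle(x_angle, y_angle, w1, w2):
    """Collapse the 2-D face angle into one value; the two weights must sum to 1."""
    assert abs((w1 + w2) - 1.0) < 1e-9
    return x_angle * w1 + y_angle * w2
```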
  • obtaining two weights corresponding to the two-dimensional angle value may include the following steps:
  • obtain a target environment brightness value; according to the preset mapping between environment brightness values and mapping relationships, determine the target mapping relationship corresponding to the target environment brightness value, where each mapping relationship includes a first mapping relationship between the angle value in the x direction and the first weight value;
  • the face evaluation device may pre-store the mapping relationship between preset angle values and angle quality evaluation values, and then determine, according to that mapping relationship, the first target evaluation value corresponding to the target angle value; further, if the first target evaluation value is greater than a preset evaluation threshold, the face image can be considered easy to recognize and likely to be recognized correctly.
  • the face corresponding to such an angle can be used for face unlocking, or collected by the camera, which improves the face collection efficiency of the face evaluation device.
  • the above-mentioned feature set is also called a feature map.
  • the feature map is not the final depth image, so the decoding network is necessary.
  • the value of each point is not the pixel value of a regular image, but the distance of the point from the camera in millimeters.
  • FIG. 1C shows a frame of a video image, and FIG. 1D is the corresponding depth map, presented as a grayscale image.
  • the grayscale image is displayed after the distance values in the depth map are appropriately processed: the further a point is from the lens, the lower its gray value and the closer its color is to black; conversely, the closer a point is to the lens, the greater its gray value and the closer its color is to white.
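  • A sketch of such a rendering, assuming NumPy and millimeter depth values: the depth range is normalized and inverted so that nearer points appear whiter:

```python
import numpy as np

def depth_to_gray(depth_mm):
    """Render a depth map as an 8-bit grayscale image: far -> dark, near -> bright."""
    d = depth_mm.astype("float32")
    span = max(float(d.max() - d.min()), 1e-6)
    norm = (d - d.min()) / span               # 0 at the nearest point, 1 at the farthest
    return ((1.0 - norm) * 255.0).astype("uint8")
```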
  • the above feature set may include multiple feature points, and each feature point includes a coordinate position, a feature size, and a feature direction; because a feature point is a vector, a feature value can be calculated from its feature size and feature direction, so a feature value corresponding to each feature point in the feature set can be calculated, yielding multiple target feature values, one per feature point.
  • the electronic device can also pre-store a mapping relationship between preset feature values and depth values; the target depth value corresponding to each target feature value can then be determined according to this mapping relationship, yielding multiple target depth values, each corresponding to a coordinate position, and the depth map is constructed from the multiple target depth values. In this way, a depth map can be constructed from the feature points.
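  • A sketch of this construction, where value_to_depth is a callable stand-in for the pre-stored mapping between feature values and depth values, and each feature point carries its coordinate position:

```python
import numpy as np

def build_depth_map(feature_points, value_to_depth, shape=(224, 320)):
    """Fill a depth map by looking up each feature point's value in the mapping."""
    depth = np.zeros(shape, dtype="float32")
    for row, col, feature_value in feature_points:
        depth[row, col] = value_to_depth(feature_value)  # pre-stored mapping (assumed form)
    return depth
```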
  • each point in the depth map is the distance from the camera to each point in the original image.
  • the point cloud generation is essentially the mapping of points between different coordinate systems, that is, the process of mapping any coordinate m (u, v) in a two-dimensional image to the spatial coordinate M (Xw, Yw, Zw) in a three-dimensional world.
  • the final coordinate conversion formula, the standard pinhole back-projection consistent with the variable definitions below (assuming the world frame coincides with the camera frame), is: Xw = (u - u0) × dx × Zc / f, Yw = (v - v0) × dy × Zc / f, Zw = Zc, where:
  • M (Xw, Yw, Zw) is the world coordinate;
  • m (u, v) is the depth map coordinate;
  • Zc is the value of each point in the depth map, i.e. the distance of the point from the camera;
  • u0 and v0 are the coordinates of the center of the two-dimensional image;
  • dx and dy are scale factors converting pixel units into distance units in meters (a division by 1000 is applied when the distance values are in millimeters);
  • f is the focal length of the camera lens.
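  • A sketch of this back-projection under the same assumptions (pinhole model, world frame coinciding with the camera frame, dx and dy as pixel-to-distance scale factors):

```python
import numpy as np

def depth_to_point_cloud(depth_mm, f, u0, v0, dx=1.0, dy=1.0):
    """Map every depth-map coordinate m(u, v) to a world coordinate M(Xw, Yw, Zw)."""
    v, u = np.indices(depth_mm.shape)          # pixel grid: v is the row, u the column
    zc = depth_mm.astype("float32") / 1000.0   # millimeter depth values -> meters
    xw = (u - u0) * dx * zc / f
    yw = (v - v0) * dy * zc / f
    return np.stack([xw, yw, zc], axis=-1)     # (H, W, 3) array of world points
```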
  • it can be seen that a video stream of a specified area is obtained through a single camera, the video stream is sampled to obtain multiple video images, and the multiple video images are preprocessed to obtain the preprocessed multiple video images.
  • depth feature extraction is performed according to the preprocessed multiple video images to obtain the feature set, the depth map is generated according to the feature set, and the depth map is processed according to the point cloud data processing technology to obtain a 3D image.
  • the feature set is thus converted into a depth map, and a 3D scene map is realized through point cloud data processing technology, which further reduces the cost of three-dimensional reproduction.
  • FIG. 2 is a schematic flowchart of an embodiment of an image processing method according to an embodiment of the present application.
  • the image processing method described in this embodiment includes the following steps:
  • a video stream of a specified area is obtained through a single camera, the video stream is sampled to obtain multiple video images, and the multiple video images are preprocessed to obtain the preprocessed multiple video images.
  • image quality evaluation is performed on each of the preprocessed multiple video images to obtain multiple image quality evaluation values; the maximum value is selected from the multiple image quality evaluation values, and the preprocessed video image corresponding to the maximum value is input into a preset convolutional neural network to obtain a feature set; a depth map is generated according to the feature set, and the depth map is processed according to the point cloud data processing technology to obtain a 3D image.
  • in this way, a single camera collects the video images and, after sampling, preprocessing, and feature extraction, a feature set is obtained, converted into a depth map, and realized as a 3D scene map through point cloud data processing technology, which further reduces the cost of 3D reproduction.
  • FIG. 3 is a schematic structural diagram of an embodiment of an image processing apparatus according to an embodiment of the present application.
  • the image processing device described in this embodiment includes: an acquisition unit 301, a sampling unit 302, a preprocessing unit 303, an extraction unit 304, a generation unit 305, and a processing unit 306, as follows:
  • the obtaining unit 301 is configured to obtain a video stream in a specified area through a single camera
  • the sampling unit 302 is configured to sample the video stream to obtain multiple video images
  • a pre-processing unit 303 configured to pre-process the multiple video images to obtain the pre-processed multiple video images
  • the extraction unit 304 is configured to perform depth feature extraction based on the pre-processed multiple video images to obtain a feature set
  • the generating unit 305 is configured to generate a depth map according to the feature set
  • the processing unit 306 is configured to process the depth map according to a point cloud data processing technology to obtain a 3D image.
  • with the above apparatus, a video stream of a specified area is obtained through a single camera, the video stream is sampled to obtain multiple video images, and the multiple video images are preprocessed to obtain the preprocessed multiple video images.
  • depth feature extraction is performed according to the preprocessed multiple video images to obtain the feature set, the depth map is generated according to the feature set, and the depth map is processed according to the point cloud data processing technology to obtain a 3D image.
  • the feature set is thus converted into a depth map, and a 3D scene map is realized through point cloud data processing technology, which further reduces the cost of three-dimensional reproduction.
  • the above obtaining unit 301 can be used to implement the method described in step 101 above
  • the sampling unit 302 can be used to implement the method described in step 102 above
  • the above preprocessing unit 303 can be used to implement the method described in step 103 above
  • the above extraction unit 304 may be used to implement the method described in step 104 above
  • the generation unit 305 may be used to implement the method described in step 105 above
  • the processing unit 306 may be used to implement the method described in step 106 above, and so on.
  • the extraction unit 304 is specifically configured to:
  • a maximum value is selected from the plurality of image quality evaluation values, and the preprocessed video image corresponding to the maximum value is input to a preset convolutional neural network to obtain a feature set.
  • the preset convolutional neural network includes N downsampling layers, N upsampling layers, and convolutional layers, where N is an integer greater than 1;
  • the extraction unit 304 is specifically configured to: perform N downsamplings on the preprocessed video image corresponding to the maximum value through the N downsampling layers to obtain a downsampled video image, where at least one of the N downsamplings includes at least one of the following operations: a convolution operation, a pooling operation, and a normalization operation;
  • perform N upsamplings on the downsampled video image through the N upsampling layers to obtain an upsampled video image;
  • perform a convolution operation on the upsampled video image through the convolutional layer to obtain the feature set.
  • each video image in the plurality of video images includes a human face
  • the extraction unit 304 is specifically configured to: perform image segmentation on video image i to obtain a target face image; acquire the two-dimensional angle value of the target face image; acquire the two weight values corresponding to the two-dimensional angle value; perform a weighted operation according to the angle values and weight values to obtain the target angle value; and determine the image quality evaluation value corresponding to the target angle value according to the preset mapping relationship between angle values and angle quality evaluation values.
  • the feature set includes multiple feature points, and each feature point includes a coordinate position, feature direction, and feature size;
  • the generating unit 305 is specifically configured to:
  • the target depth value corresponding to each target feature value in the multiple target feature values is determined to obtain multiple target depth values, each target depth value corresponding to a coordinate position;
  • the depth map is constructed according to the plurality of target depth values.
  • FIG. 4 is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the present application.
  • the electronic device described in this embodiment includes: at least one input device 1000; at least one output device 2000; at least one processor 3000, such as a CPU; and a memory 4000. The input device 1000, output device 2000, processor 3000, and memory 4000 are connected through a bus 5000.
  • the input device 1000 may specifically be a touch panel, physical buttons, or a mouse.
  • the above output device 2000 may specifically be a display screen.
  • the above-mentioned memory 4000 may be a high-speed RAM memory or a non-volatile memory, such as a magnetic disk memory.
  • the above memory 4000 is used to store a set of program codes, and the above input device 1000, output device 2000, and processor 3000 are used to call the program codes stored in the memory 4000, and perform the following operations:
  • the aforementioned processor 3000 is configured to: obtain a video stream of a specified area through a single camera; sample the video stream to obtain multiple video images; preprocess the multiple video images; perform depth feature extraction according to the preprocessed multiple video images to obtain a feature set; generate a depth map according to the feature set; and process the depth map according to point cloud data processing technology to obtain a 3D image.
  • it can be seen that a video stream of a specified area is obtained through a single camera, the video stream is sampled to obtain multiple video images, and the multiple video images are preprocessed to obtain the preprocessed multiple video images.
  • depth feature extraction is performed according to the preprocessed multiple video images to obtain a feature set, a depth map is generated according to the feature set, and the depth map is processed according to the point cloud data processing technology to obtain a 3D image.
  • in this way, a single camera collects the video images; after sampling, preprocessing, and feature extraction, a feature set is obtained, converted into a depth map, and realized as a 3D scene map through point cloud data processing technology, which further reduces the cost of three-dimensional reproduction.
  • the processor 3000 is specifically used to:
  • a maximum value is selected from the plurality of image quality evaluation values, and the preprocessed video image corresponding to the maximum value is input to a preset convolutional neural network to obtain a feature set.
  • the preset convolutional neural network includes N downsampling layers, N upsampling layers, and convolutional layers, where N is an integer greater than 1;
  • the processor 3000 is specifically configured to: perform N downsamplings on the preprocessed video image corresponding to the maximum value through the N downsampling layers to obtain a downsampled video image, where at least one of the N downsamplings includes at least one of the following operations: a convolution operation, a pooling operation, and a normalization operation;
  • perform N upsamplings on the downsampled video image through the N upsampling layers to obtain an upsampled video image;
  • perform a convolution operation on the upsampled video image through the convolutional layer to obtain the feature set.
  • each video image in the plurality of video images includes a human face
  • the processor 3000 is specifically configured to: perform image segmentation on video image i to obtain a target face image, where video image i is any frame of the preprocessed multiple video images; obtain the two-dimensional angle value of the target face image, the two-dimensional angle value including an x angle value and a y angle value;
  • obtain the two weight values corresponding to the two-dimensional angle value, where the target first weight value corresponds to the x angle value, the target second weight value corresponds to the y angle value, and the sum of the target first weight value and the target second weight value is 1;
  • perform a weighted operation according to the x angle value, the y angle value, the target first weight value, and the target second weight value to obtain the target angle value; and determine, according to the preset mapping relationship between angle values and angle quality evaluation values, the image quality evaluation value corresponding to the target angle value.
  • An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a program, and when the program is executed, it includes some or all steps of any one of the image processing methods described in the foregoing method embodiments.

Abstract

An image processing method and a related product. The method comprises: obtaining the video stream of a designated region by means of a single camera (101); sampling the video stream to obtain multiple video images (102); preprocessing the multiple video images to obtain multiple preprocessed video images (103); according to the multiple preprocessed video images, performing depth feature extraction to obtain a feature set (104); according to the feature set, generating a depth map (105); and processing the depth map according to a point cloud data processing technique to obtain a 3D image (106). The implementation cost of three-dimensional reconstruction can be reduced.

Description

Image processing method and related products
This application claims priority to the Chinese patent application filed with the China Patent Office on December 29, 2018, with application number 201811643004.6 and the invention title "Image processing method and related products", the entire contents of which are incorporated in this application by reference.
Technical field
This application relates to the technical field of image processing, and in particular to an image processing method and related products.
Background art
With the development and progress of artificial intelligence technology, 3D reconstruction has been widely applied in many cutting-edge technologies; it is a common scientific problem and core technology in the fields of computer vision, medical image processing, scientific computing, virtual reality, and digital media creation.
Traditionally, three-dimensional reconstruction has mostly been based on scene point cloud data, which is typically acquired through multiple cameras, laser cameras, and the like; after acquisition, multiple further steps such as three-dimensional matching are required. This results in high system cost, high demands on the system's computing power, and the inability to achieve miniaturization.
Summary of the invention
The embodiments of the present application provide an image processing method and related products, which can reduce the implementation cost of three-dimensional reconstruction.
A first aspect of the embodiments of the present application provides an image processing method, including:
obtaining a video stream of a specified area through a single camera;
sampling the video stream to obtain multiple video images;
preprocessing the multiple video images to obtain the preprocessed multiple video images;
performing depth feature extraction according to the preprocessed multiple video images to obtain a feature set;
generating a depth map according to the feature set;
processing the depth map according to point cloud data processing technology to obtain a 3D image.
Optionally, performing depth feature extraction according to the multiple video images to obtain a feature set includes:
performing image quality evaluation on each video image in the preprocessed multiple video images to obtain multiple image quality evaluation values;
selecting a maximum value from the multiple image quality evaluation values, and inputting the preprocessed video image corresponding to the maximum value into a preset convolutional neural network to obtain the feature set.
Optionally, in the case where each video image in the multiple video images includes a human face,
performing image quality evaluation on each video image in the preprocessed multiple video images to obtain multiple image quality evaluation values includes:
performing image segmentation on video image i to obtain a target face image, where video image i is any frame of the preprocessed multiple video images;
acquiring the target face image, and acquiring a two-dimensional angle value of the target face image, where the two-dimensional angle value includes an x angle value and a y angle value;
acquiring two weight values corresponding to the two-dimensional angle value, where the target first weight value corresponds to the x angle value, the target second weight value corresponds to the y angle value, and the sum of the target first weight value and the target second weight value is 1;
performing a weighted operation according to the x angle value, the y angle value, the target first weight value, and the target second weight value to obtain a target angle value;
determining the image quality evaluation value corresponding to the target angle value according to the preset mapping relationship between angle values and angle quality evaluation values.
Optionally, acquiring the two weight values corresponding to the two-dimensional angle value includes:
obtaining a target environment brightness value;
determining the target mapping relationship corresponding to the target environment brightness value according to the preset mapping between environment brightness values and mapping relationships, where each mapping relationship includes a first mapping relationship between the angle value in the x direction and the first weight value;
determining the target first weight value corresponding to the x angle value according to the target mapping relationship;
determining the target second weight value according to the target first weight value.
A second aspect of the embodiments of the present application provides an image processing apparatus, including:
an acquisition unit, configured to acquire a video stream of a specified area through a single camera;
a sampling unit, configured to sample the video stream to obtain multiple video images;
a preprocessing unit, configured to preprocess the multiple video images to obtain the preprocessed multiple video images;
an extraction unit, configured to perform depth feature extraction according to the preprocessed multiple video images to obtain a feature set;
a generating unit, configured to generate a depth map according to the feature set;
a processing unit, configured to process the depth map according to point cloud data processing technology to obtain a 3D image.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the processor, and the programs include instructions for performing the steps in the first aspect of the embodiments of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application.
In a fifth aspect, an embodiment of the present application provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium storing a computer program, and the computer program is operable to cause a computer to execute part or all of the steps described in the first aspect of the embodiments of the present application. The computer program product may be a software installation package.
Implementing the embodiments of the present application has the following beneficial effects:
It can be seen that, with the image processing method and related products described in the embodiments of the present application, a video stream of a specified area is obtained through a single camera, the video stream is sampled to obtain multiple video images, the multiple video images are preprocessed to obtain the preprocessed multiple video images, depth feature extraction is performed according to the preprocessed multiple video images to obtain a feature set, a depth map is generated according to the feature set, and the depth map is processed according to point cloud data processing technology to obtain a 3D image. In this way, video images can be collected with a single camera and, after sampling, preprocessing, and feature extraction, a feature set is obtained, converted into a depth map, and realized as a 3D scene graph through point cloud data processing technology, thereby reducing the cost of three-dimensional reproduction.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings based on these drawings without creative work.
FIG. 1A is a schematic flowchart of an embodiment of an image processing method provided by an embodiment of the present application;
FIG. 1B is a schematic structural diagram of a preset convolutional neural network provided by an embodiment of the present application;
FIG. 1C is a demonstration effect diagram of a video image provided by an embodiment of the present application;
FIG. 1D is a depth map of the video image in FIG. 1C provided by an embodiment of the present application;
FIG. 1E is a simple schematic diagram of a point cloud data processing technology provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of another embodiment of an image processing method provided by an embodiment of the present application;
FIG. 3 is a schematic structural diagram of an embodiment of an image processing apparatus provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an embodiment of an image processing apparatus provided by an embodiment of the present application.
Detailed description
It should be noted that the electronic device in the embodiments of the present application can be connected to multiple cameras; each camera can be used to capture video images, and each camera can have a corresponding position mark or a corresponding number. Generally, cameras can be installed in public places, such as schools, museums, intersections, pedestrian streets, office buildings, garages, airports, hospitals, subway stations, stations, bus platforms, supermarkets, hotels, entertainment venues, and so on. After a camera captures a video image, the video image can be saved to the memory of the system where the electronic device is located. Multiple image libraries can be stored in the memory; each image library can contain different video images of the same person, and each image library can also be used to store video images of one area or video images taken by a specified camera.
Further optionally, in the embodiments of the present application, each frame of video image captured by the camera corresponds to one piece of attribute information, where the attribute information is at least one of the following: the shooting time of the video image, the position of the video image, the attribute parameters of the video image (format, size, resolution, etc.), the number of the video image, and the character attributes of the video image. The character attributes in the video image may include, but are not limited to: the number of characters in the video image, character positions, character angle values, ages, image quality, and so on.
The embodiments of the present application place very low requirements on the device: only a single camera capable of capturing RGB images or video is needed to complete data collection and point cloud generation, after which the point cloud data and the original RGB images are fed into the subsequent packaged pipeline to achieve three-dimensional reconstruction of the scene. The scene 3D reconstruction technology based on single-camera depth-of-field prediction can be divided into six modules: video stream acquisition, image preprocessing, depth feature extraction and scene depth map generation, depth-map-based point cloud data generation, matching and fusion of RGB images with point cloud data, and 3D object surface generation. Since video stream acquisition, the subsequent matching and fusion of RGB images with point cloud data, and 3D object surface generation are relatively mature, this application optimizes the method of generating point cloud data from the scene, greatly reducing the requirements on equipment and computing power.
Please refer to FIG. 1A, which is a schematic flowchart of an embodiment of an image processing method provided by an embodiment of the present application. The image processing method described in this embodiment includes the following steps:
101. Obtain a video stream of a specified area through a single camera.
In the embodiments of the present application, the electronic device may include a single camera, and the single camera may be a visible light camera. The specified area can be set by the user or defaulted by the system. In a specific implementation, the electronic device can shoot the specified area through the single camera at a preset time interval to obtain a video stream; the preset time interval can be set by the user or defaulted by the system.
102. Sample the video stream to obtain multiple video images.
In a specific implementation, the electronic device can capture the video stream collected by the camera after the camera is turned on and perform frame extraction on the acquired video stream, that is, sample the video stream according to a preset sampling frequency to obtain multiple video images; the preset sampling frequency can be set by the user or defaulted by the system.
103. Preprocess the multiple video images to obtain the preprocessed multiple video images.
The above preprocessing may include at least one of the following: scaling processing, noise reduction processing, image enhancement processing, and so on, which is not limited here. Specifically, the preprocessing may scale the image: each extracted frame is scaled or expanded to a height of 224 pixels and a width of 320 pixels before being fed into the feature extraction network for feature extraction.
104. Perform depth feature extraction according to the preprocessed multiple video images to obtain a feature set.
The electronic device can perform depth feature extraction on the preprocessed multiple video images. Specifically, the preprocessed multiple video images can be input into a preset convolutional neural network for depth feature extraction to obtain a feature set.
Optionally, in the above step 104, performing depth feature extraction according to the multiple video images to obtain a feature set may include the following steps:
41. Perform image quality evaluation on each video image in the preprocessed multiple video images to obtain multiple image quality evaluation values;
42. Select a maximum value from the multiple image quality evaluation values, and input the preprocessed video image corresponding to the maximum value into a preset convolutional neural network to obtain a feature set.
In the embodiments of the present application, the preset convolutional neural network may include operations such as convolution, pooling, and normalization; the purpose of these operations is to extract image features, remove redundant image information, and speed up the network. The extracted features include the outline, texture, and surface information of each object in the image, the edge information where objects meet, and the position information of each object in the entire scene; finally, a feature image containing the information of the entire image is generated. In a specific implementation, image quality evaluation can be performed on each of the preprocessed video images to obtain multiple image quality evaluation values; the maximum value among them can then be selected, and the preprocessed video image corresponding to the maximum value is input into the preset convolutional neural network to obtain the feature set.
Optionally, in the above step 41, image quality evaluation is performed on each video image in the preprocessed multiple video images to obtain multiple image quality evaluation values, which can be implemented as follows:
At least one image quality evaluation index may be used to perform image quality evaluation on each video image in the preprocessed multiple video images to obtain multiple image quality evaluation values.
The image quality evaluation indexes may include, but are not limited to: average grayscale, mean square deviation, entropy, edge retention, signal-to-noise ratio, and so on. It can be defined that the larger the obtained image quality evaluation value, the better the image quality.
It should be noted that evaluating image quality with a single evaluation index has certain limitations, so multiple image quality evaluation indexes can be used. Of course, more indexes are not always better: the more indexes there are, the higher the computational complexity of the evaluation, and the evaluation effect is not necessarily better. Therefore, when higher evaluation accuracy is required, 2 to 10 image quality evaluation indexes can be used. The number of indexes and which indexes to select depend on the specific implementation, and the indexes must also be chosen for the specific scene: the indexes selected for evaluation in a dark environment and in a bright environment may differ.
Optionally, when the required evaluation accuracy is low, a single image quality evaluation index can be used for the evaluation. For example, when entropy is the index, a larger entropy indicates better face image quality, and conversely, a smaller entropy indicates worse face image quality.
Optionally, when higher evaluation accuracy is required, multiple image quality evaluation indexes can be used to evaluate the image, with a weight set for each index, so that multiple per-index evaluation values are obtained and the final image quality evaluation value is derived from these values and their corresponding weights. For example, suppose the three indexes are A, B, and C, with weights a1, a2, and a3; when A, B, and C are used to evaluate an image, the evaluation value corresponding to A is b1, that corresponding to B is b2, and that corresponding to C is b3, so the final image quality evaluation value = a1*b1 + a2*b2 + a3*b3. In general, the larger the image quality evaluation value, the better the face image quality.
Optionally, the preset convolutional neural network includes N downsampling layers, N upsampling layers, and a convolutional layer, where N is an integer greater than 1. Step 42 above, inputting the preprocessed video image corresponding to the maximum value into the preset convolutional neural network to obtain a feature set, may include the following steps:
421. Downsample the preprocessed video image corresponding to the maximum value N times through the N downsampling layers to obtain a downsampled video image, where at least one of the N downsamplings includes at least one of the following operations: a convolution operation, a pooling operation, and a normalization operation;
422. Upsample the downsampled video image N times through the N upsampling layers to obtain an upsampled video image;
423. Perform a convolution operation on the upsampled video image through the convolutional layer to obtain the feature set.
In this embodiment of the present application, the preset convolutional neural network may include N downsampling layers, N upsampling layers, and a convolutional layer, where N is an integer greater than 1. The preset convolutional neural network can be understood as an encoding-decoding network: the N downsampling layers constitute the encoding process, and the N upsampling layers together with the convolutional layer constitute the decoding process.
As shown in FIG. 1B, the encoding process (inside the dashed frame on the left) performs feature extraction, obtaining the feature image through four downsamplings. Downsampling includes operations such as convolution, pooling, and normalization. The number of downsamplings was determined experimentally, balancing the speed and accuracy of the algorithm: in theory, more downsamplings improve accuracy but reduce overall speed, so four are used to balance the two. Downsampling also reduces the image size: each downsampling halves the height and width of the image, so a 224*320 input shrinks to 14*20 after four downsamplings. The decoding (upsampling) network on the right is therefore needed to restore the image to its original size, and it also completes the matching from the extracted feature image to the depth image. The number of upsamplings is likewise four, matching the number of downsamplings and balancing accuracy and speed.
In addition, the straight lines connecting the downsampling and upsampling stages represent skip connections, which improve the accuracy of the algorithm. A sketch of one possible architecture follows below.
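As one possible reading of this architecture, here is a minimal PyTorch sketch of a four-stage encoder-decoder with skip connections. The channel widths, kernel sizes, and the use of max pooling and bilinear upsampling are assumptions for illustration, not details fixed by this application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(cin, cout):
    # Convolution + normalization, two of the operations named in step 421.
    return nn.Sequential(
        nn.Conv2d(cin, cout, kernel_size=3, padding=1),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class EncoderDecoder(nn.Module):
    def __init__(self, widths=(32, 64, 128, 256)):
        super().__init__()
        self.enc = nn.ModuleList()
        cin = 3
        for w in widths:                               # four downsampling stages
            self.enc.append(conv_block(cin, w))
            cin = w
        self.dec = nn.ModuleList()
        for w in reversed(widths):                     # four upsampling stages
            self.dec.append(conv_block(cin + w, w))    # +w for the skip connection
            cin = w
        self.head = nn.Conv2d(widths[0], 1, kernel_size=3, padding=1)

    def forward(self, x):
        skips = []
        for stage in self.enc:
            x = stage(x)
            skips.append(x)                            # saved for the skip connection
            x = F.max_pool2d(x, 2)                     # halves height and width
        for stage, skip in zip(self.dec, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-2:],
                              mode="bilinear", align_corners=False)
            x = stage(torch.cat([x, skip], dim=1))
        return self.head(x)                            # one-channel output map

# A 224*320 input comes back at full resolution after 4 down/up samplings.
net = EncoderDecoder()
out = net(torch.randn(1, 3, 224, 320))                 # torch.Size([1, 1, 224, 320])
```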
Optionally, step 104 above, performing depth feature extraction according to the preprocessed multiple video images to obtain a feature set, may be implemented as follows:
inputting the multiple video images into the preset convolutional neural network to obtain the feature set.
In this embodiment of the present application, the preset convolutional neural network may include operations such as convolution, pooling, and normalization. The purpose of these operations is to extract image features, remove redundant image information, and speed up the network. The extracted features include the contour, texture, and surface information of each object in the image, the edge information where objects meet, and the position of each object within the overall scene. A feature image containing the information of the whole image is finally generated.
Optionally, in the case where each of the multiple video images includes a human face,
step 41 above, performing image quality evaluation on each of the preprocessed multiple video images to obtain multiple image quality evaluation values, includes:
411. Perform image segmentation on a video image i to obtain a target face image, where the video image i is any frame among the preprocessed multiple video images;
412. Acquire the target face image, and acquire a two-dimensional angle value of the target face image, the two-dimensional angle value including an x angle value and a y angle value;
413. Acquire two weights corresponding to the two-dimensional angle value, where the x angle value corresponds to a target first weight, the y angle value corresponds to a target second weight, and the sum of the target first weight and the target second weight is 1;
414. Perform a weighted operation according to the x angle value, the y angle value, the target first weight, and the target second weight to obtain a target angle value;
415. Determine, according to a preset mapping relationship between angle values and angle quality evaluation values, the image quality evaluation value corresponding to the target angle value.
In this embodiment of the present application, the electronic device can perform image segmentation on any video image to obtain a face image. There is a certain angle between the face image and the camera; since the image is planar, it corresponds to a two-dimensional spatial coordinate system with an x angle value in the x direction and a y angle value in the y direction, which together accurately describe the angular relationship between the camera and the face image. Different angles affect recognition accuracy to a certain extent; for example, the face angle directly affects the number or quality of feature points. The two-dimensional angle value can be understood as the two-dimensional included angle between the face and the camera. Each of the two angle values may correspond to a weight, and the two weights may be preset or set by system default: the x angle value corresponds to the target first weight, the y angle value corresponds to the target second weight, and target first weight + target second weight = 1.
Further, target angle value = x angle value * target first weight + y angle value * target second weight. In this way, the two-dimensional angle value is converted into a one-dimensional angle value that accurately represents the angle of the face.
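A hedged Python sketch of steps 413 to 415 follows. The weight values and the angle-to-score table are placeholders for the preset mapping, whose contents this application leaves to the implementation.

```python
def angle_quality(x_angle, y_angle, w1, w2, angle_to_score):
    # Step 413: the two weights must sum to 1.
    assert abs(w1 + w2 - 1.0) < 1e-6
    # Step 414: collapse the 2-D angle into one target angle value.
    target = w1 * x_angle + w2 * y_angle
    # Step 415: look up the preset angle-to-quality mapping; a sorted
    # threshold table stands in for the stored mapping here.
    for upper_bound, score in angle_to_score:
        if target <= upper_bound:
            return score
    return 0.0

# A smaller deviation from a frontal pose yields a higher evaluation value.
score = angle_quality(10.0, 20.0, w1=0.6, w2=0.4,
                      angle_to_score=[(15.0, 0.9), (30.0, 0.7), (60.0, 0.4)])
```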
Optionally, step 413 above, acquiring the two weights corresponding to the two-dimensional angle value, may include the following steps:
21. Acquire a target ambient brightness value;
22. Determine, according to a preset mapping relationship between ambient brightness values and mapping relationships, a target mapping relationship corresponding to the target ambient brightness value, where each mapping relationship includes a first mapping relationship between angle values in the x direction and first weights;
23. Determine the target first weight corresponding to the x angle value according to the target mapping relationship;
24. Determine the target second weight according to the target first weight.
In a specific implementation, the target ambient brightness value may be acquired through an ambient light sensor, and the preset mapping relationship between ambient brightness values and mapping relationships may be stored in advance, each mapping relationship including a first mapping relationship between angle values in the x direction and first weights. The target mapping relationship corresponding to the target ambient brightness value can then be determined according to the preset mapping relationship between ambient brightness values and mapping relationships; the target first weight corresponding to the x angle value is determined according to the target mapping relationship, and target second weight = 1 - target first weight. Since the range of face angles that can be recognized differs under different ambient light, determining light-dependent weights in this way helps evaluate the face accurately; and because different ambient light corresponds to different evaluation rules, the face angle can be evaluated precisely.
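The brightness-dependent selection of the first weight might look like the following sketch. The lux bands and weight values are invented for illustration, since the application only requires that such mappings be preset.

```python
# Each entry: (maximum ambient brightness, {x-angle upper bound: first weight}).
# All numbers below are hypothetical.
BRIGHTNESS_TO_MAPPING = [
    (50.0,   {30.0: 0.7, 90.0: 0.6}),   # dark environment
    (1000.0, {30.0: 0.6, 90.0: 0.5}),   # ordinary indoor lighting
]

def weights_for(ambient_brightness, x_angle):
    # Step 22: pick the mapping that matches the measured brightness.
    mapping = BRIGHTNESS_TO_MAPPING[-1][1]
    for max_lux, candidate in BRIGHTNESS_TO_MAPPING:
        if ambient_brightness <= max_lux:
            mapping = candidate
            break
    # Step 23: first weight from the x-angle band.
    w1 = 0.5
    for upper_bound, weight in sorted(mapping.items()):
        if x_angle <= upper_bound:
            w1 = weight
            break
    # Step 24: the second weight follows from the first.
    return w1, 1.0 - w1
```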
104. Determine, according to the preset mapping relationship between angle values and angle quality evaluation values, a first target evaluation value corresponding to the target angle value.
The face evaluation device may store in advance the preset mapping relationship between angle values and angle quality evaluation values, and then determine the first target evaluation value corresponding to the target angle value according to this mapping relationship. Further, if the first target evaluation value is greater than a preset evaluation threshold, it can be understood that the face image is easy to recognize and will most likely be recognized successfully. A face at such an angle can be used for face unlocking, or can be used for camera capture, which improves the face capture efficiency of the face evaluation device.
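The threshold check described above reduces to a one-line predicate; the threshold value itself is an illustrative assumption.

```python
PRESET_EVALUATION_THRESHOLD = 0.8  # hypothetical value

def usable_face(first_target_evaluation_value):
    # A face whose evaluation value exceeds the threshold is treated as easy
    # to recognize, and so eligible for unlocking or capture.
    return first_target_evaluation_value > PRESET_EVALUATION_THRESHOLD
```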
105. Generate a depth map according to the feature set.
The feature set mentioned above is also called a feature map. The feature map is not yet the final depth image, so the decoding network is necessary. In a depth image, the value of each point is not the pixel value of an ordinary image but the distance of that point from the camera, in millimeters. FIG. 1C and FIG. 1D give an example of an RGB image and its depth map: FIG. 1C shows one frame of video image, and FIG. 1D shows the corresponding depth map rendered as a grayscale image, obtained by processing the distance values in the depth map. The farther a point is from the lens, the lower its gray value and the closer its color appears to black; conversely, the nearer a point is to the lens, the higher its gray value and the closer its color appears to white.
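The grayscale rendering described for FIG. 1D can be sketched as below; the clipping range is an assumption, since the text only specifies that nearer points appear brighter.

```python
import numpy as np

def depth_to_gray(depth_mm, near=500.0, far=5000.0):
    # Values are distances in millimeters, not pixel intensities.
    d = np.clip(depth_mm.astype(np.float32), near, far)
    # Invert so points near the lens get high gray values (toward white) and
    # points far from the lens get low gray values (toward black).
    return ((far - d) / (far - near) * 255.0).astype(np.uint8)
```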
Optionally, the feature set includes multiple feature points, each feature point including a coordinate position, a feature direction, and a feature size. Step 105 above, generating a depth map according to the feature set, may include the following steps:
51. Calculate a feature value according to the feature direction and feature size of each feature point in the feature set to obtain multiple target feature values, each feature point corresponding to one feature value;
52. Determine, according to a preset mapping relationship between feature values and depth values, a target depth value corresponding to each of the multiple target feature values to obtain multiple target depth values, each target depth value corresponding to one coordinate position;
53. Construct the depth map from the multiple target depth values.
The feature set may include multiple feature points, each including a coordinate position, a feature size, and a feature direction. Since a feature point is a vector, its feature value can be calculated from its feature size and feature direction; in this way, the feature value corresponding to each feature point in the feature set can be calculated, yielding multiple target feature values, one per feature point. The electronic device may also store in advance the preset mapping relationship between feature values and depth values, and then determine, according to this mapping relationship, the target depth value corresponding to each of the multiple target feature values, obtaining multiple target depth values, each corresponding to one coordinate position. The depth map is constructed from these target depth values; in this way, the depth map is built from the feature points.
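A minimal sketch of steps 51 to 53 follows. The scalarization of (direction, size) into a feature value and the value-to-depth table are illustrative assumptions; the application only specifies that such a preset mapping exists.

```python
import math

def feature_value(direction_rad, magnitude):
    # Step 51: one plausible way to turn a vector-valued feature point
    # (direction, size) into a scalar; the exact formula is an assumption.
    return magnitude * math.cos(direction_rad)

def build_depth_map(feature_points, value_to_depth, shape):
    depth = [[0.0] * shape[1] for _ in range(shape[0])]
    for (row, col), direction, magnitude in feature_points:
        v = feature_value(direction, magnitude)
        # Step 52: the preset value-to-depth mapping, approximated here by
        # nearest-neighbour lookup in a small table of (value, depth) pairs.
        _, d = min(value_to_depth, key=lambda pair: abs(pair[0] - v))
        depth[row][col] = d   # step 53: place the depth at its coordinate
    return depth

table = [(0.0, 4000.0), (0.5, 2000.0), (1.0, 800.0)]   # hypothetical mapping
dmap = build_depth_map([((0, 1), 0.0, 0.9)], table, shape=(2, 3))
```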
106. Process the depth map according to point cloud data processing technology to obtain a 3D image.
Each point in the depth map records the distance from the corresponding point in the original image to the camera. Point cloud generation is essentially a mapping of points between coordinate systems, namely the process of mapping any coordinate m(u, v) in the two-dimensional image to a spatial coordinate M(Xw, Yw, Zw) in the three-dimensional world. As shown in FIG. 1E, the resulting coordinate conversion formula is:
Xw = (u - u0) * dx * Zc / f
Yw = (v - v0) * dy * Zc / f
Zw = Zc
where M(Xw, Yw, Zw) is the world coordinate, m(u, v) is the depth map coordinate, and Zc is the value of each point in the depth map, that is, the distance of that point from the camera. u0 and v0 are the center coordinates of the two-dimensional image. dx and dy convert the distance unit into meters (1000 if the distance value is in millimeters). f is the focal length of the camera lens. Through this calculation, the two-dimensional depth map is converted into a three-dimensional map, namely the point cloud. Finally, point cloud data processing technology can be combined with the original RGB image to achieve three-dimensional reconstruction.
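Under this reading of the formula, the depth-to-point-cloud conversion can be sketched as follows. The intrinsic parameters f, u0, v0, dx, and dy are placeholders to be taken from the camera's calibration, and the unit handling is an assumption.

```python
import numpy as np

def depth_to_point_cloud(depth_mm, f, u0, v0, dx=1.0, dy=1.0):
    # depth_mm: H x W array; each value is Zc, the distance from the camera.
    v, u = np.indices(depth_mm.shape)
    Zc = depth_mm.astype(np.float32)
    Xw = (u - u0) * dx * Zc / f   # back-projection along the image x axis
    Yw = (v - v0) * dy * Zc / f   # back-projection along the image y axis
    # Stack into M(Xw, Yw, Zw) per pixel, with Zw = Zc.
    return np.stack([Xw, Yw, Zc], axis=-1)
```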
It can be seen that, with the image processing method described in this embodiment of the present application, a video stream of a specified area is acquired through a single camera; the video stream is sampled to obtain multiple video images; the multiple video images are preprocessed; depth feature extraction is performed according to the preprocessed video images to obtain a feature set; a depth map is generated according to the feature set; and the depth map is processed according to point cloud data processing technology to obtain a 3D image. In this way, video images can be collected by a single camera and, after sampling, preprocessing, and feature extraction, a feature set is obtained, converted into a depth map, and turned into a 3D scene map through point cloud data processing technology, thereby reducing the cost of three-dimensional reconstruction.
Consistent with the above, please refer to FIG. 2, which is a schematic flowchart of an embodiment of an image processing method according to an embodiment of the present application. The image processing method described in this embodiment includes the following steps:
201. Acquire a video stream of a specified area through a single camera.
202. Sample the video stream to obtain multiple video images.
203. Preprocess the multiple video images to obtain the preprocessed multiple video images.
204. Perform image quality evaluation on each of the preprocessed multiple video images to obtain multiple image quality evaluation values.
205. Select a maximum value from the multiple image quality evaluation values, and input the preprocessed video image corresponding to the maximum value into a preset convolutional neural network to obtain a feature set.
206. Generate a depth map according to the feature set.
207. Process the depth map according to point cloud data processing technology to obtain a 3D image.
For the image processing method described in steps 201 to 207 above, reference may be made to the corresponding steps of the image processing method described with respect to FIG. 1A.
It can be seen that, with the image processing method described in this embodiment of the present application, a video stream of a specified area is acquired through a single camera; the video stream is sampled to obtain multiple video images; the multiple video images are preprocessed; image quality evaluation is performed on each preprocessed video image to obtain multiple image quality evaluation values; the maximum value is selected from the multiple image quality evaluation values, and the preprocessed video image corresponding to the maximum value is input into a preset convolutional neural network to obtain a feature set; a depth map is generated according to the feature set; and the depth map is processed according to point cloud data processing technology to obtain a 3D image. In this way, video images can be collected by a single camera and, after sampling, preprocessing, and feature extraction, a feature set is obtained, converted into a depth map, and turned into a 3D scene map through point cloud data processing technology, thereby reducing the cost of three-dimensional reconstruction.
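Putting steps 201 to 207 together, the control flow of this embodiment can be sketched as below. The stage implementations are passed in as callables because their details are described elsewhere in this application; `frames` is assumed to already hold the video images sampled from the single-camera stream (steps 201 and 202), and all names here are illustrative.

```python
def reconstruct_3d(frames, preprocess, quality, extract, to_depth, to_cloud):
    frames = [preprocess(f) for f in frames]       # step 203
    scores = [quality(f) for f in frames]          # step 204
    best = frames[scores.index(max(scores))]       # step 205: highest score
    features = extract(best)                       # step 205: preset CNN
    depth = to_depth(features)                     # step 206
    return to_cloud(depth)                         # step 207: 3D image
```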
Consistent with the above, the following is an apparatus for implementing the above image processing method, specifically as follows:
Please refer to FIG. 3, which is a schematic structural diagram of an embodiment of an image processing apparatus according to an embodiment of the present application. The image processing apparatus described in this embodiment includes an acquisition unit 301, a sampling unit 302, a preprocessing unit 303, an extraction unit 304, a generation unit 305, and a processing unit 306, specifically as follows:
an acquisition unit 301, configured to acquire a video stream of a specified area through a single camera;
a sampling unit 302, configured to sample the video stream to obtain multiple video images;
a preprocessing unit 303, configured to preprocess the multiple video images to obtain the preprocessed multiple video images;
an extraction unit 304, configured to perform depth feature extraction according to the preprocessed multiple video images to obtain a feature set;
a generation unit 305, configured to generate a depth map according to the feature set;
a processing unit 306, configured to process the depth map according to point cloud data processing technology to obtain a 3D image.
It can be seen that, with the image processing apparatus described in this embodiment of the present application, a video stream of a specified area is acquired through a single camera; the video stream is sampled to obtain multiple video images; the multiple video images are preprocessed; depth feature extraction is performed according to the preprocessed video images to obtain a feature set; a depth map is generated according to the feature set; and the depth map is processed according to point cloud data processing technology to obtain a 3D image. In this way, video images can be collected by a single camera and, after sampling, preprocessing, and feature extraction, a feature set is obtained, converted into a depth map, and turned into a 3D scene map through point cloud data processing technology, thereby reducing the cost of three-dimensional reconstruction.
The acquisition unit 301 may be used to implement the method described in step 101 above, the sampling unit 302 the method described in step 102, the preprocessing unit 303 the method described in step 103, the extraction unit 304 the method described in step 104, the generation unit 305 the method described in step 105, and the processing unit 306 the method described in step 106, and so on below.
Optionally, in terms of performing depth feature extraction according to the multiple video images to obtain a feature set, the extraction unit 304 is specifically configured to:
perform image quality evaluation on each of the preprocessed multiple video images to obtain multiple image quality evaluation values; and
select a maximum value from the multiple image quality evaluation values, and input the preprocessed video image corresponding to the maximum value into a preset convolutional neural network to obtain a feature set.
Optionally, the preset convolutional neural network includes N downsampling layers, N upsampling layers, and a convolutional layer, where N is an integer greater than 1;
in terms of inputting the preprocessed video image corresponding to the maximum value into the preset convolutional neural network to obtain a feature set, the extraction unit 304 is specifically configured to:
downsample the preprocessed video image corresponding to the maximum value N times through the N downsampling layers to obtain a downsampled video image, where at least one of the N downsamplings includes at least one of the following operations: a convolution operation, a pooling operation, and a normalization operation;
upsample the downsampled video image N times through the N upsampling layers to obtain an upsampled video image; and
perform a convolution operation on the upsampled video image through the convolutional layer to obtain the feature set.
Optionally, in the case where each of the multiple video images includes a human face,
in terms of performing image quality evaluation on each of the preprocessed multiple video images to obtain multiple image quality evaluation values, the extraction unit 304 is specifically configured to:
perform image segmentation on a video image i to obtain a target face image, where the video image i is any frame among the preprocessed multiple video images;
acquire the target face image, and acquire a two-dimensional angle value of the target face image, the two-dimensional angle value including an x angle value and a y angle value;
acquire two weights corresponding to the two-dimensional angle value, where the x angle value corresponds to a target first weight, the y angle value corresponds to a target second weight, and the sum of the target first weight and the target second weight is 1;
perform a weighted operation according to the x angle value, the y angle value, the target first weight, and the target second weight to obtain a target angle value; and
determine, according to a preset mapping relationship between angle values and angle quality evaluation values, the image quality evaluation value corresponding to the target angle value.
Optionally, the feature set includes multiple feature points, each feature point including a coordinate position, a feature direction, and a feature size;
in terms of generating a depth map according to the feature set, the generation unit 305 is specifically configured to:
calculate a feature value according to the feature direction and feature size of each feature point in the feature set to obtain multiple target feature values, each feature point corresponding to one feature value;
determine, according to a preset mapping relationship between feature values and depth values, a target depth value corresponding to each of the multiple target feature values to obtain multiple target depth values, each target depth value corresponding to one coordinate position; and
construct the depth map from the multiple target depth values.
It can be understood that the functions of the program modules of the image processing apparatus of this embodiment may be specifically implemented according to the methods in the foregoing method embodiments; for the specific implementation process, reference may be made to the relevant descriptions of the foregoing method embodiments, which are not repeated here.
Consistent with the above, please refer to FIG. 4, which is a schematic structural diagram of an embodiment of an electronic device according to an embodiment of the present application. The electronic device described in this embodiment includes: at least one input device 1000; at least one output device 2000; at least one processor 3000, for example a CPU; and a memory 4000. The input device 1000, the output device 2000, the processor 3000, and the memory 4000 are connected through a bus 5000.
The input device 1000 may specifically be a touch panel, a physical button, or a mouse.
The output device 2000 may specifically be a display screen.
The memory 4000 may be a high-speed RAM memory or a non-volatile memory, for example a magnetic disk memory. The memory 4000 is configured to store a set of program codes, and the input device 1000, the output device 2000, and the processor 3000 are configured to call the program codes stored in the memory 4000 to perform the following operations:
The processor 3000 is configured to:
acquire a video stream of a specified area through a single camera;
sample the video stream to obtain multiple video images;
preprocess the multiple video images to obtain the preprocessed multiple video images;
perform depth feature extraction according to the preprocessed multiple video images to obtain a feature set;
generate a depth map according to the feature set; and
process the depth map according to point cloud data processing technology to obtain a 3D image.
It can be seen that, with the electronic device described in this embodiment of the present application, a video stream of a specified area is acquired through a single camera; the video stream is sampled to obtain multiple video images; the multiple video images are preprocessed; depth feature extraction is performed according to the preprocessed video images to obtain a feature set; a depth map is generated according to the feature set; and the depth map is processed according to point cloud data processing technology to obtain a 3D image. In this way, video images can be collected by a single camera and, after sampling, preprocessing, and feature extraction, a feature set is obtained, converted into a depth map, and turned into a 3D scene map through point cloud data processing technology, thereby reducing the cost of three-dimensional reconstruction.
Optionally, in terms of performing depth feature extraction according to the multiple video images to obtain a feature set, the processor 3000 is specifically configured to:
perform image quality evaluation on each of the preprocessed multiple video images to obtain multiple image quality evaluation values; and
select a maximum value from the multiple image quality evaluation values, and input the preprocessed video image corresponding to the maximum value into a preset convolutional neural network to obtain a feature set.
Optionally, the preset convolutional neural network includes N downsampling layers, N upsampling layers, and a convolutional layer, where N is an integer greater than 1;
in terms of inputting the preprocessed video image corresponding to the maximum value into the preset convolutional neural network to obtain a feature set, the processor 3000 is specifically configured to:
downsample the preprocessed video image corresponding to the maximum value N times through the N downsampling layers to obtain a downsampled video image, where at least one of the N downsamplings includes at least one of the following operations: a convolution operation, a pooling operation, and a normalization operation;
upsample the downsampled video image N times through the N upsampling layers to obtain an upsampled video image; and
perform a convolution operation on the upsampled video image through the convolutional layer to obtain the feature set.
Optionally, in the case where each of the multiple video images includes a human face,
in terms of performing image quality evaluation on each of the preprocessed multiple video images to obtain multiple image quality evaluation values, the processor 3000 is specifically configured to: perform image segmentation on a video image i to obtain a target face image, where the video image i is any frame among the preprocessed multiple video images; acquire the target face image, and acquire a two-dimensional angle value of the target face image, the two-dimensional angle value including an x angle value and a y angle value; acquire two weights corresponding to the two-dimensional angle value, where the x angle value corresponds to a target first weight, the y angle value corresponds to a target second weight, and the sum of the target first weight and the target second weight is 1; perform a weighted operation according to the x angle value, the y angle value, the target first weight, and the target second weight to obtain a target angle value; and determine, according to a preset mapping relationship between angle values and angle quality evaluation values, the image quality evaluation value corresponding to the target angle value.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a program, and when executed the program includes some or all of the steps of any image processing method described in the foregoing method embodiments.

Claims (10)

  1. An image processing method, comprising:
    acquiring a video stream of a specified area through a single camera;
    sampling the video stream to obtain multiple video images;
    preprocessing the multiple video images to obtain the preprocessed multiple video images;
    performing depth feature extraction according to the preprocessed multiple video images to obtain a feature set;
    generating a depth map according to the feature set; and
    processing the depth map according to point cloud data processing technology to obtain a 3D image.
  2. The method according to claim 1, wherein the performing depth feature extraction according to the preprocessed multiple video images to obtain a feature set comprises:
    performing image quality evaluation on each of the preprocessed multiple video images to obtain multiple image quality evaluation values; and
    selecting a maximum value from the multiple image quality evaluation values, and inputting the preprocessed video image corresponding to the maximum value into a preset convolutional neural network to obtain a feature set.
  3. The method according to claim 2, wherein the preset convolutional neural network comprises N downsampling layers, N upsampling layers, and a convolutional layer, N being an integer greater than 1;
    the inputting the preprocessed video image corresponding to the maximum value into a preset convolutional neural network to obtain a feature set comprises:
    downsampling the preprocessed video image corresponding to the maximum value N times through the N downsampling layers to obtain a downsampled video image, wherein at least one of the N downsamplings comprises at least one of the following operations: a convolution operation, a pooling operation, and a normalization operation;
    upsampling the downsampled video image N times through the N upsampling layers to obtain an upsampled video image; and
    performing a convolution operation on the upsampled video image through the convolutional layer to obtain the feature set.
  4. The method according to claim 2, wherein, in a case where each of the multiple video images includes a human face,
    the performing image quality evaluation on each of the preprocessed multiple video images to obtain multiple image quality evaluation values comprises:
    performing image segmentation on a video image i to obtain a target face image, wherein the video image i is any frame among the preprocessed multiple video images;
    acquiring the target face image, and acquiring a two-dimensional angle value of the target face image, the two-dimensional angle value comprising an x angle value and a y angle value;
    acquiring two weights corresponding to the two-dimensional angle value, wherein the x angle value corresponds to a target first weight, the y angle value corresponds to a target second weight, and a sum of the target first weight and the target second weight is 1;
    performing a weighted operation according to the x angle value, the y angle value, the target first weight, and the target second weight to obtain a target angle value; and
    determining, according to a preset mapping relationship between angle values and angle quality evaluation values, an image quality evaluation value corresponding to the target angle value.
  5. The method according to any one of claims 1 to 4, wherein the feature set comprises multiple feature points, each feature point comprising a coordinate position, a feature direction, and a feature size;
    the generating a depth map according to the feature set comprises:
    calculating a feature value according to the feature direction and feature size of each feature point in the feature set to obtain multiple target feature values, each feature point corresponding to one target feature value;
    determining, according to a preset mapping relationship between feature values and depth values, a target depth value corresponding to each of the multiple target feature values to obtain multiple target depth values, each target depth value corresponding to one coordinate position; and
    constructing the depth map from the multiple target depth values.
  6. An image processing apparatus, comprising:
    an acquisition unit, configured to acquire a video stream of a specified area through a single camera;
    a sampling unit, configured to sample the video stream to obtain multiple video images;
    a preprocessing unit, configured to preprocess the multiple video images to obtain the preprocessed multiple video images;
    an extraction unit, configured to perform depth feature extraction according to the preprocessed multiple video images to obtain a feature set;
    a generation unit, configured to generate a depth map according to the feature set; and
    a processing unit, configured to process the depth map according to point cloud data processing technology to obtain a 3D image.
  7. The apparatus according to claim 6, wherein, in terms of performing depth feature extraction according to the multiple video images to obtain a feature set, the extraction unit is specifically configured to:
    perform image quality evaluation on each of the preprocessed multiple video images to obtain multiple image quality evaluation values; and
    select a maximum value from the multiple image quality evaluation values, and input the preprocessed video image corresponding to the maximum value into a preset convolutional neural network to obtain a feature set.
  8. The apparatus according to claim 7, wherein the preset convolutional neural network comprises N downsampling layers, N upsampling layers, and a convolutional layer, N being an integer greater than 1;
    in terms of inputting the preprocessed video image corresponding to the maximum value into the preset convolutional neural network to obtain a feature set, the extraction unit is specifically configured to:
    downsample the preprocessed video image corresponding to the maximum value N times through the N downsampling layers to obtain a downsampled video image, wherein at least one of the N downsamplings comprises at least one of the following operations: a convolution operation, a pooling operation, and a normalization operation;
    upsample the downsampled video image N times through the N upsampling layers to obtain an upsampled video image; and
    perform a convolution operation on the upsampled video image through the convolutional layer to obtain the feature set.
  9. An electronic device, comprising a processor and a memory, wherein the memory is configured to store one or more programs configured to be executed by the processor, and the programs comprise instructions for performing the steps in the method according to any one of claims 1 to 5.
  10. A computer-readable storage medium storing a computer program, wherein the computer program is executed by a processor to implement the method according to any one of claims 1 to 5.
PCT/CN2019/121345 2018-12-29 2019-11-27 Image processing method and related product WO2020134818A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811643004.6 2018-12-29
CN201811643004.6A CN109754461A (en) 2018-12-29 2018-12-29 Image processing method and related product

Publications (1)

Publication Number Publication Date
WO2020134818A1 true WO2020134818A1 (en) 2020-07-02

Family

ID=66404534




Also Published As

Publication number Publication date
CN109754461A (en) 2019-05-14

