CN116962707A - Image processing method, apparatus, system, storage medium, and program product

Info

Publication number
CN116962707A
Authority
CN
China
Prior art keywords
video frame
monocular
frame image
module
distortion information
Prior art date
Legal status
Pending
Application number
CN202311036127.4A
Other languages
Chinese (zh)
Inventor
Shi Jie
Wang Baoshan
Wang Chi
Current Assignee
Guangzhou Ankai Microelectronics Co., Ltd.
Original Assignee
Guangzhou Ankai Microelectronics Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Guangzhou Ankai Microelectronics Co., Ltd.
Priority to CN202311036127.4A
Publication of CN116962707A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/10: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N 19/169: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N 19/17: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, e.g. an object
    • H04N 19/172: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, the unit being an image region, the region being a picture, frame or field
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/46: Embedding additional information in the video signal during the compression process
    • H04N 19/463: Embedding additional information in the video signal during the compression process by compressing encoding parameters before transmission

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The present application relates to an image processing method, apparatus, system, storage medium, and program product. The method comprises: acquiring a plurality of monocular video frame images output by a shooting module in real time, and encoding each monocular video frame image separately using an encoding module to obtain a plurality of monocular encoded video frame images; generating a transmission data packet from the plurality of monocular encoded video frame images and camera distortion information, and sending the transmission data packet to a decoding module; and, after decoding each monocular encoded video frame image in the transmission data packet using the decoding module, correcting each monocular decoded video frame image based on the camera distortion information and generating a target multi-view video frame image from the corrected frame images. The method improves the definition of multi-view images while avoiding increased hardware cost.

Description

Image processing method, apparatus, system, storage medium, and program product
Technical Field
The present application relates to the field of image processing technology, and in particular, to an image processing method, apparatus, system, storage medium, and program product.
Background
With the continuous development of image processing technology, application scenarios for multi-view images are becoming increasingly broad, and multi-view images are of great significance in many fields. As high-resolution video becomes widespread, 4K and 8K video coding has been proposed and applied to scenarios such as binocular or panoramic multi-view live video and VR.
In the related art, multi-view images have high resolution, which places high hardware requirements on multi-view image encoding and decoding, so that multi-view image definition and hardware cost cannot be balanced.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an image processing method, apparatus, system, storage medium, and program product that can improve the definition of multi-view images without increasing hardware cost.
In a first aspect, the present application provides an image processing method. The method is applied to an image processing system that includes a shooting module, an encoding module, and a decoding module, and the method comprises the following steps:
acquiring a plurality of monocular video frame images output by the shooting module in real time, and encoding each monocular video frame image separately using the encoding module to obtain a plurality of monocular encoded video frame images;
generating a transmission data packet from the plurality of monocular encoded video frame images and camera distortion information, and sending the transmission data packet to the decoding module;
and, after decoding each monocular encoded video frame image in the transmission data packet using the decoding module, correcting each monocular decoded video frame image obtained by the decoding based on the camera distortion information, and generating a target multi-view video frame image from the corrected frame images.
In one embodiment, the camera distortion information includes independent distortion information corresponding to each of the different cameras in the shooting module, and the process of obtaining the camera distortion information includes: acquiring preset corrected video frame images corresponding to the different cameras, and determining the pixel coordinates of each pixel in each preset corrected video frame image; inputting the pixel coordinates of each pixel of each preset corrected video frame image into a preset conversion model to obtain the pixel coordinates of each pixel in the distorted video frame image corresponding to that preset corrected video frame image; and using the pixel coordinates of each pixel in each distorted video frame image as the independent distortion information corresponding to the respective camera.
In one embodiment, the camera distortion information further includes associated distortion information between the cameras, and the process of obtaining the camera distortion information includes: determining initial calibration parameters between the cameras; and determining the associated distortion information between the cameras from the initial calibration parameters and the distortion coefficients.
In one embodiment, generating a transmission data packet from the plurality of monocular encoded video frame images and camera distortion information includes: encapsulating the plurality of monocular encoded video frame images and the camera distortion information using a network abstraction layer to obtain the transmission data packet.
In one embodiment, correcting each monocular decoded video frame image obtained by the decoding based on the camera distortion information includes: performing a primary correction of the distorted pixels in each monocular decoded video frame image according to the independent distortion information corresponding to the respective camera to obtain a plurality of candidate monocular decoded video frame images; and performing a secondary correction of the distorted pixels of each candidate monocular decoded video frame image according to the associated distortion information to obtain a plurality of corrected frame images.
In one embodiment, generating a target multi-view video frame image from the corrected frame images includes: inputting the corrected frame images into a region conversion model to obtain the overlapping regions between the corrected frame images output by the region conversion model; and stitching the corrected frame images based on the overlapping regions to obtain the target multi-view video frame image.
In a second aspect, the present application also provides an image processing apparatus. The apparatus is applied to an image processing system that includes a shooting module, an encoding module, and a decoding module, and the apparatus comprises:
an acquisition module, configured to acquire a plurality of monocular video frame images output by the shooting module in real time, and to encode each monocular video frame image separately using the encoding module to obtain a plurality of monocular encoded video frame images;
a generating module, configured to generate a transmission data packet from the plurality of monocular encoded video frame images and camera distortion information, and to send the transmission data packet to the decoding module;
and a stitching module, configured to decode each monocular encoded video frame image in the transmission data packet using the decoding module, correct each monocular decoded video frame image obtained by the decoding based on the camera distortion information, and generate a target multi-view video frame image from the corrected frame images.
In one embodiment, the camera distortion information includes independent distortion information corresponding to each of the different cameras in the shooting module, and the apparatus further includes an information obtaining module configured to: acquire preset corrected video frame images corresponding to the different cameras, and determine the pixel coordinates of each pixel in each preset corrected video frame image; input the pixel coordinates of each pixel of each preset corrected video frame image into a preset conversion model to obtain the pixel coordinates of each pixel in the distorted video frame image corresponding to that preset corrected video frame image; and use the pixel coordinates of each pixel in each distorted video frame image as the independent distortion information corresponding to the respective camera.
In one embodiment, the camera distortion information further includes associated distortion information between the cameras, and the information obtaining module is further configured to: determine initial calibration parameters between the cameras; and determine the associated distortion information between the cameras from the initial calibration parameters and the distortion coefficients.
In one embodiment, the generating module is specifically configured to: encapsulate the plurality of monocular encoded video frame images and the camera distortion information using a network abstraction layer to obtain the transmission data packet.
In one embodiment, the stitching module is specifically configured to: perform a primary correction of the distorted pixels in each monocular decoded video frame image according to the independent distortion information corresponding to the respective camera to obtain a plurality of candidate monocular decoded video frame images; and perform a secondary correction of the distorted pixels of each candidate monocular decoded video frame image according to the associated distortion information to obtain a plurality of corrected frame images.
In one embodiment, the stitching module is specifically configured to: input the corrected frame images into a region conversion model to obtain the overlapping regions between the corrected frame images output by the region conversion model; and stitch the corrected frame images based on the overlapping regions to obtain the target multi-view video frame image.
In a third aspect, the present application also provides an image processing system, where the image processing system includes a shooting module, an encoding module, and a decoding module; the image processing system is configured to implement the steps of the method of any of the above first aspects.
In a fourth aspect, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects above.
In a fifth aspect, the present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects above.
With the above image processing method, apparatus, system, storage medium, and program product, a plurality of monocular video frame images output by a shooting module in real time are acquired, and each monocular video frame image is encoded separately by an encoding module to obtain a plurality of monocular encoded video frame images; a transmission data packet is generated from the plurality of monocular encoded video frame images and the camera distortion information and sent to a decoding module; after the decoding module decodes each monocular encoded video frame image in the transmission data packet, each monocular decoded video frame image is corrected based on the camera distortion information, and a target multi-view video frame image is generated from the corrected frame images. In this way, because the encoding module encodes the monocular video frame images directly, no hardware upgrade is needed compared with encoding a stitched target multi-view video frame image; correspondingly, the decoding module likewise decodes monocular encoded video frame images, which can be done with the existing hardware. Furthermore, the method corrects the monocular decoded video frame images with a software algorithm to obtain the target multi-view video frame image, ensuring a high-definition target multi-view video frame image while avoiding an increase in hardware cost, and thus balancing multi-view image definition against hardware cost in the image processing system.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that a person skilled in the art can obtain other drawings from these drawings without inventive effort.
FIG. 1 is a flow chart of an image processing method in one embodiment;
FIG. 2 is a schematic diagram of a distorted image in one embodiment;
FIG. 3 is a flow chart of acquiring camera distortion information in one embodiment;
FIG. 4 is a flowchart of another method for obtaining camera distortion information according to an embodiment;
FIG. 5 is a flow diagram of a correction process in one embodiment;
FIG. 6 is a flow diagram of a stitching process in one embodiment;
FIG. 7 is a flow chart of a method of stitching multiple images in one embodiment;
FIG. 8 is a block diagram showing the structure of an image processing apparatus in one embodiment;
FIG. 9 is an internal structure diagram of a computer device in one embodiment.
Detailed Description
In order that the above objects, features, and advantages of the present application may be readily understood, a more particular description of the application is given below with reference to the accompanying drawings. Numerous specific details are set forth in the following description to provide a thorough understanding of the present application; however, the present application can be practiced in many ways other than those described herein, and persons skilled in the art can make similar modifications without departing from the spirit of the present application, so the present application is not limited to the specific embodiments disclosed below.
With the development of computer technology, application scenarios for binocular stitching are becoming increasingly broad. Traditional binocular image stitching refers to stitching two frames of the same scene into one larger image, and is of great significance in fields such as medical imaging, computer vision, and satellite data. With the popularization of high-resolution video, 4K and even 8K video coding has been proposed and can be used for binocular or panoramic multi-view live video or VR scenarios. However, outputting multi-view images or multi-view video frames inevitably raises the requirements on the image processing module, video encoding hardware, image processing hardware, and so on. On the original hardware this forces the resolution and definition of the multi-view image to be reduced; if instead the hardware performance is improved, the cost rises, so the two cannot be balanced.
In view of this, an embodiment of the present application provides an image processing method that ensures the definition of high-resolution multi-view images while reducing the pressure on image processing hardware and encoding hardware. It lowers the performance requirements that multi-view image processing places on the encoder and the image processing module, removes the hardware limitation on multi-view image stitching, and moves the stitching to a decoding module that is less constrained and easier to extend.
The image processing method provided by the embodiment of the present application can be applied to an image processing system that includes a shooting module, an encoding module, and a decoding module. The shooting module may include a plurality of shooting devices, each containing a camera. The encoding module and the decoding module may be hardware modules in an image processing chip or an electronic device, implemented in hardware or in a combination of software and hardware; optionally, they may be integrated in the same electronic device, and the whole image processing system may be integrated in the same device. The electronic device may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, Internet of Things device, server, or portable wearable device; the Internet of Things device may be a smart speaker, smart television, smart air conditioner, smart vehicle device, and the like; the portable wearable device may be a smart watch, smart bracelet, headset, and the like. The server may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In one embodiment, as shown in FIG. 1, an image processing method is provided, including the following steps:
Step 101: acquire a plurality of monocular video frame images output by the shooting module in real time, and encode each monocular video frame image separately using the encoding module to obtain a plurality of monocular encoded video frame images.
The shooting module may include a plurality of shooting devices, each capturing real-time video. The video frames output by each shooting device are monocular video frame images, so the frames output simultaneously by the shooting devices of the shooting module form a plurality of monocular video frame images. It will be appreciated that a group of monocular video frame images may be the video frame images captured by the different shooting devices at the same point in time.
Because the target multi-view video frame image must be generated from a plurality of monocular video frame images, because the images are displayed only after the decoding module decodes them, and considering the security and efficiency of the transmission process as well as the hardware capability of the encoding module, in the embodiment of the present application each monocular video frame image is encoded separately by the encoding module, and the resulting monocular encoded video frame images are transmitted to the decoding end, as in the sketch below.
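As a rough, non-authoritative illustration of this step (not part of the patent text), the following Python sketch grabs one frame per camera and encodes each frame independently; the camera indices and the JPEG placeholder encoder are assumptions standing in for the system's real hardware video encoder.

    import cv2  # OpenCV; assumed available

    def capture_and_encode(camera_indices, encode_frame):
        """Grab one frame per camera and encode each frame on its own,
        so the encoder only ever sees single-camera resolution."""
        captures = [cv2.VideoCapture(i) for i in camera_indices]
        try:
            frames = []
            for cap in captures:
                ok, frame = cap.read()
                if not ok:
                    raise RuntimeError("camera read failed")
                frames.append(frame)
            return [encode_frame(f) for f in frames]
        finally:
            for cap in captures:
                cap.release()

    def jpeg_encode(frame):
        # Placeholder for the hardware H.264/H.265 encoder; JPEG is used
        # only to keep the sketch runnable.
        ok, buf = cv2.imencode(".jpg", frame)
        assert ok
        return buf.tobytes()

    encoded = capture_and_encode([0, 1], jpeg_encode)  # e.g. a binocular rig

Because each monocular frame is encoded on its own, the encoder load stays at single-camera resolution regardless of how many views are later stitched together.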
Step 102: generate a transmission data packet from the plurality of monocular encoded video frame images and the camera distortion information, and send the transmission data packet to the decoding module.
The images captured by the cameras of a multi-view shooting device may be distorted, as illustrated by way of example in FIG. 2. Therefore, to ensure that the image displayed after decoding is undistorted, the camera distortion information of each shooting device must be determined, so that the video frame images can be corrected and restored from the decoded images.
To improve image processing efficiency, in the embodiment of the present application the camera distortion information is transmitted to the decoding module together with the monocular encoded video frame images, so that after decoding, the decoding module can restore the images and generate the target multi-view video frame image.
Specifically, the encoding module may encapsulate the monocular encoded video frame images and the camera distortion information together into a transmission data packet and transmit it to the decoding module. It can be understood that, because the shooting module captures video in real time, the encoding module continuously generates groups of monocular encoded video frame images as the video is produced, combines them with the camera distortion information into successive transmission data packets, and transmits these packets to the decoding module.
Step 103: after decoding each monocular encoded video frame image in the transmission data packet using the decoding module, correct each monocular decoded video frame image obtained by the decoding based on the camera distortion information, and generate a target multi-view video frame image from the corrected frame images.
The decoding module receives the currently transmitted data packet and parses it to obtain the monocular encoded video frame images and the camera distortion information it contains. Each monocular encoded video frame image is then decoded to obtain a plurality of monocular decoded video frame images. It will be appreciated that the monocular decoded video frame images are substantially identical to the original monocular video frame images.
To improve the quality of the generated target multi-view video frame image, after decoding, each monocular decoded video frame image is corrected based on the camera distortion information to obtain a corresponding corrected frame image. The corrected frame images are then stitched directly by an algorithm to obtain the target multi-view video frame image; this avoids any impact of hardware encoding and decoding on the resolution of the target multi-view video frame image and guarantees its definition, frame rate, and so on.
The image processing system may save the target multi-view video frame image or present it in real time.
According to the image processing method, a plurality of monocular video frame images output by the shooting module in real time are acquired, and each monocular video frame image is encoded separately by the encoding module to obtain a plurality of monocular encoded video frame images; a transmission data packet is generated from the plurality of monocular encoded video frame images and the camera distortion information and sent to the decoding module; after the decoding module decodes each monocular encoded video frame image in the transmission data packet, each monocular decoded video frame image is corrected based on the camera distortion information, and a target multi-view video frame image is generated from the corrected frame images. In this way, because the encoding module encodes the monocular video frame images directly, no hardware upgrade is needed compared with encoding a stitched target multi-view video frame image; correspondingly, the decoding module likewise decodes monocular encoded video frame images, which can be done with the existing hardware. Furthermore, the method corrects the monocular decoded video frame images with a software algorithm to obtain the target multi-view video frame image, ensuring a high-definition target multi-view video frame image while avoiding an increase in hardware cost, and thus balancing multi-view image definition against hardware cost in the image processing system.
The procedure for acquiring the camera distortion information is described below.
Referring to FIG. 3, a flowchart of obtaining camera distortion information according to an embodiment of the present application is shown. The camera distortion information includes independent distortion information corresponding to each of the different cameras in the shooting module, and the process of obtaining the camera distortion information includes the following steps:
Step 301: acquire preset corrected video frame images corresponding to the different cameras, and determine the pixel coordinates of each pixel in each preset corrected video frame image.
In a vision measurement system there may be geometric distortion, including linear distortion such as translation, scaling, and rotation, and nonlinear distortion such as thin-prism distortion, tangential distortion, and radial distortion. Since geometric distortion mainly shifts image coordinates, the relationship between the pixel coordinates of the distorted image and those of the corrected (undistorted) image can be described by a preset model; for every pixel coordinate of the corrected image, the corresponding pixel coordinate on the distorted image can then be computed, and the entire undistorted image can be restored by pixel interpolation using these corresponding coordinates.
In the embodiment of the present application, optionally, for each shooting device in the shooting module, a video previously captured by the shooting device is obtained, and an undistorted video frame is extracted from it as the preset corrected video frame image corresponding to that device's camera. Alternatively, a normal image captured in advance by the shooting device is used as the corresponding preset corrected video frame image. In this way, a preset corrected video frame image corresponding to the camera of every shooting device can be obtained.
For each preset corrected video frame image, the image processing system may determine the pixel coordinates of each of its pixels.
Step 302: input the pixel coordinates of each pixel of each preset corrected video frame image into a preset conversion model to obtain the pixel coordinates of each pixel in the distorted video frame image corresponding to that preset corrected video frame image.
Step 303: use the pixel coordinates of each pixel in each distorted video frame image as the independent distortion information corresponding to the respective camera.
The image processing system may be provided with a preset conversion model that converts a preset corrected video frame image into a distorted video frame image: from the pixel coordinates of each pixel in the preset corrected video frame image, it determines and outputs the pixel coordinates of each pixel in the distorted video frame image, for later restoration of the monocular decoded video frame images.
For each shooting device, the pixel coordinates of each pixel of the preset corrected video frame image corresponding to that device's camera are input into the preset conversion model to obtain the pixel coordinates of each pixel in the corresponding distorted video frame image, yielding the independent distortion information for the camera of each shooting device.
In the embodiment of the present application, by describing the pixel-coordinate relationship between the distorted image and the corrected image, the independent distortion information of each camera is determined accurately, providing a reference for the later restoration of each monocular decoded video frame image at the decoding end and ensuring the definition of the output target multi-view video frame image. A sketch of one such conversion model follows.
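The patent does not specify the preset conversion model. As a hedged illustration, the sketch below uses the common Brown-Conrady radial/tangential model to map undistorted pixel coordinates to their distorted positions; the intrinsic matrix K and the coefficients k1, k2, p1, p2 are assumed calibration outputs, not values given by the patent.

    import numpy as np

    def distort_coords(xy_px, K, k1, k2, p1, p2):
        """Map undistorted pixel coordinates (N, 2) to their distorted
        positions under a Brown-Conrady model, an illustrative stand-in
        for the patent's preset conversion model."""
        fx, fy = K[0, 0], K[1, 1]
        cx, cy = K[0, 2], K[1, 2]
        x = (xy_px[:, 0] - cx) / fx  # normalize to camera coordinates
        y = (xy_px[:, 1] - cy) / fy
        r2 = x * x + y * y
        radial = 1 + k1 * r2 + k2 * r2 * r2
        xd = x * radial + 2 * p1 * x * y + p2 * (r2 + 2 * x * x)
        yd = y * radial + p1 * (r2 + 2 * y * y) + 2 * p2 * x * y
        # Back to pixel coordinates: one (x, y) pair per corrected pixel,
        # i.e. the per-camera "independent distortion information".
        return np.stack([xd * fx + cx, yd * fy + cy], axis=1)

Evaluating this mapping for every pixel of the preset corrected video frame image yields exactly the coordinate table that step 303 stores as the camera's independent distortion information.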
Referring to FIG. 4, another flowchart of obtaining camera distortion information according to an embodiment of the present application is shown. The camera distortion information further includes associated distortion information between the cameras, and the process of obtaining the camera distortion information includes the following steps:
Step 401: determine initial calibration parameters between the cameras.
Step 402: determine the associated distortion information between the cameras from the initial calibration parameters and the distortion coefficients.
In a multi-view device, a wide-angle lens, for example, introduces greater distortion at the edges of the image, and there may be differences in angle and distance between the cameras, so the parallax of a multi-view image computed from the monocular images captured by each camera may become inaccurate. Distortion therefore also exists between the cameras of a multi-camera device. In the embodiment of the present application, the cameras may be calibrated individually to obtain initial calibration parameters between the cameras, which describe the differences between the cameras in different dimensions. On the basis of the initial calibration parameters, distortion-coefficient constraints are added to determine the geometric relationships among the cameras and complete the multi-camera calibration.
Optionally, the initial calibration parameters are multiplied by the distortion coefficients to obtain the associated distortion information, as in the sketch below.
Optionally, if there are multiple initial calibration parameters, the distortion coefficients corresponding to them may differ; that is, the associated distortion information between each pair of cameras may be different, so the associated distortion information between the cameras is obtained in a per-pair manner.
In the embodiment of the present application, by determining the associated distortion information between the cameras accurately, a reference is provided for the later restoration of the monocular decoded video frame images at the decoding end, ensuring the definition of the output target multi-view video frame image.
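The patent fixes neither the shape nor the semantics of the initial calibration parameters; the minimal sketch below simply realizes the optional multiplication described above, with all array shapes treated as assumptions.

    import numpy as np

    def associated_distortion(initial_calib_params, distortion_coeffs):
        """Elementwise product of per-pair initial calibration parameters
        and their distortion coefficients, per the optional embodiment.
        The parameter layout is an assumption, not given by the patent."""
        params = np.asarray(initial_calib_params, dtype=float)
        coeffs = np.asarray(distortion_coeffs, dtype=float)
        return params * coeffs  # one row of associated info per camera pair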
In one embodiment, generating a transmission data packet from the plurality of monocular encoded video frame images and camera distortion information includes: encapsulating the plurality of monocular encoded video frame images and the camera distortion information using a network abstraction layer to obtain the transmission data packet.
The network abstraction layer (NAL) is part of video coding standards such as H.264/AVC. The NAL defines the encapsulation format of the data: video data is packed into individual NAL units, which can be adapted to different network environments and transmitted. In the embodiment of the present application, the NAL may be deployed in the image processing system; after a group of monocular encoded video frame images is generated, the camera distortion information is retrieved, and the NAL is used to encapsulate the group of monocular encoded video frame images together with the camera distortion information into a transmission data packet. This ensures that the transmission data packet can be delivered over various network environments, guarantees the stability and security of the transmission process, improves transmission efficiency, and thereby improves the efficiency of multi-view image processing. A simplified sketch follows.
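A hedged sketch of one possible packet layout: real bitstreams would more likely carry the side data in a standard SEI NAL unit and apply emulation-prevention bytes, so the length-prefixed framing here is an assumption chosen only to keep the example short.

    import struct

    START_CODE = b"\x00\x00\x00\x01"  # Annex-B style NAL start code

    def build_packet(encoded_frames, distortion_blob):
        """Prefix the serialized camera distortion information with its
        length, then append each per-camera encoded frame behind a start
        code. Illustrative layout only: no emulation prevention is done,
        so frame payloads must not contain the start code."""
        parts = [struct.pack(">I", len(distortion_blob)), distortion_blob]
        for frame in encoded_frames:
            parts.append(START_CODE)
            parts.append(frame)
        return b"".join(parts)

    def parse_packet(packet):
        """Inverse of build_packet, used at the decoding end."""
        (n,) = struct.unpack(">I", packet[:4])
        blob = packet[4:4 + n]
        frames = packet[4 + n:].split(START_CODE)[1:]
        return blob, frames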
Before the correction, each monocular decoded video frame image may first be preprocessed and registered: each monocular decoded video frame image is preprocessed, the preprocessed images are registered, and the registered images are then used as the monocular decoded video frame images for the subsequent correction. Optionally, the preprocessing may include histogram matching, parallel filtering, and/or enhancement transforms. In this way, preprocessing and image registration remove the effects of noise and of the transmission process, ensuring the image quality of each monocular decoded video frame image. One possible preprocessing pass is sketched below.
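The sketch assumes OpenCV and scikit-image are available: histogram matching against a reference view plus a mild edge-preserving filter stand in for the histogram matching and filtering mentioned above, and the filter parameters are illustrative.

    import cv2
    import numpy as np
    from skimage.exposure import match_histograms

    def preprocess_views(views):
        """Match every decoded view's histogram to the first view, then
        apply a light bilateral filter to suppress noise."""
        ref = views[0]
        out = []
        for img in views:
            matched = match_histograms(img, ref, channel_axis=-1)
            matched = np.clip(matched, 0, 255).astype(np.uint8)
            out.append(cv2.bilateralFilter(matched, 5, 50, 50))
        return out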
The correction process is described below.
Referring to FIG. 5, a flowchart of the correction process according to an embodiment of the present application is shown. Correcting each monocular decoded video frame image obtained by the decoding based on the camera distortion information includes the following steps:
Step 501: perform a primary correction of the distorted pixels in each monocular decoded video frame image according to the independent distortion information corresponding to the respective camera, to obtain a plurality of candidate monocular decoded video frame images.
As described above, the camera of each shooting device corresponds to one piece of independent distortion information. Therefore, for the decoded monocular decoded video frame image corresponding to each shooting device, the coordinates of each pixel are corrected using that device's independent distortion information so as to restore the distorted pixels, yielding a candidate monocular decoded video frame image for each shooting device.
It will be appreciated that the candidate monocular decoded video frame image contains fewer distorted pixels than the monocular decoded video frame image.
Step 502: perform a secondary correction of the distorted pixels of each candidate monocular decoded video frame image according to the associated distortion information, to obtain a plurality of corrected frame images.
After each monocular decoded video frame image has been corrected based on its own independent distortion information, correction must continue based on the associated distortion information between the associated cameras, i.e., the secondary correction. The image processing system continues to correct the distorted pixels of each candidate monocular decoded video frame image according to the associated distortion information, obtaining a corrected frame image for each monocular video frame image. The two-stage correction effectively improves the definition of the resulting target multi-view video frame image; a sketch follows.
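Assuming the independent and associated distortion information have been expanded into dense per-pixel lookup maps (float32 source coordinates for every target pixel, the representation cv2.remap expects), the two-stage correction could be a pair of remaps:

    import cv2

    def correct_view(decoded, indep_map_x, indep_map_y,
                     assoc_map_x, assoc_map_y):
        """Primary correction with the camera's independent distortion
        maps, then secondary correction with the inter-camera associated
        maps. The dense-map representation is an assumption."""
        candidate = cv2.remap(decoded, indep_map_x, indep_map_y,
                              cv2.INTER_LINEAR)
        return cv2.remap(candidate, assoc_map_x, assoc_map_y,
                         cv2.INTER_LINEAR)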
In one embodiment, referring to FIG. 6, a flowchart of the stitching process according to an embodiment of the present application is shown. Generating a target multi-view video frame image from the corrected frame images includes the following steps:
Step 601: input the corrected frame images into the region conversion model to obtain the overlapping regions between the corrected frame images output by the region conversion model.
Step 602: stitch the corrected frame images based on the overlapping regions to obtain the target multi-view video frame image.
Because the distance between the cameras of the shooting devices may be small, overlapping captured regions between the monocular video frame images are unavoidable. If the corrected frame images corresponding to the monocular video frame images were stitched directly, the overlapping region would appear multiple times. To avoid this, the overlapping regions between the corrected frame images must be determined. In the embodiment of the present application, the corrected frame images can be input into the region conversion model to obtain the overlapping regions it outputs. Optionally, the output of the region conversion model may be the coordinates of the pixels in the overlapping region, or the position of the overlapping region within the whole corrected frame image.
Optionally, the corrected frame images may be input into the region conversion model simultaneously to obtain the overlapping region, which may be single or multiple.
After the overlapping regions are determined, the stitching overlays and joins the corrected frame images on their overlapping regions, yielding the target multi-view video frame image; one conventional realization is sketched below.
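The patent leaves the region conversion model unspecified. As a stand-in, the sketch below finds the overlap with ordinary ORB feature matching, estimates a homography, and pastes two corrected views onto a shared canvas; this is one conventional way to realize steps 601-602, not the patent's required method.

    import cv2
    import numpy as np

    def stitch_pair(left, right):
        """Warp `right` into `left`'s frame via an ORB/RANSAC homography
        and paste it onto an enlarged canvas. Feature matching stands in
        for the patent's region conversion model."""
        g1 = cv2.cvtColor(left, cv2.COLOR_BGR2GRAY)
        g2 = cv2.cvtColor(right, cv2.COLOR_BGR2GRAY)
        orb = cv2.ORB_create(2000)
        k1, d1 = orb.detectAndCompute(g1, None)
        k2, d2 = orb.detectAndCompute(g2, None)
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(d1, d2), key=lambda m: m.distance)[:200]
        src = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
        h, w = left.shape[:2]
        canvas = cv2.warpPerspective(right, H, (w * 2, h))
        canvas[:h, :w] = left  # overlap resolved by simple overwrite
        return canvas

A production system would blend the seam rather than overwrite it, but the overlap localization is the part that corresponds to steps 601-602.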
In the embodiment of the present application, the camera-related distortion information can be stored in the transmission data packet while the hardware performance limits of the image processing module, the video encoding module, and the other modules in the image processing system remain unchanged. For video encoding, the stored distortion information does not affect the encoded images, which can still be decoded and processed normally for stitching. Correspondingly, the distortion information is retrieved during decoding, the image stitching is performed at the decoding end according to it, and the target multi-view video frame image can then be displayed. As a result, the definition of the target multi-view video frame image is no longer reduced by hardware limits on image resolution, multi-view image stitching becomes more flexible, the definition of the multi-view image is guaranteed while hardware cost is saved, the frame rate of the multi-view video is maintained during display, and high-quality image processing is achieved.
For ease of understanding, the image stitching method provided by the present application is described below in one complete embodiment. Referring to FIG. 7, the method includes:
step 701, obtaining a plurality of monocular video frame images output by a shooting module in real time, and respectively performing coding processing on each monocular video frame image by using a coding module to obtain a plurality of monocular coded video frame images.
Step 702, obtaining independent distortion information and associated distortion information corresponding to different cameras in the shooting module as camera distortion information.
And step 703, packaging the plurality of monocular coded video frame images and the camera distortion information by using a network abstraction layer to obtain a transmission data packet, and sending the transmission data packet to a decoding module.
And step 704, decoding each monocular coded video frame image in the transmission data packet by using a decoding module to obtain each monocular decoded video frame image.
Step 705, primarily correcting distorted pixel points in the corresponding monocular decoded video frame images according to the independent distortion information respectively corresponding to the different cameras, so as to obtain a plurality of candidate monocular decoded video frame images.
And step 706, performing secondary correction on the distorted pixel points of each candidate monocular decoded video frame image according to the associated distortion information to obtain a plurality of corrected frame images after correction processing.
Step 707, inputting each correction frame image into the region conversion model, and obtaining an overlapping region between each correction frame image output by the region conversion model.
Step 708, performing stitching processing on each corrected frame image based on the overlapping area, so as to obtain a target multi-view video frame image.
Step 709, save and/or display the target multi-view video frame image.
In the embodiment of the present application, modules such as the image processing module and the video encoding module do not stitch at the picture level; instead, the monocular image is used as the basic unit of encoding, and the stitching is finally performed algorithmically at the decoding end. Stitching binocular or multi-view images is thus freed from the hardware limitation, which reduces the cost of use, releases more resources, and uses them rationally, guaranteeing definition without increasing hardware cost. For example, to produce an image whose resolution after binocular stitching is 8K, the present application only needs hardware supporting 4K image processing and 4K video encoding: the two monocular images are stitched at the decoding end to obtain the binocular image (e.g., two 3840×2160 frames stitched side by side give a 7680×2160 image).
It should be understood that, although the steps in the flowcharts of the above embodiments are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which are not necessarily executed at the same time but may be executed at different times; their execution order is not necessarily sequential, and they may be executed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, an embodiment of the present application also provides an image processing apparatus for implementing the above image processing method. The implementation of the solution provided by the apparatus is similar to the implementation described in the above method, so for the specific limitations of one or more embodiments of the image processing apparatus provided below, reference may be made to the limitations of the image processing method above; they are not repeated here.
In one embodiment, as shown in FIG. 8, an image processing apparatus 800 is provided, including an acquisition module 801, a generating module 802, and a stitching module 803, wherein:
the acquisition module 801 is configured to acquire a plurality of monocular video frame images output by the shooting module in real time, and to encode each monocular video frame image separately using the encoding module to obtain a plurality of monocular encoded video frame images;
the generating module 802 is configured to generate a transmission data packet from the plurality of monocular encoded video frame images and camera distortion information, and to send the transmission data packet to the decoding module;
and the stitching module 803 is configured to decode each monocular encoded video frame image in the transmission data packet using the decoding module, correct each monocular decoded video frame image obtained by the decoding based on the camera distortion information, and generate a target multi-view video frame image from the corrected frame images.
In one embodiment, the camera distortion information includes independent distortion information corresponding to each of the different cameras in the shooting module, and the apparatus further includes an information obtaining module configured to: acquire preset corrected video frame images corresponding to the different cameras, and determine the pixel coordinates of each pixel in each preset corrected video frame image; input the pixel coordinates of each pixel of each preset corrected video frame image into a preset conversion model to obtain the pixel coordinates of each pixel in the distorted video frame image corresponding to that preset corrected video frame image; and use the pixel coordinates of each pixel in each distorted video frame image as the independent distortion information corresponding to the respective camera.
In one embodiment, the camera distortion information further includes associated distortion information between the cameras, and the information obtaining module is further configured to: determine initial calibration parameters between the cameras; and determine the associated distortion information between the cameras from the initial calibration parameters and the distortion coefficients.
In one embodiment, the generating module 802 is specifically configured to: encapsulate the plurality of monocular encoded video frame images and the camera distortion information using a network abstraction layer to obtain the transmission data packet.
In one embodiment, the stitching module 803 is specifically configured to: perform a primary correction of the distorted pixels in each monocular decoded video frame image according to the independent distortion information corresponding to the respective camera to obtain a plurality of candidate monocular decoded video frame images; and perform a secondary correction of the distorted pixels of each candidate monocular decoded video frame image according to the associated distortion information to obtain a plurality of corrected frame images.
In one embodiment, the stitching module 803 is specifically configured to: input the corrected frame images into a region conversion model to obtain the overlapping regions between the corrected frame images output by the region conversion model; and stitch the corrected frame images based on the overlapping regions to obtain the target multi-view video frame image.
The modules in the above image processing apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware form in, or independent of, a processor in the computer device, or may be stored in software form in a memory in the computer device, so that the processor can call and execute the operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in FIG. 9. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store image processing data. The input/output interface of the computer device is used to exchange information between the processor and an external device. The communication interface of the computer device is used to communicate with an external terminal through a network connection. The computer program, when executed by the processor, implements an image processing method.
It will be appreciated by persons skilled in the art that the structure shown in FIG. 9 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution of the present application is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, an image processing system is provided, the image processing system including a shooting module, an encoding module, and a decoding module; the image processing system is configured to implement the steps of the method in the above embodiments.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein; the processor, when executing the computer program, implements the following steps:
acquiring a plurality of monocular video frame images output by the shooting module in real time, and encoding each monocular video frame image separately using the encoding module to obtain a plurality of monocular encoded video frame images;
generating a transmission data packet from the plurality of monocular encoded video frame images and camera distortion information, and sending the transmission data packet to the decoding module;
and, after decoding each monocular encoded video frame image in the transmission data packet using the decoding module, correcting each monocular decoded video frame image obtained by the decoding based on the camera distortion information, and generating a target multi-view video frame image from the corrected frame images.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
acquiring preset corrected video frame images corresponding to the different cameras, and determining the pixel coordinates of each pixel in each preset corrected video frame image; inputting the pixel coordinates of each pixel of each preset corrected video frame image into a preset conversion model to obtain the pixel coordinates of each pixel in the distorted video frame image corresponding to that preset corrected video frame image; and using the pixel coordinates of each pixel in each distorted video frame image as the independent distortion information corresponding to the respective camera.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
determining initial calibration parameters between the cameras; and determining the associated distortion information between the cameras from the initial calibration parameters and the distortion coefficients.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
encapsulating the plurality of monocular encoded video frame images and the camera distortion information using a network abstraction layer to obtain the transmission data packet.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
performing a primary correction of the distorted pixels in each monocular decoded video frame image according to the independent distortion information corresponding to the respective camera to obtain a plurality of candidate monocular decoded video frame images; and performing a secondary correction of the distorted pixels of each candidate monocular decoded video frame image according to the associated distortion information to obtain a plurality of corrected frame images.
In one embodiment, the processor, when executing the computer program, further implements the following steps:
inputting the corrected frame images into a region conversion model to obtain the overlapping regions between the corrected frame images output by the region conversion model; and stitching the corrected frame images based on the overlapping regions to obtain the target multi-view video frame image.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon which, when executed by a processor, implements the following steps:
acquiring a plurality of monocular video frame images output by the shooting module in real time, and encoding each monocular video frame image separately using the encoding module to obtain a plurality of monocular encoded video frame images;
generating a transmission data packet from the plurality of monocular encoded video frame images and camera distortion information, and sending the transmission data packet to the decoding module;
and, after decoding each monocular encoded video frame image in the transmission data packet using the decoding module, correcting each monocular decoded video frame image obtained by the decoding based on the camera distortion information, and generating a target multi-view video frame image from the corrected frame images.
In one embodiment, the computer program when executed by the processor further performs the steps of:
acquiring preset correction video frame images corresponding to different cameras, and determining pixel coordinates of each pixel point of each preset correction video frame image; respectively inputting pixel coordinates of each pixel point of each preset correction video frame image into a preset conversion model to obtain pixel coordinates of each pixel point in the distorted video frame image corresponding to each preset correction video frame image; and taking pixel coordinates of each pixel point in each distorted video frame image as independent distortion information corresponding to each camera.
In one embodiment, the computer program, when executed by the processor, further performs the steps of:
determining initial calibration parameters among the cameras; and determining the associated distortion information among the cameras according to the initial calibration parameters and the distortion coefficients.
In one embodiment, the computer program, when executed by the processor, further performs the steps of:
encapsulating the plurality of monocular encoded video frame images and the camera distortion information by using a network abstraction layer to obtain the transmission data packet.
In one embodiment, the computer program, when executed by the processor, further performs the steps of:
performing primary correction on the distorted pixels in each corresponding monocular decoded video frame image according to the independent distortion information corresponding to the different cameras, to obtain a plurality of candidate monocular decoded video frame images; and performing secondary correction on the distorted pixels of each candidate monocular decoded video frame image according to the associated distortion information, to obtain a plurality of corrected frame images.
In one embodiment, the computer program, when executed by the processor, further performs the steps of:
inputting each corrected frame image into a region conversion model to obtain the overlapping regions between the corrected frame images output by the region conversion model; and stitching the corrected frame images based on the overlapping regions to obtain the target multi-view video frame image.
In one embodiment, a computer program product is provided comprising a computer program which, when executed by a processor, performs the steps of:
acquiring a plurality of monocular video frame images output by the shooting module in real time, and encoding each monocular video frame image using the encoding module to obtain a plurality of monocular encoded video frame images;
generating a transmission data packet according to the plurality of monocular encoded video frame images and the camera distortion information, and sending the transmission data packet to the decoding module;
and after decoding each monocular encoded video frame image in the transmission data packet using the decoding module, correcting each monocular decoded video frame image obtained by the decoding based on the camera distortion information, and generating a target multi-view video frame image based on the corrected frame images.
In one embodiment, the computer program, when executed by the processor, further performs the steps of:
acquiring preset correction video frame images corresponding to the different cameras, and determining the pixel coordinates of each pixel in each preset correction video frame image; inputting the pixel coordinates of the pixels of each preset correction video frame image into a preset conversion model to obtain the pixel coordinates of each pixel in the distorted video frame image corresponding to that preset correction video frame image; and taking the pixel coordinates of the pixels in each distorted video frame image as the independent distortion information corresponding to each camera.
In one embodiment, the computer program, when executed by the processor, further performs the steps of:
determining initial calibration parameters among the cameras; and determining the associated distortion information among the cameras according to the initial calibration parameters and the distortion coefficients.
In one embodiment, the computer program, when executed by the processor, further performs the steps of:
encapsulating the plurality of monocular encoded video frame images and the camera distortion information by using a network abstraction layer to obtain the transmission data packet.
In one embodiment, the computer program, when executed by the processor, further performs the steps of:
performing primary correction on the distorted pixels in each corresponding monocular decoded video frame image according to the independent distortion information corresponding to the different cameras, to obtain a plurality of candidate monocular decoded video frame images; and performing secondary correction on the distorted pixels of each candidate monocular decoded video frame image according to the associated distortion information, to obtain a plurality of corrected frame images.
In one embodiment, the computer program, when executed by the processor, further performs the steps of:
inputting each corrected frame image into a region conversion model to obtain the overlapping regions between the corrected frame images output by the region conversion model; and stitching the corrected frame images based on the overlapping regions to obtain the target multi-view video frame image.
Those skilled in the art will appreciate that all or part of the methods described above may be implemented by a computer program stored on a non-transitory computer readable storage medium; when executed, the program may perform the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. Non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. Volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM is available in many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided in the present application may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, data processing logic units based on quantum computing, and the like, but are not limited thereto.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, any combination of technical features that contains no contradiction should be considered within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application, and their detailed description should not be construed as limiting the scope of the application. It should be noted that those skilled in the art may make several variations and modifications without departing from the spirit of the application, and all such variations and modifications fall within its scope. Accordingly, the scope of the application should be determined by the appended claims.

Claims (10)

1. An image processing method, characterized in that the method is applied to an image processing system comprising a shooting module, an encoding module and a decoding module, and the method comprises the following steps:
acquiring a plurality of monocular video frame images output by the shooting module in real time, and encoding each monocular video frame image using the encoding module to obtain a plurality of monocular encoded video frame images;
generating a transmission data packet according to the plurality of monocular encoded video frame images and camera distortion information, and sending the transmission data packet to the decoding module;
and after decoding each monocular encoded video frame image in the transmission data packet using the decoding module, correcting each monocular decoded video frame image obtained by the decoding based on the camera distortion information, and generating a target multi-view video frame image based on the corrected frame images.
2. The method according to claim 1, wherein the camera distortion information comprises independent distortion information corresponding to the different cameras in the shooting module, and the process of obtaining the camera distortion information comprises:
acquiring preset correction video frame images corresponding to the different cameras, and determining the pixel coordinates of each pixel in each preset correction video frame image;
inputting the pixel coordinates of the pixels of each preset correction video frame image into a preset conversion model to obtain the pixel coordinates of each pixel in the distorted video frame image corresponding to that preset correction video frame image;
and taking the pixel coordinates of the pixels in each distorted video frame image as the independent distortion information corresponding to each camera.
3. The method according to claim 2, wherein the camera distortion information further comprises associated distortion information among the cameras, and the process of obtaining the camera distortion information further comprises:
determining initial calibration parameters among the cameras;
and determining the associated distortion information among the cameras according to the initial calibration parameters and the distortion coefficients.
4. The method according to claim 1, wherein generating a transmission data packet according to the plurality of monocular encoded video frame images and camera distortion information comprises:
encapsulating the plurality of monocular encoded video frame images and the camera distortion information by using a network abstraction layer to obtain the transmission data packet.
5. The method according to claim 3, wherein correcting each monocular decoded video frame image obtained by the decoding based on the camera distortion information comprises:
performing primary correction on the distorted pixels in each corresponding monocular decoded video frame image according to the independent distortion information corresponding to the different cameras, to obtain a plurality of candidate monocular decoded video frame images;
and performing secondary correction on the distorted pixels of each candidate monocular decoded video frame image according to the associated distortion information, to obtain a plurality of corrected frame images.
6. The method according to claim 1, wherein generating a target multi-view video frame image based on the corrected frame images comprises:
inputting each corrected frame image into a region conversion model to obtain the overlapping regions between the corrected frame images output by the region conversion model;
and stitching the corrected frame images based on the overlapping regions to obtain the target multi-view video frame image.
7. An image processing apparatus for use in an image processing system comprising a shooting module, an encoding module and a decoding module, the apparatus comprising:
an acquisition module, configured to acquire a plurality of monocular video frame images output by the shooting module in real time, and to encode each monocular video frame image using the encoding module to obtain a plurality of monocular encoded video frame images;
a generation module, configured to generate a transmission data packet according to the plurality of monocular encoded video frame images and camera distortion information, and to send the transmission data packet to the decoding module;
and a stitching module, configured to decode each monocular encoded video frame image in the transmission data packet using the decoding module, correct each monocular decoded video frame image obtained by the decoding based on the camera distortion information, and generate a target multi-view video frame image based on the corrected frame images.
8. An image processing system, characterized in that the image processing system comprises a shooting module, an encoding module and a decoding module, and is adapted to implement the steps of the method of any one of claims 1 to 6.
9. A computer readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1 to 6.
CN202311036127.4A 2023-08-16 2023-08-16 Image processing method, apparatus, system, storage medium, and program product Pending CN116962707A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311036127.4A CN116962707A (en) 2023-08-16 2023-08-16 Image processing method, apparatus, system, storage medium, and program product

Publications (1)

Publication Number Publication Date
CN116962707A true CN116962707A (en) 2023-10-27



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination