CN113506320B - Image processing method and device, electronic equipment and storage medium - Google Patents

Image processing method and device, electronic equipment and storage medium

Info

Publication number
CN113506320B
Authority
CN
China
Prior art keywords
image
dynamic visual
network
images
feature
Prior art date
Legal status
Active
Application number
CN202110799900.7A
Other languages
Chinese (zh)
Other versions
CN113506320A (en)
Inventor
施路平
杨哲宇
赵蓉
Current Assignee
Tsinghua University
Original Assignee
Tsinghua University
Priority date
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202110799900.7A
Publication of CN113506320A
Application granted
Publication of CN113506320B


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10024Color image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to an image processing method and apparatus, an electronic device, and a storage medium. The method includes: dividing the dynamic visual information in a first time period of a preset scene according to a first time interval to generate a plurality of dynamic visual images of the preset scene; and inputting a first color image and the dynamic visual images in the first time period into an image generation network to obtain second color images. According to the image processing method of the embodiments of the present disclosure, the dynamic visual information in the first time period can be divided to generate a plurality of dynamic visual images, so that a plurality of dynamic visual images, i.e., time-dimension information, are retained within the first time period. This makes the movement track of a target in the preset scene clearer and facilitates tracking the trajectory of a moving object.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Unlike ordinary cameras, which use a shutter to control the frame rate and record the absolute light intensity of each frame, dynamic vision receptors (Dynamic Visual Receptors, DVS) are sensitive to the rate of change of light intensity. Each pixel records the amount of change in the logarithm of the light intensity at that pixel location, and when the amount of change exceeds a threshold, a positive or negative pulse is generated. Dynamic vision receptors have a higher frame rate than conventional cameras, and their sensitivity to the rate of change together with their high frame rate can be used to monitor moving objects.
In the related art, video reconstruction may be performed using the dynamic visual information collected by a dynamic vision receptor, i.e., video frames are generated from the dynamic visual information. However, because dynamic vision receptors report intensity changes in the form of asynchronous events, the spatiotemporal coordinates at which intensity changes are delivered are naturally sparse: the pixels in each frame of dynamic visual information are sparse and the amount of information is small, so it is difficult to perform video reconstruction from single-frame dynamic visual information. Therefore, all dynamic visual information between two video frames may be synthesized into one dynamic visual frame, and a video frame is then generated from that dynamic visual frame by a convolutional neural network for image generation or by an image generation network obtained through countermeasure training.
However, the image information of such a dynamic visual frame (e.g., color information and shape information) is insufficient, and the generated image may be distorted or over-fitted. In addition, synthesizing all dynamic visual information between two video frames into one dynamic visual frame directly loses the time-dimension information, so the high-frame-rate advantage of the dynamic vision receptor is not fully utilized; in the end only the light intensity change between the two video frames is obtained, making it difficult to track the action and motion trajectory of an object between the two video frames.
Disclosure of Invention
The disclosure provides an image processing method and device, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided an image processing method including: dividing dynamic visual information in a first time period of a preset scene according to a first time interval to generate a plurality of dynamic visual images of the preset scene, wherein the first time interval is smaller than a second time interval, and the second time interval is a time interval when a pixel acquisition device acquires a first color image of the preset scene; inputting a first color image and a dynamic visual image in a first time period into an image generation network for processing to obtain second color images respectively corresponding to the dynamic visual images in the first time period; the image generation network is a neural network obtained through authenticity countermeasure training, and the authenticity countermeasure training is used for training the authenticity of the image generated by the image generation network.
In one possible implementation manner, the dividing the dynamic visual information of the preset scene according to the first time interval, and generating a plurality of dynamic visual images of the preset scene includes: dividing the plurality of dynamic visual information in the first time period according to the first time interval to obtain a plurality of dynamic visual information groups; and respectively carrying out fusion processing on the dynamic visual information in the dynamic visual information groups to obtain dynamic visual images corresponding to the dynamic visual information groups.
In one possible implementation, the image generation network includes a first feature extraction sub-network for extracting timing feature information of the dynamic visual image and a second feature extraction sub-network for extracting image feature information of the color image.
In one possible implementation manner, the inputting of the first color image and the dynamic visual images in the first time period into the image generation network for processing to obtain the second color images corresponding to the dynamic visual images in the first time period includes: inputting the plurality of dynamic visual images into a first feature extraction sub-network for processing to respectively obtain first feature images of the plurality of dynamic visual images; inputting the first color image into a second feature extraction sub-network for processing to obtain a second feature image; performing feature fusion processing on the first feature image and the second feature image to obtain a third feature image corresponding to the first feature image; and performing image reconstruction processing according to the third feature map to obtain the second color image.
In one possible implementation, the method further includes: inputting a plurality of first sample dynamic visual images and first sample color images into an image generation network for processing to obtain second sample color images, wherein the first sample dynamic visual images are images fused by sample dynamic visual information acquired in a second time period, and the first sample color images are color images acquired in the second time period; and carrying out authenticity countermeasure training on the image generation network according to the first sample color image, the second sample color image and the discrimination network to obtain a trained image generation network.
In one possible implementation, according to the first sample color image, the second sample color image, and a discrimination network, performing an authenticity countermeasure training on the image generation network, and obtaining a trained image generation network includes: inputting the first sample color image or the second sample color image into a discrimination network to obtain a discrimination result; determining discrimination loss according to the discrimination result; adjusting network parameters of the discrimination network and the image generation network according to the discrimination loss; and under the condition that the image generation network and the judging network meet training conditions, obtaining the trained image generation network and the judging network.
According to an aspect of the present disclosure, there is provided an image processing apparatus including: a segmentation module, configured to divide dynamic visual information in a first time period of a preset scene according to a first time interval to generate a plurality of dynamic visual images of the preset scene, wherein the first time interval is smaller than a second time interval, and the second time interval is the time interval at which a pixel acquisition device acquires first color images of the preset scene; and a generation module, configured to input the first color image and the dynamic visual images in the first time period into an image generation network for processing to obtain second color images respectively corresponding to the dynamic visual images in the first time period; the image generation network is a neural network obtained through authenticity countermeasure training, and the authenticity countermeasure training is used for training the authenticity of the images generated by the image generation network.
In one possible implementation, the segmentation module is further configured to: dividing the plurality of dynamic visual information in the first time period according to the first time interval to obtain a plurality of dynamic visual information groups; and respectively carrying out fusion processing on the dynamic visual information in the dynamic visual information groups to obtain dynamic visual images corresponding to the dynamic visual information groups.
In one possible implementation, the image generation network includes a first feature extraction sub-network for extracting timing feature information of the dynamic visual image and a second feature extraction sub-network for extracting image feature information of the color image.
In one possible implementation, the generating module is further configured to: inputting the plurality of dynamic visual images into a first feature extraction sub-network for processing to respectively obtain first feature images of the plurality of dynamic visual images; inputting the first color image into a second feature extraction sub-network for processing to obtain a second feature image; performing feature fusion processing on the first feature image and the second feature image to obtain a third feature image corresponding to the first feature image; and performing image reconstruction processing according to the third feature map to obtain the second color image.
In one possible implementation, the apparatus further includes: the training module is used for inputting a plurality of first sample dynamic visual images and first sample color images into the image generation network for processing to obtain second sample color images, wherein the first sample dynamic visual images are images fused by sample dynamic visual information acquired in a second time period, and the first sample color images are color images acquired in the second time period; and carrying out authenticity countermeasure training on the image generation network according to the first sample color image, the second sample color image and the discrimination network to obtain a trained image generation network.
In one possible implementation, the training module is further configured to: inputting the first sample color image or the second sample color image into a discrimination network to obtain a discrimination result; determining discrimination loss according to the discrimination result; adjusting network parameters of the discrimination network and the image generation network according to the discrimination loss; and under the condition that the image generation network and the judging network meet training conditions, obtaining the trained image generation network and the judging network.
According to an aspect of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure. Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the technical aspects of the disclosure.
FIG. 1 shows a flow chart of an image processing method according to an embodiment of the present disclosure;
FIGS. 2A and 2B are schematic diagrams illustrating dynamic visual information and color images according to embodiments of the present disclosure;
FIG. 3 shows a schematic diagram of dynamic visual information according to an embodiment of the present disclosure;
FIG. 4 shows a schematic diagram of an image generation network according to an embodiment of the present disclosure;
fig. 5 shows an application schematic of an image processing method according to an embodiment of the present disclosure;
fig. 6 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 7 illustrates a block diagram of an electronic device according to an embodiment of the disclosure;
fig. 8 shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the disclosure will be described in detail below with reference to the drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
Furthermore, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits well known to those skilled in the art have not been described in detail in order not to obscure the present disclosure.
Fig. 1 shows a flowchart of an image processing method according to an embodiment of the present disclosure, as shown in fig. 1, the method including:
in step S11, dividing the dynamic visual information in a first time period of a preset scene according to a first time interval, and generating a plurality of dynamic visual images of the preset scene, wherein the first time interval is smaller than a second time interval, and the second time interval is a time interval when a pixel acquisition device acquires a first color image of the preset scene;
in step S12, inputting the first color image and the dynamic visual image in the first period of time into an image generating network for processing, so as to obtain second color images respectively corresponding to the dynamic visual images in the first period of time; the image generation network is a neural network obtained through authenticity countermeasure training, and the authenticity countermeasure training is used for training the authenticity of the image generated by the image generation network.
According to the image processing method of the embodiments of the present disclosure, the plurality of dynamic visual information in the first time period can be divided to generate a plurality of dynamic visual images, so that a plurality of dynamic visual images, i.e., information of the time dimension, are retained within the first time period. This makes the movement track of the target in the preset scene clearer and facilitates tracking the trajectory of the moving object. In addition, the first color image and the dynamic visual images can be used to generate the second color images, and the image generation network used for generating the images has undergone authenticity countermeasure training, so the generated second color images have higher authenticity and the possibility of image distortion is reduced.
In a possible implementation manner, the image processing method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, etc., and the method may be implemented by a processor invoking computer readable instructions stored in a memory. Alternatively, the method may be performed by a server.
In one possible implementation, the dynamic vision receptors (Dynamic visual receptors, DVS) are sensitive to the rate of change of light intensity, and each pixel can record the amount of change in light intensity at that pixel location, and when the amount of change exceeds a threshold, a positive or negative pulse, i.e., dynamic visual information, is generated.
For example, an Event Camera (Event Camera) is a dynamic vision receptor that can be used to obtain the rate of change of light intensity of a preset scene. When a target in a preset scene is abnormal or performs certain actions, the light intensity of the target in the event camera can change to a certain extent, and the event camera can acutely capture the change to obtain dynamic visual information.
In one possible implementation, the frame rate of the dynamic vision receptor is higher than that of a normal camera or webcam, e.g., the frame rate of a camera or a conventional webcam is about 100fps, while the frame rate of the dynamic vision receptor is about 1,000,000fps. Therefore, in the time interval between the photographing of two frames of images by a common camera or a video camera, multiple frames of dynamic visual information can be photographed.
In one possible implementation, the amount of information in single-frame dynamic visual information is small and the pixel data are sparse, so it is difficult to generate an image from a single frame of dynamic visual information. However, fusing all dynamic visual information between two frames of images into one dynamic visual frame and then generating an image loses the time-dimension information of the multi-frame dynamic visual information, making it difficult to track the trajectory of a target in the preset scene.
Fig. 2A and 2B are schematic diagrams of dynamic visual information and a color image according to an embodiment of the present disclosure, obtained by photographing the same scene with a dynamic vision receptor and a pixel acquisition device (e.g., a camera or a video camera), respectively. The dynamic visual information consists of the pulses generated when the amount of change exceeds the threshold; the amount of information in a single frame of dynamic visual information is small, the pixel data are sparse, and some image information, such as color information, is missing compared with the color image in Fig. 2B. However, the frame rate at which the dynamic vision receptor captures dynamic visual information is higher than the frame rate at which the pixel acquisition device captures color images.
To address the above problems, the dynamic visual information in the first time period may be divided according to the first time interval. That is, the plurality of dynamic visual information in the first time period is grouped at smaller time intervals to preserve part of the time-dimension information.
In an example, the length of the first time period may be equal to a second time interval between the pixel acquisition device (e.g., a camera or a video camera) acquiring two frames of the first color image (e.g., an image or a video frame) of the preset scene, or may be a time period between the acquisition of multiple frames of the first color image of the preset scene. That is, the start-stop time of the first period may be the time when the first color image is acquired.
In another example, the start-stop time of the first period may not be the time when the first color image is acquired, and the length of the first period may be smaller than the period between two frames of the first color image acquired by the pixel acquisition device, and only one frame of the first color image needs to be acquired in the first period. The present disclosure does not limit the length of the first time period and the starting time. For example, the start time of the first period may be before one frame of the first color image is captured, and the end time of the first period may be after one frame of the first color image is captured, and does not necessarily coincide with the time when the first color image is captured.
In an example, the first time interval for dividing the dynamic visual information is smaller than the first time period, i.e., the dynamic visual information acquired in the first time period may be divided into a plurality of groups according to the first time interval. Each set of dynamic visual information may generate one dynamic visual image, so that a plurality of dynamic visual images may be obtained in the first period of time, instead of just one dynamic visual image, the information of the time dimension may be retained in the above manner, i.e., dynamic visual images at a plurality of moments are generated in the first period of time.
In one possible implementation manner, the dividing the dynamic visual information of the preset scene according to the first time interval, and generating a plurality of dynamic visual images of the preset scene includes: dividing the plurality of dynamic visual information in the first time period according to the first time interval to obtain a plurality of dynamic visual information groups; and respectively carrying out fusion processing on the dynamic visual information in the dynamic visual information groups to obtain dynamic visual images corresponding to the dynamic visual information groups.
In one possible implementation, the length of the first time period is longer than the first time interval, and the dynamic visual information in the first time period may be divided into a plurality of dynamic visual information groups according to the first time interval, where each dynamic visual information group may include a plurality of pieces of dynamic visual information. A dynamic visual image can then be obtained by fusing the plurality of pieces of dynamic visual information in each dynamic visual information group.
In an example, multiple dynamic visual information in each dynamic visual information group may be fused into the first dynamic visual information within the group, i.e., the obtained dynamic visual image retains the temporal information of the first dynamic visual information of the dynamic visual information group. Alternatively, a plurality of dynamic visual information in each dynamic visual information group may be fused into the last dynamic visual information in the group, i.e. the obtained dynamic visual image retains the time information of the last dynamic visual information of the dynamic visual information group. The dynamic visual information can also be fused into other dynamic visual information in the group, and the fusion mode is not limited by the present disclosure.
Fig. 3 illustrates a schematic diagram of dynamic visual information according to an embodiment of the present disclosure, and as illustrated in fig. 3, X-axis and Y-axis are coordinate axes of pixel positions in dynamic visual information of each frame, and Z-axis is time axis. In an example, the time units may be milliseconds or microseconds, etc., and the present disclosure is not limited to time units. In an example, the first time period starts at 0 ms and ends at 12 ms, the first time interval is 2 ms, i.e., the first time period may be divided into 6 groups of dynamic visual information, wherein dynamic visual information between 0 ms and 2 ms may be divided into one group, dynamic visual information between 2 ms and 4 ms may be divided into one group, dynamic visual information between 4 ms and 6 ms may be divided into one group, dynamic visual information between 6 ms and 8 ms may be divided into one group, dynamic visual information between 8 ms and 10 ms may be divided into one group, and dynamic visual information between 10 ms and 12 ms may be divided into one group.
Each group may include a plurality of dynamic visual information, the dynamic visual information in the group may be subjected to fusion processing, for example, the plurality of dynamic visual information in the group may be fused into the dynamic visual information of the first frame in the group to obtain dynamic visual images corresponding to each dynamic visual information group, the time information of the dynamic visual images may be determined as the starting time of the dynamic visual information group, for example, the time information of the dynamic visual images corresponding to the dynamic visual information group of 0 ms-2 ms is 0 ms, the time information of the dynamic visual images corresponding to the dynamic visual information group of 2 ms-4 ms is 2 ms, the time information of the dynamic visual images corresponding to the dynamic visual information group of 4 ms-6 ms is 4 ms, the time information of the dynamic visual images corresponding to the dynamic visual information group of 6 ms-8 ms is 6 ms, the time information of the dynamic visual images corresponding to the dynamic visual information group of 8 ms-10 ms is 8 ms, and the time information of the dynamic visual images corresponding to the dynamic information group of 10 ms-12 ms is 10 ms. The present disclosure does not limit the length of the first time period and the first time interval, nor does it limit the fusion manner.
In this way, the plurality of dynamic visual information in the first time period can be divided into a plurality of groups, so that the fused dynamic visual image retains the time information of the dynamic visual information groups, and the track of the moving object in the preset scene is beneficial to tracking.
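The grouping and fusion described above can be sketched in a few lines of Python. The event layout (timestamp, x, y, polarity), the accumulation of pulse polarities as the fusion operation, and all names are illustrative assumptions; the embodiments do not prescribe a particular data format.

```python
import numpy as np

def events_to_frames(events, t_start, t_end, interval, height, width):
    """Divide DVS events in [t_start, t_end) into groups of length `interval`
    and fuse each group into one dynamic visual image.

    events: array of shape (N, 4) with columns (t, x, y, polarity), where
            polarity is +1 or -1 (an assumed layout).
    Returns an array of shape (num_groups, height, width); frame g keeps the
    start time t_start + g * interval of its dynamic visual information group.
    """
    num_groups = int((t_end - t_start) // interval)    # e.g. 12 ms / 2 ms = 6 groups
    frames = np.zeros((num_groups, height, width), dtype=np.float32)
    for t, x, y, p in events:
        if not (t_start <= t < t_end):
            continue                                    # event outside the first time period
        g = int((t - t_start) // interval)              # index of its dynamic visual information group
        frames[g, int(y), int(x)] += p                  # fuse: accumulate pulses within the group
    return frames
```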
In one possible implementation, the color images may be generated based on the dynamic visual images in the first period, and since the number of dynamic visual images in the first period is greater than the number of color images in the first period, generating the color images by the dynamic visual images may increase the number of color images, shorten the time interval between the color images, and facilitate tracking of the target in the preset scene.
In one possible implementation, the dynamic visual frame needs to be downsampled in the image generation process, and the feature information after downsampling is easily distorted due to the sparsity and high noise of the dynamic visual frame. The first color image and the dynamic visual image in the first time period can be processed together, so that not only the time dimension information of the dynamic visual image but also the image information (such as color information, outline information of an object in the image and the like) of the first color image can be obtained, the reality degree of the generated image is improved, and the distortion is reduced.
In one possible implementation, the image generation network includes a first feature extraction sub-network for extracting timing feature information of the dynamic visual image and a second feature extraction sub-network for extracting image feature information of the color image. In the process of generating the color image through the image generation network, time sequence characteristic information of a plurality of dynamic visual images can be extracted through a first characteristic extraction sub-network, for example, the first characteristic extraction sub-network can be a cyclic neural network (Recurrent Neural Network, RNN), downsampling processing such as convolution is not needed through convolution kernels, and the problem of image distortion after downsampling caused by sparseness of the dynamic visual images can be reduced. The second feature extraction sub-network may be a convolutional neural network (Convolutional Neural Networks, CNN) that may be used to extract image feature information (e.g., color information, contour information, etc.) of the first color image. Through the combination of the first characteristic extraction sub-network and the second characteristic extraction sub-network, more complete characteristic information can be obtained, namely, the time sequence characteristic information and the image characteristic information are included, so that the generated image has higher reality degree while having accurate time dimension information.
In one possible implementation, step S12 may include: inputting the plurality of dynamic visual images into a first feature extraction sub-network for processing to respectively obtain first feature images of the plurality of dynamic visual images; inputting the first color image into a second feature extraction sub-network for processing to obtain a second feature image; performing feature fusion processing on the first feature image and the second feature image to obtain a third feature image corresponding to the first feature image; and performing image reconstruction processing according to the third feature map to obtain the second color image.
Fig. 4 illustrates a schematic diagram of an image generation network according to an embodiment of the present disclosure, as illustrated in fig. 4, a first feature extraction sub-network in the image generation network may be a recurrent neural network, and a plurality of dynamic visual images may be processed to obtain a first feature map, which may include timing feature information of each dynamic visual image, for example, may include position information and/or pose information of a target in a preset scene at various times within a first period of time, and the like.
In one possible implementation, the second feature extraction sub-network in the image generation network may be a convolutional neural network, which may be used to extract image features of the first color image, e.g., features such as color and outline. In an example, if only one frame of the first color image is included in the first time period, that first color image may be input into the second feature extraction sub-network. If a plurality of frames of the first color image are included in the first time period, only one frame of the first color image may be input. If the start and end times of the first time period are the times at which two frames of the first color image are obtained, the two frames of the first color image obtained at the start and end times may both be input into the second feature extraction sub-network, or only one of them may be input. The present disclosure does not limit the number of input first color images.
In one possible implementation, after obtaining the plurality of first feature maps and the second feature maps, the first feature maps and the second feature maps may be fused. In an example, the first feature map is a feature map corresponding to each of the plurality of dynamic visual images, and the information included in the first feature map includes timing information, position information, and the like. The information contained in the second feature map includes color information, contour information, and the like. The color information, the contour information and other information in the second feature map can be respectively fused into the first feature map, namely, each first feature map also has the color information, the contour information and other information, and a third feature map corresponding to each dynamic visual image is obtained, namely, a plurality of feature maps simultaneously having the time sequence information, the position information, the color information, the contour information and other information.
In one possible implementation, the third feature maps may be subjected to an image reconstruction process, for example, a decoding process such as upsampling, deconvolution, or the like may be performed on each of the third feature maps, to obtain a second color image corresponding to each of the dynamic visual images.
In this way, the time sequence feature information and the image feature information can be obtained through the first feature extraction sub-network and the second feature extraction sub-network respectively, so that the image distortion problem caused by sparseness of pixel information in the dynamic visual image is reduced.
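The structure described above, a recurrent first feature extraction sub-network for timing features, a convolutional second sub-network for image features, feature fusion, and image reconstruction, can be illustrated with a minimal PyTorch sketch. The convolutional GRU cell, the channel sizes, and the single-scale decoder are assumptions made for illustration, not the network architecture claimed by the disclosure.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Recurrent unit on feature maps; stands in for the first (timing) feature
    extraction sub-network. No downsampling is applied to the sparse DVS frames."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.conv_zr = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, 3, padding=1)
        self.conv_h = nn.Conv2d(in_ch + hid_ch, hid_ch, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.conv_zr(torch.cat([x, h], 1))).chunk(2, 1)
        h_new = torch.tanh(self.conv_h(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_new

class ImageGenerationNetwork(nn.Module):
    def __init__(self, hid_ch=32):
        super().__init__()
        self.event_rnn = ConvGRUCell(1, hid_ch)            # first sub-network (timing features)
        self.color_cnn = nn.Sequential(                    # second sub-network (image features)
            nn.Conv2d(3, hid_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hid_ch, hid_ch, 3, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(                      # image reconstruction from fused features
            nn.Conv2d(2 * hid_ch, hid_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hid_ch, 3, 3, padding=1), nn.Sigmoid())
        self.hid_ch = hid_ch

    def forward(self, event_frames, color_image):
        # event_frames: (B, T, 1, H, W); color_image: (B, 3, H, W)
        b, t, _, h, w = event_frames.shape
        second_feat = self.color_cnn(color_image)          # second feature map
        hidden = torch.zeros(b, self.hid_ch, h, w, device=event_frames.device)
        outputs = []
        for i in range(t):
            hidden = self.event_rnn(event_frames[:, i], hidden)   # first feature map at step i
            fused = torch.cat([hidden, second_feat], dim=1)       # third feature map (feature fusion)
            outputs.append(self.decoder(fused))                   # reconstructed second color image
        return torch.stack(outputs, dim=1)                        # (B, T, 3, H, W)
```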
In one possible implementation, a plurality of second color images over the first time period may be obtained through the above-described processing. That is, if color images were captured only by a pixel acquisition device with a lower frame rate, such as a camera, only a small number of color images could be obtained in the first time period; in the above manner, a plurality of color images can be generated in the first time period, so that the number of color image frames in the first time period is increased, the time interval between frames is smaller, the movement amplitude of the target between frames is smaller, and tracking the motion and position of the target is facilitated.
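As an illustration of this effect, the following usage sketch chains the event-binning helper and the generator sketched above to turn one captured color frame plus 10 ms of events into several intermediate color frames. The resolution, event count, and timing values are arbitrary assumptions.

```python
import numpy as np
import torch

# Synthetic events standing in for DVS output: columns (t in ms, x, y, polarity).
events = np.column_stack([
    np.random.uniform(0, 10, 1000),         # timestamps within a 10 ms first time period
    np.random.randint(0, 320, 1000),        # x coordinate
    np.random.randint(0, 240, 1000),        # y coordinate
    np.random.choice([-1.0, 1.0], 1000)])   # pulse polarity

frames = events_to_frames(events, t_start=0.0, t_end=10.0, interval=2.0,
                          height=240, width=320)                   # (5, 240, 320)
event_frames = torch.from_numpy(frames).unsqueeze(0).unsqueeze(2)  # (1, 5, 1, 240, 320)
first_color = torch.rand(1, 3, 240, 320)                           # placeholder for the captured frame

net = ImageGenerationNetwork()
with torch.no_grad():
    second_colors = net(event_frames, first_color)                 # (1, 5, 3, 240, 320)
```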
In one possible implementation, the image generation network may be trained prior to generating the second color image via the image generation network. For example, the image generation network may be trained by generating countermeasure training. In an example, the manner in which the countermeasure training is generated may increase the degree of realism of the image generated by the image generation network, i.e. by generating the countermeasure training, the image generated by the image generation network may be made more realistic.
In one possible implementation, the method further includes: inputting a plurality of first sample dynamic visual images and first sample color images into an image generation network for processing to obtain second sample color images, wherein the first sample dynamic visual images are images fused by sample dynamic visual information acquired in a second time period, and the first sample color images are color images acquired in the second time period; and carrying out authenticity countermeasure training on the image generation network according to the first sample color image, the second sample color image and the discrimination network to obtain a trained image generation network.
In one possible implementation, the first sample dynamic visual image and the first sample color image may be used as training samples. The first sample dynamic visual image may be an image in which sample dynamic visual information obtained in the second period of time is fused, and the first sample color image is a color image obtained in the second period of time. The second period of time is similar to the first period of time, and may be a period of time including a time when at least one frame of color image is acquired, and a start-stop time of the second period of time may be coincident with or not coincident with a time when the color image is acquired.
In one possible implementation, the sample dynamic visual information over the second time period may be grouped and each group of sample dynamic visual information may be fused. The fusion method is the same as the above method for fusing the dynamic visual information in the first period, and will not be described herein.
In one possible implementation, the first sample dynamic visual image and the first sample color image may be input to an image generation network to generate a second sample color image. The second sample color image may have errors, i.e., the degree of realism may be insufficient. The image generation network may be subjected to an authenticity challenge training to enhance the authenticity of the image generated by the image generation network.
In one possible implementation, according to the first sample color image, the second sample color image, and a discrimination network, performing an authenticity countermeasure training on the image generation network, and obtaining a trained image generation network includes: inputting the first sample color image or the second sample color image into a discrimination network to obtain a discrimination result; determining discrimination loss according to the discrimination result; adjusting network parameters of the discrimination network and the image generation network according to the discrimination loss; and under the condition that the image generation network and the judging network meet training conditions, obtaining the trained image generation network and the judging network.
In one possible implementation, in the authenticity countermeasure training, the actually photographed first sample color image or the generated second sample color image may be input into the discrimination network, and the discrimination network may produce a discrimination result, that is, judge whether the input image is real. However, the discrimination result of the discrimination network may contain errors; that is, the first sample color image obtained by real photographing may be misjudged as a generated image, and the generated second sample color image may be misjudged as an image obtained by real photographing. A discrimination loss may be determined based on these errors and used for back propagation to adjust the network parameters of the discrimination network and the image generation network, so that the images generated by the image generation network become more realistic and the discrimination capability of the discrimination network is improved.
In one possible implementation, the training steps may be performed iteratively, so as to simultaneously improve the reality of the image generated by the image generating network and the discrimination capability of the discrimination network until the image generating network and the discrimination network meet the training condition. For example, the performance of the image generation network and the performance of the discrimination network are balanced by the countermeasure training described above. That is, in the case where the judging capability of the judging network is strong, the second sample color image generated by the image generating network can still make it difficult for the judging network to judge authenticity, that is, make the second sample image generated by the image generating network sufficiently realistic.
In this way, the fidelity of the image generated by the image generating network can be improved through the authenticity countermeasure training, so that the image generating network can generate a color image with enough fidelity, and the tracking of the motion trail of the target is facilitated.
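A minimal training-loop sketch of this authenticity countermeasure training is given below, reusing the generator sketch from above. The discriminator architecture, the binary cross-entropy discrimination loss, and the optimizer settings are assumptions; the disclosure only requires that a discrimination loss be determined from the discrimination result and back-propagated to both networks.

```python
import torch
import torch.nn as nn

# A small discriminator that scores whether a color image looks real (assumed architecture).
discriminator = nn.Sequential(
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

generator = ImageGenerationNetwork()   # sketch from the previous listing
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def training_step(sample_event_frames, first_sample_color):
    """One authenticity countermeasure training step.

    sample_event_frames: (B, T, 1, H, W) first sample dynamic visual images.
    first_sample_color:  (B, 3, H, W) first sample color image (real photograph).
    """
    # Generate second sample color images and treat each time step as one image.
    fake = generator(sample_event_frames, first_sample_color).flatten(0, 1)
    real = first_sample_color

    # Discrimination loss: real images should be judged real (1), generated ones fake (0).
    d_loss = bce(discriminator(real), torch.ones(real.size(0), 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(fake.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: make the generated images hard for the discriminator to reject.
    g_loss = bce(discriminator(fake), torch.ones(fake.size(0), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```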
Fig. 5 shows an application schematic of an image processing method according to an embodiment of the present disclosure. The frame rate at which the dynamic vision receptor obtains dynamic visual information is higher than the frame rate at which the pixel acquisition device (e.g., a camera) acquires color images. When tracking a target in a preset scene, the time interval between color images captured by an ordinary pixel acquisition device is large, and a fast-moving target cannot be accurately tracked. Therefore, the dynamic visual information can be grouped, and each group can be fused to generate a dynamic visual image. Further, the dynamic visual images can be processed by the image generation network to generate color images, which are inserted into the time intervals between the color images captured by the pixel acquisition device. After this insertion processing, the time interval between color images is reduced, the movement amplitude of the target within each interval is smaller, and the efficiency of tracking the moving target is improved.
In one possible implementation, the image generation network may undergo authenticity countermeasure training to enhance the fidelity with which it generates images. In an example, a first sample dynamic visual image may be input into the first feature extraction sub-network of the image generation network to extract timing feature information, and the first sample color image may be input into the second feature extraction sub-network of the image generation network to extract image feature information (e.g., feature information such as color and outline). Further, the features extracted by the two feature extraction sub-networks may be fused and decoded to generate a second sample color image.
In an example, during the authenticity countermeasure training, the second sample color image or the first sample color image may be input into the discrimination network, the discrimination loss is determined according to the discrimination result of the discrimination network, and the discrimination loss is back-propagated to adjust the network parameters of the image generation network and the discrimination network, improving the fidelity of the images generated by the image generation network and enhancing the discrimination capability of the discrimination network.
The image generated by the trained image generation network has high enough fidelity and can be used in an actual scene for tracking a moving target, and the application field of the image processing method is not limited.
According to the image processing method of the embodiments of the present disclosure, the plurality of dynamic visual information in the first time period can be divided so that the fused dynamic visual images retain the time information of the dynamic visual information groups, which facilitates tracking the trajectory of the moving object. The timing feature information and the image feature information are obtained through the first feature extraction sub-network and the second feature extraction sub-network respectively, which reduces the image distortion problem caused by the sparseness of pixel information in the dynamic visual images. In addition, the first color image and the dynamic visual images can be used to generate the second color images, and the image generation network used for generating the images has undergone authenticity countermeasure training, so the generated second color images have higher authenticity, which facilitates tracking the motion trajectory of the target.
Fig. 6 shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure, as shown in fig. 6, the apparatus includes: the segmentation module 11 is configured to segment dynamic visual information in a first time period of a preset scene according to a first time interval, and generate a plurality of dynamic visual images of the preset scene, where the first time interval is smaller than a second time interval, and the second time interval is a time interval during which a pixel acquisition device acquires a first color image of the preset scene; the generating module 12 is configured to input a first color image and a dynamic visual image in a first period of time into the image generating network for processing, so as to obtain second color images respectively corresponding to the dynamic visual images in the first period of time; the image generation network is a neural network obtained through authenticity countermeasure training, and the authenticity countermeasure training is used for training the authenticity of the image generated by the image generation network.
In one possible implementation, the segmentation module is further configured to: dividing the plurality of dynamic visual information in the first time period according to the first time interval to obtain a plurality of dynamic visual information groups; and respectively carrying out fusion processing on the dynamic visual information in the dynamic visual information groups to obtain dynamic visual images corresponding to the dynamic visual information groups.
In one possible implementation, the image generation network includes a first feature extraction sub-network for extracting timing feature information of the dynamic visual image and a second feature extraction sub-network for extracting image feature information of the color image.
In one possible implementation, the generating module is further configured to: inputting the plurality of dynamic visual images into a first feature extraction sub-network for processing to respectively obtain first feature images of the plurality of dynamic visual images; inputting the first color image into a second feature extraction sub-network for processing to obtain a second feature image; performing feature fusion processing on the first feature image and the second feature image to obtain a third feature image corresponding to the first feature image; and performing image reconstruction processing according to the third feature map to obtain the second color image.
In one possible implementation, the apparatus further includes: the training module is used for inputting a plurality of first sample dynamic visual images and first sample color images into the image generation network for processing to obtain second sample color images, wherein the first sample dynamic visual images are images fused by sample dynamic visual information acquired in a second time period, and the first sample color images are color images acquired in the second time period; and carrying out authenticity countermeasure training on the image generation network according to the first sample color image, the second sample color image and the discrimination network to obtain a trained image generation network.
In one possible implementation, the training module is further configured to: inputting the first sample color image or the second sample color image into a discrimination network to obtain a discrimination result; determining discrimination loss according to the discrimination result; adjusting network parameters of the discrimination network and the image generation network according to the discrimination loss; and under the condition that the image generation network and the judging network meet training conditions, obtaining the trained image generation network and the judging network.
It will be appreciated that the above-mentioned method embodiments of the present disclosure may be combined with each other to form combined embodiments without departing from the principle and logic, which are not repeated in the present disclosure due to space limitations. It will be appreciated by those skilled in the art that, in the above-described methods of the specific embodiments, the specific order of execution of the steps should be determined by their functions and possible internal logic.
In addition, the present disclosure further provides an image processing apparatus, an electronic device, a computer-readable storage medium, and a program, all of which can be used to implement any of the image processing methods provided in the present disclosure; for the corresponding technical solutions and descriptions, reference is made to the corresponding descriptions of the method parts, which are not repeated here.
In some embodiments, functions or modules included in an apparatus provided by the embodiments of the present disclosure may be used to perform a method described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
The disclosed embodiments also provide a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method. The computer readable storage medium may be a non-volatile computer readable storage medium.
The embodiment of the disclosure also provides an electronic device, which comprises: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to invoke the instructions stored in the memory to perform the above method.
The disclosed embodiments also provide a computer program product comprising computer readable code which, when run on a device, causes a processor in the device to execute instructions for implementing the image processing method as provided in any of the embodiments above.
The disclosed embodiments also provide another computer program product for storing computer readable instructions that, when executed, cause a computer to perform the operations of the image processing method provided in any of the above embodiments.
The electronic device may be provided as a terminal, server or other form of device.
Fig. 7 illustrates a block diagram of an electronic device 800, according to an embodiment of the disclosure. For example, electronic device 800 may be a mobile phone, computer, digital broadcast terminal, messaging device, game console, tablet device, medical device, exercise device, personal digital assistant, or the like.
Referring to fig. 7, an electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or part of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interactions between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or nonvolatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action, but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera may receive external multimedia data. Each of the front camera and the rear camera may be a fixed optical lens system or have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may be further stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, such as a keyboard, a click wheel, or buttons. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor assembly 814 may detect an on/off state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor assembly 814 may also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in temperature of the electronic device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements, for performing the methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 804 including computer program instructions executable by processor 820 of electronic device 800 to perform the above-described methods.
Fig. 8 illustrates a block diagram of an electronic device 1900 according to an embodiment of the disclosure. For example, the electronic device 1900 may be provided as a server. Referring to fig. 8, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs. The application programs stored in the memory 1932 may include one or more modules each corresponding to a set of instructions. Further, the processing component 1922 is configured to execute the instructions to perform the methods described above.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as memory 1932, including computer program instructions executable by processing component 1922 of electronic device 1900 to perform the methods described above.
The present disclosure may be a system, method, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for causing a processor to implement aspects of the present disclosure.
The computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: a portable computer disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable Compact Disk Read-Only Memory (CD-ROM), a Digital Versatile Disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, an optical pulse through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to the respective computing/processing devices, or to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network interface card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.
Computer program instructions for carrying out the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present disclosure are implemented by personalizing electronic circuitry, such as programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs), with state information of the computer-readable program instructions, the electronic circuitry being able to execute the computer-readable program instructions.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied as a computer storage medium, and in another alternative embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK), or the like.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, the practical application, or the technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (7)

1. An image processing method, comprising:
dividing dynamic visual information in a first time period of a preset scene according to a first time interval to generate a plurality of dynamic visual images of the preset scene, wherein the first time interval is smaller than a second time interval, and the second time interval is a time interval at which a pixel acquisition device acquires a first color image of the preset scene;
inputting the first color image and the dynamic visual images in the first time period into an image generation network for processing to obtain second color images respectively corresponding to the dynamic visual images in the first time period, wherein the image generation network is a neural network obtained through authenticity countermeasure training, and the authenticity countermeasure training is used to train the authenticity of the images generated by the image generation network;
wherein inputting the first color image and the dynamic visual images in the first time period into the image generation network for processing to obtain the second color images respectively corresponding to the dynamic visual images in the first time period comprises:
inputting the plurality of dynamic visual images into a first feature extraction sub-network for processing to respectively obtain first feature images of the plurality of dynamic visual images;
inputting the first color image into a second feature extraction sub-network for processing to obtain a second feature image;
performing feature fusion processing on the first feature images and the second feature image to obtain third feature images corresponding to the first feature images;
performing image reconstruction processing according to the third feature images to obtain the second color images;
wherein the method further comprises:
inputting a plurality of first sample dynamic visual images and first sample color images into the image generation network for processing to obtain second sample color images, wherein the first sample dynamic visual images are images obtained by fusing sample dynamic visual information acquired in a second time period, and the first sample color images are color images acquired in the second time period;
performing authenticity countermeasure training on the image generation network according to the first sample color image, the second sample color image and the discrimination network to obtain a trained image generation network;
wherein performing the authenticity countermeasure training on the image generation network according to the first sample color image, the second sample color image and the discrimination network to obtain the trained image generation network comprises:
inputting the first sample color image or the second sample color image into a discrimination network to obtain a discrimination result;
determining discrimination loss according to the discrimination result;
adjusting network parameters of the discrimination network and the image generation network according to the discrimination loss;
and under the condition that the image generation network and the discrimination network meet a training condition, obtaining the trained image generation network and the trained discrimination network.
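As an aid to reading claim 1, the following is a minimal, non-limiting Python sketch of one authenticity countermeasure (adversarial) training step; the class names, layer sizes, the binary cross-entropy formulation of the discrimination loss, and the optimizer handling are assumptions made for illustration and are not taken from the present disclosure.

# Illustrative sketch only: one authenticity countermeasure training step.
# All names, layer sizes and the BCE discrimination loss are assumptions.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    # Judges whether a color image is a captured (real) frame or a generated one.
    def __init__(self, in_channels=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, x):
        return self.body(x)  # raw logit; larger values mean "judged real"

def countermeasure_step(generator, discriminator, opt_g, opt_d,
                        sample_dvs_images, first_sample_color):
    # sample_dvs_images: fused first sample dynamic visual images, shape (N, C, H, W)
    # first_sample_color: first sample color image captured in the same second time period
    bce = nn.BCEWithLogitsLoss()

    # The image generation network produces second sample color images.
    second_sample_color = generator(sample_dvs_images, first_sample_color)

    # Discrimination loss: real frames should be judged real, generated frames fake.
    d_real = discriminator(first_sample_color)
    d_fake = discriminator(second_sample_color.detach())
    d_loss = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: adjust the image generation network so its outputs pass as real.
    g_fake = discriminator(second_sample_color)
    g_loss = bce(g_fake, torch.ones_like(g_fake))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()

In such a sketch, steps of this kind would be repeated until a training condition is met (for example, a fixed iteration budget or convergence of the losses), after which both the trained image generation network and the trained discrimination network are obtained.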
2. The method of claim 1, wherein dividing the dynamic visual information of the preset scene according to the first time interval to generate the plurality of dynamic visual images of the preset scene comprises:
dividing the plurality of dynamic visual information in the first time period according to the first time interval to obtain a plurality of dynamic visual information groups;
and respectively carrying out fusion processing on the dynamic visual information in the dynamic visual information groups to obtain dynamic visual images corresponding to the dynamic visual information groups.
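Claim 2 describes dividing the dynamic visual information of the first time period into groups and fusing each group into one dynamic visual image. A minimal Python sketch of one way to do this follows, assuming the dynamic visual information arrives as time-stamped per-pixel events; the (timestamp, x, y, polarity) layout and the signed-accumulation fusion are assumptions, not details taken from the disclosure.

# Illustrative sketch only: group DVS events at the first time interval and
# fuse each group into a dynamic visual image by accumulating event polarities.
import numpy as np

def events_to_dynamic_visual_images(events, t_start, t_end, first_interval,
                                    height, width):
    # events: array of shape (N, 4) holding (timestamp, x, y, polarity) per event
    n_groups = int(np.ceil((t_end - t_start) / first_interval))
    images = np.zeros((n_groups, height, width), dtype=np.float32)

    # Assign each event to the dynamic visual information group covering its timestamp.
    group_idx = ((events[:, 0] - t_start) // first_interval).astype(int)
    group_idx = np.clip(group_idx, 0, n_groups - 1)

    # Fuse each group: accumulate signed polarities per pixel.
    for g, (_, x, y, p) in zip(group_idx, events):
        images[g, int(y), int(x)] += 1.0 if p > 0 else -1.0
    return images  # one dynamic visual image per dynamic visual information group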
3. The method of claim 1, wherein the image generation network comprises a first feature extraction sub-network for extracting timing feature information of the dynamic visual image and a second feature extraction sub-network for extracting image feature information of the color image.
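Claim 3 names the two feature extraction sub-networks used by claim 1. Purely as a non-limiting sketch, the image generation network could be organized as below; the channel counts, the simple convolutional layers, and the concatenation-based feature fusion are assumptions chosen for brevity rather than the architecture actually disclosed.

# Illustrative sketch only: two-branch image generation network with feature
# fusion and image reconstruction. Layer choices are assumptions.
import torch
import torch.nn as nn

class ImageGenerator(nn.Module):
    def __init__(self, dvs_channels=1, color_channels=3, feat=32):
        super().__init__()
        # First feature extraction sub-network: timing features of the DVS images.
        self.dvs_branch = nn.Sequential(
            nn.Conv2d(dvs_channels, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # Second feature extraction sub-network: image features of the color image.
        self.color_branch = nn.Sequential(
            nn.Conv2d(color_channels, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU())
        # Image reconstruction from the fused (third) feature maps.
        self.reconstruct = nn.Sequential(
            nn.Conv2d(2 * feat, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, color_channels, 3, padding=1), nn.Sigmoid())

    def forward(self, dvs_images, first_color):
        # dvs_images: (N, dvs_channels, H, W), one slice per dynamic visual image
        # first_color: (1, color_channels, H, W), the single captured first color image
        first_feats = self.dvs_branch(dvs_images)             # first feature images
        second_feat = self.color_branch(first_color)          # second feature image
        second_feat = second_feat.expand(dvs_images.size(0), -1, -1, -1)
        fused = torch.cat([first_feats, second_feat], dim=1)  # third feature images
        return self.reconstruct(fused)  # one second color image per dynamic visual image

Used together with the training sketch after claim 1, such a generator would take the plurality of dynamic visual images and the single first color image and return one reconstructed second color image per dynamic visual image.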
4. An image processing apparatus, comprising:
a segmentation module configured to divide dynamic visual information in a first time period of a preset scene according to a first time interval to generate a plurality of dynamic visual images of the preset scene, wherein the first time interval is smaller than a second time interval, and the second time interval is a time interval at which a pixel acquisition device acquires a first color image of the preset scene;
a generation module configured to input the first color image and the dynamic visual images in the first time period into an image generation network for processing to obtain second color images respectively corresponding to the dynamic visual images in the first time period, wherein the image generation network is a neural network obtained through authenticity countermeasure training, and the authenticity countermeasure training is used to train the authenticity of the images generated by the image generation network;
wherein the generation module is further configured to: input the plurality of dynamic visual images into a first feature extraction sub-network for processing to respectively obtain first feature images of the plurality of dynamic visual images; input the first color image into a second feature extraction sub-network for processing to obtain a second feature image; perform feature fusion processing on the first feature images and the second feature image to obtain third feature images corresponding to the first feature images; and perform image reconstruction processing according to the third feature images to obtain the second color images;
wherein the apparatus further comprises: a training module configured to input a plurality of first sample dynamic visual images and first sample color images into the image generation network for processing to obtain second sample color images, wherein the first sample dynamic visual images are images obtained by fusing sample dynamic visual information acquired in a second time period, and the first sample color images are color images acquired in the second time period; and to perform authenticity countermeasure training on the image generation network according to the first sample color image, the second sample color image and the discrimination network to obtain a trained image generation network;
wherein the training module is further configured to: input the first sample color image or the second sample color image into a discrimination network to obtain a discrimination result; determine a discrimination loss according to the discrimination result; adjust network parameters of the discrimination network and the image generation network according to the discrimination loss; and obtain the trained image generation network and the trained discrimination network under the condition that the image generation network and the discrimination network meet a training condition.
5. The apparatus of claim 4, wherein the segmentation module is further configured to:
dividing the plurality of dynamic visual information in the first time period according to the first time interval to obtain a plurality of dynamic visual information groups;
and respectively carrying out fusion processing on the dynamic visual information in the dynamic visual information groups to obtain dynamic visual images corresponding to the dynamic visual information groups.
6. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 3.
7. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 3.
CN202110799900.7A 2021-07-15 2021-07-15 Image processing method and device, electronic equipment and storage medium Active CN113506320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110799900.7A CN113506320B (en) 2021-07-15 2021-07-15 Image processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110799900.7A CN113506320B (en) 2021-07-15 2021-07-15 Image processing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113506320A CN113506320A (en) 2021-10-15
CN113506320B true CN113506320B (en) 2024-04-12

Family

ID=78013428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110799900.7A Active CN113506320B (en) 2021-07-15 2021-07-15 Image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113506320B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117788306A (en) * 2023-12-18 2024-03-29 上海贝特威自动化科技有限公司 Multithreading-based multi-focal-length tab image fusion method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111462268A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Image reconstruction method and device, electronic equipment and storage medium
WO2020155711A1 (en) * 2019-02-02 2020-08-06 深圳市商汤科技有限公司 Image generating method and apparatus, electronic device, and storage medium
CN112200757A (en) * 2020-09-29 2021-01-08 北京灵汐科技有限公司 Image processing method, image processing device, computer equipment and storage medium
WO2021031506A1 (en) * 2019-08-22 2021-02-25 北京市商汤科技开发有限公司 Image processing method and apparatus, electronic device, and storage medium
CN112578675A (en) * 2021-02-25 2021-03-30 中国人民解放军国防科技大学 High-dynamic vision control system and task allocation and multi-core implementation method thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3844714A4 (en) * 2018-10-26 2021-12-01 Samsung Electronics Co., Ltd. Method and apparatus for image segmentation using an event sensor

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020155711A1 (en) * 2019-02-02 2020-08-06 深圳市商汤科技有限公司 Image generating method and apparatus, electronic device, and storage medium
WO2021031506A1 (en) * 2019-08-22 2021-02-25 北京市商汤科技开发有限公司 Image processing method and apparatus, electronic device, and storage medium
CN111462268A (en) * 2020-03-31 2020-07-28 北京市商汤科技开发有限公司 Image reconstruction method and device, electronic equipment and storage medium
CN112200757A (en) * 2020-09-29 2021-01-08 北京灵汐科技有限公司 Image processing method, image processing device, computer equipment and storage medium
CN112578675A (en) * 2021-02-25 2021-03-30 中国人民解放军国防科技大学 High-dynamic vision control system and task allocation and multi-core implementation method thereof

Also Published As

Publication number Publication date
CN113506320A (en) 2021-10-15

Similar Documents

Publication Publication Date Title
CN109257645B (en) Video cover generation method and device
CN111553864B (en) Image restoration method and device, electronic equipment and storage medium
US20210319538A1 (en) Image processing method and device, electronic equipment and storage medium
WO2022062896A1 (en) Livestreaming interaction method and apparatus
CN111445414B (en) Image processing method and device, electronic equipment and storage medium
CN108154465B (en) Image processing method and device
CN111340731B (en) Image processing method and device, electronic equipment and storage medium
CN112991553B (en) Information display method and device, electronic equipment and storage medium
CN111523346B (en) Image recognition method and device, electronic equipment and storage medium
CN112085097A (en) Image processing method and device, electronic equipment and storage medium
CN111369482A (en) Image processing method and device, electronic equipment and storage medium
CN113506320B (en) Image processing method and device, electronic equipment and storage medium
CN111553865B (en) Image restoration method and device, electronic equipment and storage medium
CN113032627A (en) Video classification method and device, storage medium and terminal equipment
CN113506324B (en) Image processing method and device, electronic equipment and storage medium
CN113506325B (en) Image processing method and device, electronic equipment and storage medium
CN111507131B (en) Living body detection method and device, electronic equipment and storage medium
CN113506322B (en) Image processing method and device, electronic equipment and storage medium
CN113506319B (en) Image processing method and device, electronic equipment and storage medium
CN112330721B (en) Three-dimensional coordinate recovery method and device, electronic equipment and storage medium
CN115457024A (en) Method and device for processing cryoelectron microscope image, electronic equipment and storage medium
CN113506321A (en) Image processing method and device, electronic equipment and storage medium
CN113506229B (en) Neural network training and image generating method and device
CN113506323B (en) Image processing method and device, electronic equipment and storage medium
CN112906467A (en) Group photo image generation method and device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant