CN118101856A - Image processing method and electronic equipment - Google Patents
- Publication number
- CN118101856A (application CN202410339775.5A)
- Authority
- CN
- China
- Prior art keywords
- image
- target
- motion
- training
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N5/00—Details of television systems
- H04N5/14—Picture signal circuitry for video frequency region
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/439—Processing of audio elementary streams
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
Abstract
Embodiments of the application relate to the field of image processing and provide an image processing method and electronic equipment. The method includes: acquiring an image to be processed; and generating, according to the image to be processed, N target images in a target video and N target audios in the target video, where the N target images correspond one-to-one to the N target audios, N is an integer greater than 1, and each target audio is played while the target image corresponding to that target audio is displayed. With the technical solution provided by the application, the content of the played target audio is better coordinated with the content of the displayed target image, which improves the user experience.
Description
Technical Field
The present application relates to the field of image processing, and more particularly, to an image processing method and an electronic device.
Background
Nature is always in motion. Motion is one of the most attractive visual signals, and humans are very sensitive to it. Generating a dynamic image from a still image can therefore improve the user experience.
Meanwhile, various sounds exist in nature. Matching audio to a moving picture, so that the audio is played while the moving picture is displayed, enhances emotional expression and makes the displayed content more interactive and attractive. However, the sound matched to the moving picture may not be coordinated with the content of the images in the displayed moving picture, which affects the user experience.
Disclosure of Invention
The application provides an image processing method and electronic equipment, which can improve the coordination between the content of the played sound and the content of the displayed images and thereby improve the user experience.
In a first aspect, there is provided an image processing method including: acquiring an image to be processed; generating N target images in a target video and N target audios in the target video according to the image to be processed, wherein the N target images are in one-to-one correspondence with the N target audios, N is an integer greater than 1, and each target audio is used for playing under the condition that the target image corresponding to the target audio is displayed.
According to the method provided by the application, a plurality of target images are generated according to the image to be processed, and the audio corresponding to each target image is generated, so that the coordination between the played target audio and the content of the displayed target image is higher, and the user experience is improved.
In some possible implementations, the generating N target images in a target video and N target audios in the target video according to the image to be processed includes: sequentially processing a plurality of first images by using an image processing system to obtain at least one second image corresponding to each first image and the target audio corresponding to each second image, where the plurality of target images include the at least one second image corresponding to each first image, the at least one second image corresponding to each first image is an image after that first image in the target video, the 1st first image processed by the image processing system is the image to be processed, the i-th first image processed by the image processing system is an image among the at least one second image corresponding to the (i-1)-th first image processed by the image processing system, i is a positive integer greater than 1, and the image processing system includes a neural network model obtained through training.
By taking a second image obtained in the previous round of processing as the first image to be processed next by the image processing system, the generated images can be prevented from drifting or diverging even when the target video is long, so that the target images in the target video have good image quality and the user experience is improved.
In some possible implementations, the method further includes: acquiring a target subject indicated by a user in the image to be processed; and when the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image and target subject area information to obtain at least one second image corresponding to the first image and the target audio corresponding to each second image, where the target subject area information represents a target subject area, the target subject is recorded in the target subject area in the first image, and the content recorded in areas other than the target subject area in each second image is the same as the content recorded in those areas in the first image.
The portion outside the target subject area where the target subject is located does not change in the second image compared with the first image. Therefore, the motion recorded in the target video follows the user's indication, which increases the user's involvement in video generation and improves the user experience.
In some possible implementations, in a case that the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image and target subject area information to obtain at least one second image corresponding to the first image, and the target audio corresponding to each second image.
When the first image processed by the image processing system is the first one (i.e., the image to be processed), the image processing system processes the first image together with the target subject area information; when the first image processed is not the first one, the image processing system may process only the first image, and the target subject area information need no longer be input to the image processing system. In this way, both the motion of the target subject and the motion of other subjects affected by the target subject can be reflected in the target video, so that the motion recorded in the target video is more reasonable and the user experience is improved.
In some possible implementations, the method further includes: acquiring a target motion trend, indicated by the user, of the target subject; and in the case that the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image, the target subject area information and motion information to obtain the target image corresponding to each of the at least one second moment and the target audio corresponding to each second moment, where the target subject in the at least one target image corresponding to the at least one second moment moves according to the target motion trend represented by the motion information.
The image processing system generates the target video according to the target subject indicated by the user and the target motion trend of that subject, so that the motion of the target subject in the target video follows the target motion trend. This increases the user's involvement in video generation and improves the user experience.
In some possible implementations, the image processing system includes a feature prediction model, an image generation model, and an audio generation model; the feature prediction model is used for processing the first image to obtain at least one prediction feature, and second moments corresponding to different prediction features are different, wherein each second moment is a moment after a first moment corresponding to the first image in the target video; the image generation model is used for respectively processing the at least one prediction feature to obtain a second image corresponding to each second moment; the audio generation model is used for respectively processing the at least one prediction feature to obtain target audio corresponding to each second moment.
The image processing system processes the first image to obtain a prediction feature corresponding to the second moment, and generates target audio corresponding to the second moment and a second image corresponding to the second moment according to the prediction feature corresponding to the second moment, so that the coordination of the target audio corresponding to the same moment and the content of the target image is higher, and the user experience is improved.
In some possible implementations, the feature prediction model includes a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model, and an adjustment model; the motion displacement field prediction model is used for processing the first image to obtain a motion displacement field corresponding to each second moment, and the motion displacement field corresponding to each second moment represents the displacement of a plurality of pixels in the first image relative to the first moment at the second moment; the motion feature extraction model is used for extracting features of the at least one motion displacement field respectively to obtain motion features corresponding to each second moment; the image feature extraction model is used for extracting features of the first image to obtain image features; the adjustment model is used for adjusting the image characteristics according to the motion characteristics corresponding to each second moment so as to obtain the prediction characteristics corresponding to the second moment.
The motion displacement field used for representing the displacement of the plurality of pixels in the first image between the first moment and the second moment is predicted, and the image characteristics of the first image are adjusted according to the motion characteristics of the motion displacement field, so that the predicted characteristics are more accurate.
In some possible implementations, the method further includes: acquiring a target subject indicated by a user in the image to be processed; the motion displacement field prediction model is used for processing the first image and target subject area information to obtain the motion displacement field corresponding to each second moment, where the target subject area information represents a target subject area, and the target subject is recorded in the target subject area in the image to be processed; and the motion displacement field corresponding to each second moment represents that the displacement of out-of-area pixels is 0, the out-of-area pixels being located outside the target subject area in the first image.
In some possible implementations, the method further includes: acquiring a target motion trend, indicated by a user, of the target subject; and in the case that the first image processed by the image processing system is the image to be processed, the motion displacement field prediction model is used for processing the first image, the target subject area information and the motion information to obtain the motion displacement field corresponding to each second moment, where the displacement of the target subject pixels represented by the motion displacement field corresponding to each second moment conforms to the target motion trend, the target subject pixels being the pixels located on the target subject in the first image.
In a second aspect, there is provided a training method of an image processing system, the method comprising: acquiring a training sample, a tag image and tag audio, where the training sample comprises a sample image in a training video, the tag image is an image after the sample image in the training video, and the tag audio is the audio in the training video at the time the tag image is displayed; processing the training sample by using an initial image processing system to obtain a training image and training audio; and adjusting parameters of the initial image processing system according to a first difference between the training image and the tag image and a second difference between the training audio and the tag audio, where the adjusted initial image processing system is the image processing system obtained by training.
In some possible implementations, the training sample may further include training subject area information, where the training subject area information indicates a training subject area, and a training target subject is recorded in the training subject area in the sample image, and content recorded in other areas, except for the training subject area, in the tag image is the same as content recorded in the other areas in the sample image.
In some possible implementations, where the sample image is a first frame image in the training video, the training sample includes training subject region information.
In some possible implementations, the training samples include training motion information. Compared with the sample image, the training target subject in the tag image moves according to the training motion trend indicated by the training motion information.
For example, where the sample image is a first frame image in a training video, the training sample may include training motion information.
In some possible implementations, the initial image processing system includes an initial feature prediction model, an initial image generation model, and an initial audio generation model. The initial feature prediction model is used for processing the training samples to obtain training prediction features. The initial image generation model is used for processing the training prediction features to obtain a training image. The initial audio generation model is used to process the predicted features to obtain training audio. The initial characteristic prediction model after parameter adjustment is a characteristic prediction model in the image processing system, the initial image generation model after parameter adjustment is an image generation model in the image processing system, and the initial audio generation model after parameter adjustment is an audio generation model in the image processing system.
In some possible implementations, the initial feature prediction model includes a motion displacement field prediction model, an initial motion feature extraction model, an initial image feature extraction model, and an adjustment model. The motion displacement field prediction model is used for processing training samples to obtain a training motion displacement field, and the training motion displacement field represents the displacement of a plurality of training pixels in a sample image at a second training moment corresponding to a label image relative to a first training moment corresponding to the sample image. The second training time corresponding to the label image and the first training time corresponding to the sample image can be respectively understood as the time of the label image and the sample image in the training video. The initial motion feature extraction model is used for extracting features of the motion displacement field to obtain training motion features. The initial image feature extraction model is used for extracting features of the sample image to obtain training image features. The adjustment model is used for adjusting the training image characteristics according to the training motion characteristics so as to obtain training prediction characteristics. The initial motion feature extraction model after parameter adjustment is a motion feature extraction model in an image processing system, and the initial image feature extraction model after parameter adjustment is an image feature extraction model in the image processing system.
In a third aspect, an image processing apparatus is provided comprising means for performing the method of the first and/or second aspects. The device can be a terminal device or a chip in the terminal device.
In a fourth aspect, an electronic device is provided that includes one or more processors, and a memory; the memory is coupled with the one or more processors, the memory for storing computer program code comprising computer instructions that the one or more processors invoke the computer instructions to cause the electronic device to perform the method of the first aspect and/or the second aspect.
In a fifth aspect, there is provided a chip system for application to an electronic device, the chip system comprising one or more processors for invoking computer instructions to cause the electronic device to perform the method of the first and/or second aspects.
In a sixth aspect, there is provided a computer readable storage medium comprising instructions which, when run on an electronic device, cause the electronic device to perform the method of the first and/or second aspects.
In a seventh aspect, there is provided a computer program product for, when run on an electronic device, causing the electronic device to perform the method of the first and/or second aspects.
Drawings
FIG. 1 is a schematic diagram of a hardware system of an electronic device suitable for use with the present application;
FIG. 2 is a schematic diagram of a software system suitable for use with an electronic device of the present application;
FIG. 3 is a schematic flow chart of an image processing method provided by an embodiment of the present application;
FIG. 4 is a schematic block diagram of an image processing system according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the principles of generative models to which embodiments of the present application are applicable;
FIG. 6 is a schematic block diagram of a random motion displacement field prediction model provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of the position of a pixel in a video over time;
FIG. 8 is a schematic diagram of a hidden diffusion model according to an embodiment of the present application;
FIG. 9 is a schematic flow chart of a training method of an image processing system according to an embodiment of the present application;
FIG. 10 is a schematic block diagram of a system architecture provided by an embodiment of the present application;
FIGS. 11-14 are schematic diagrams of graphical user interfaces provided by embodiments of the present application;
Fig. 15 is a schematic structural diagram of an image processing apparatus provided by the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 shows a hardware system suitable for use in the electronic device of the application.
The method provided by the embodiment of the application can be applied to various electronic devices capable of performing image processing, such as mobile phones, tablet computers, wearable devices, notebook computers, netbooks, and personal digital assistants (PDA); the embodiment of the application does not limit the specific type of the electronic device.
Fig. 1 shows a schematic configuration of an electronic device 100. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural network processor (NPU), etc. The different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 100. The external memory card communicates with the processor 110 through an external memory interface 120 to implement data storage functions. For example, files such as music, video, etc. are stored in an external memory card.
The internal memory 121 may be used to store computer executable program code including instructions. The processor 110 executes various functional applications of the electronic device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a storage program area and a storage data area. The storage program area may store an application program (such as a sound playing function, an image playing function, etc.) required for at least one function of the operating system, etc. The storage data area may store data created during use of the electronic device 100 (e.g., audio data, phonebook, etc.), and so on. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The pressure sensor 180A is of various types, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may be a capacitive pressure sensor comprising at least two parallel plates with conductive material. The capacitance between the electrodes changes when a force is applied to the pressure sensor 180A. The electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic apparatus 100 detects the touch operation intensity according to the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions. For example: and executing an instruction for checking the short message when the touch operation with the touch operation intensity smaller than the first pressure threshold acts on the short message application icon. And executing an instruction for newly creating the short message when the touch operation with the touch operation intensity being greater than or equal to the first pressure threshold acts on the short message application icon.
The touch sensor 180K, also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is for detecting a touch operation acting thereon or thereabout. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to touch operations may be provided through the display 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a different location than the display 194.
The software system of the electronic device 100 may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In the embodiment of the application, taking an Android system with a layered architecture as an example, a software structure of the electronic device 100 is illustrated.
For the software architecture of the electronic device 100, the layered architecture divides the software into several layers, each with clear roles and divisions of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system is divided into four layers, as shown in FIG. 2, which are, from top to bottom, the application layer, the application framework layer, the Android runtime and system libraries, and the kernel layer.
The application layer may include a series of application packages. The application layer may include applications (APPs) such as camera, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, wallpaper, gallery, and settings. One or more application programs, such as wallpaper, gallery, and settings, can be used to execute the image processing method provided by the embodiment of the application. The gallery may also be referred to as an album, a media gallery, or the like.
The application framework layer provides an application programming interface (application programming interface, API) and programming framework for the application of the application layer. The application framework layer includes a number of predefined functions.
The application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, a data integration module, and the like.
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, determine whether a status bar exists, lock the screen, capture screenshots, and the like. The content provider is used to store and retrieve data and make such data accessible to applications. The view system includes visual controls, such as controls for displaying text and controls for displaying pictures. The view system may be used to build applications. A display interface may consist of one or more views. For example, a display interface including a short-message notification icon may include a view displaying text and a view displaying a picture. The telephony manager is used to provide the communication functions of the electronic device 100. The resource manager provides various resources to applications. The notification manager allows an application to display notification information in the status bar and can be used to convey notification-type messages, which disappear automatically after a short stay without requiring user interaction.
The Android runtime includes core libraries and a virtual machine. The Android runtime is responsible for scheduling and management of the Android system.
The core libraries consist of two parts: one part is the function libraries that the Java language needs to call, and the other part is the core libraries of Android.
The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media library (media library), three-dimensional graphics processing library, 2D graphics engine, etc.
The surface manager is used to manage the display subsystem and provides fusion of 2D and 3D layers for multiple applications. The media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files and the like. The media library may support a variety of audio and video encoding formats. The three-dimensional graphics processing library is used to implement three-dimensional graphics drawing, image rendering, synthesis, layer processing, and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer (kernel) is the layer between hardware and software. The kernel layer may include a display driver, a camera driver, an audio driver, a sensor driver, and the like.
With the increasing functions of electronic devices, people can collect images by using the photographing function of the electronic devices, and can also receive images by using the communication function of the electronic devices.
Generally, a user stores a large number of pictures in the gallery of a mobile phone and often wishes to select a favorite picture, or a manually cropped part of one, to use as the screen wallpaper. However, such pictures can only serve as background pictures, i.e., static wallpaper, and lack a dynamic display effect.
The natural world is always in motion and even scenes that appear to be stationary contain subtle oscillations, caused by wind, currents, respiration or other natural rhythms. Motion is one of the most attractive visual signals to which humans are particularly sensitive. If the captured image does not move (or does not move very much), it is often perceived as unnatural or unrealistic. Static wallpaper cannot meet the demands of people.
A dynamic image can be used as wallpaper. Producing a dynamic image requires fusing multiple modalities, including image, sound, effect rendering and artificial intelligence, and the processing is complex.
Artificial intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling the machines to perceive, reason, and make decisions.
Machine learning is an important branch of artificial intelligence, and deep learning is an important branch of machine learning. Deep learning refers to using a multi-layer neural network structure to learn, from big data, representations of things in the real world (e.g., objects in an image, sounds in audio, etc.) that can be used directly for computer calculation. In the field of image processing, deep learning has achieved excellent results in problems such as object detection, image generation, and image segmentation.
Machine learning can be applied to the generation of dynamic images. A moving image may be understood as a video including a plurality of images but not including sound.
Various sounds exist in nature. If only the display of the moving image is performed, the user experience is poor.
To match audio to a moving picture, the moving picture can be dubbed, which enhances emotional expression and makes the displayed content more interactive and attractive. However, while the moving picture is being displayed, the matched sound may not be consistent with the content of the images, which affects the user experience.
In order to solve the above problems, the present application provides an image processing method and an electronic apparatus.
The image processing method provided by the embodiment of the application is described in detail below with reference to fig. 3. The main execution body of the method provided by the application can be an electronic device or a software/hardware module capable of performing image processing in the electronic device, and for convenience of explanation, the electronic device is taken as an example in the following embodiment.
Fig. 3 is a schematic flowchart of an image processing method provided by an embodiment of the present application. The method may include steps S310 to S320, which are described in detail below, respectively.
Step S310, a to-be-processed image is acquired.
The method for obtaining the image to be processed may be reading the image to be processed in the memory, or receiving the image to be processed sent by other electronic devices, etc. The image to be processed may also be an image selected by the user in the album.
Step S320, according to the image to be processed, generating N target images in the target video and N target audios in the target video, wherein the N target images are in one-to-one correspondence with the N target audios, N is an integer greater than 1, and each target audio is used for playing under the condition that the target image corresponding to the target audio is displayed.
That is, a plurality of target images and target audio corresponding to each target image may be generated from the image to be processed, and the target video may include the plurality of target images and the target audio corresponding to each target image. The target audio corresponding to each target image may also be understood as audio matching the target image.
And generating a plurality of target images according to the image to be processed, and generating audio corresponding to each target image, so that the content of the played target audio and the content of the displayed target image are more coordinated, and the user experience is improved.
In step S320, the image to be processed may be processed by an image processing system to obtain the N target images and the N target audios. However, generating too many target images from a single image may cause the target images to drift or diverge, become blurred, and have poor image quality, degrading the user experience.
In the case where the duration of the target video is long and the number of target images is large, in order to give the generated target images higher definition and image quality, the image processing system may be used to sequentially process a plurality of first images to obtain at least one second image corresponding to each first image and the audio corresponding to each second image. The plurality of target images include the at least one second image corresponding to each first image. The audio corresponding to each second image is the target audio corresponding to that second image. The at least one second image corresponding to each first image is an image after that first image in the target video.
The 1st first image processed by the image processing system may be the image to be processed. The i-th first image processed by the image processing system may be an image among the at least one second image corresponding to the (i-1)-th first image processed by the image processing system, where i is a positive integer greater than 1. Illustratively, the i-th first image may be, among the at least one second image corresponding to the (i-1)-th first image, the image that is last in the target video.
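For illustration only, the following Python sketch shows the chained processing described above; the function and variable names (e.g., process, num_rounds) are assumptions for this sketch and are not part of the application.

```python
# Minimal sketch of the chained processing described above, assuming `process` stands
# for the trained image processing system; all names here are illustrative only.

def generate_target_video(image_to_be_processed, process, num_rounds):
    target_images, target_audios = [], []
    first_image = image_to_be_processed              # the 1st first image is the image to be processed
    for _ in range(num_rounds):
        second_images, audios = process(first_image)  # at least one second image plus its audio
        target_images.extend(second_images)
        target_audios.extend(audios)
        first_image = second_images[-1]               # the i-th first image is taken from the
                                                      # (i-1)-th round's second images
    return target_images, target_audios
```

Chaining in this way conditions each round only on the most recently generated image, which is what keeps the generated images from drifting or diverging as the target video grows longer.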
The image processing system includes a trained neural network model.
For example, in a case where the duration of the target video is greater than or equal to the preset duration threshold, the image processing system may sequentially process the plurality of first images.
When a first image corresponds to a plurality of second images, the processing of the first image by the image processing system may be understood as the image processing system processing the first image together with time information, where different time information indicates different time intervals from the first image. The image processing system processes the first image and a given piece of time information to obtain the second image, corresponding to the first image, whose time interval from the first image is the interval represented by that time information.
The differences between the time intervals represented by different pieces of time information may be integer multiples of a preset duration, or may be other values. The preset duration can be understood as the time interval between two adjacent target images. That is, the time intervals between adjacent target images may be equal or unequal.
For example, when the differences between the time intervals represented by the time information are each an integer multiple of the preset duration, the playing duration of each target audio may be the preset duration, and within that duration the sound in the target audio may change.
When the differences between the time intervals indicated by the time information are not determined according to the preset duration, the playing duration of each target audio may be determined according to the time interval indicated by the time information, and the sound of the target audio may remain unchanged.
In the case where each first image corresponds to exactly one second image, the second image may be the image that is a preset duration after the first image in the target video. That is, the time intervals between two adjacent target images may be equal.
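As a rough illustration of conditioning on time information, the sketch below assumes a hypothetical interface process(first_image, dt) in which dt is the time interval represented by the time information; the preset duration value is an arbitrary example.

```python
# Sketch of obtaining several second images from one first image by varying the time
# information; `process` and `preset_duration` are illustrative assumptions.

preset_duration = 0.125   # assumed spacing between adjacent target images, in seconds

def expand_first_image(first_image, process, num_second_images):
    results = []
    for k in range(1, num_second_images + 1):
        dt = k * preset_duration                      # time interval represented by the time information
        second_image, audio = process(first_image, dt)
        results.append((dt, second_image, audio))
    return results
```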
The image processing system may include a feature prediction model, an image generation model, and an audio generation model.
The feature prediction model is used for processing the first image to obtain at least one predicted feature corresponding to at least one second moment. Each second moment is a moment after a first moment corresponding to the first image in the target video.
The image generation model is used for processing the prediction characteristic corresponding to each second moment to generate a second image corresponding to the second moment.
The audio generation model is used for respectively processing the prediction features corresponding to each second moment to generate target audio corresponding to the second moment.
The target audio whose second moment is the same as that of a given second image may be understood as the target audio corresponding to that second image.
The image generation model and the audio generation model may be generative models.
The first image is processed to obtain the prediction feature corresponding to each second moment, and the target audio and the second image corresponding to that second moment are both generated from the same prediction feature, so that the correlation between the target audio and the second image is higher, that is, their content is better coordinated, and the user experience is improved.
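The pipeline described above can be sketched as follows; the three callables are placeholders for the trained feature prediction, image generation and audio generation models, not the application's actual interfaces.

```python
# Sketch of the three-model pipeline: one predicted feature per second moment is decoded
# by both the image generation model and the audio generation model, so the image and
# audio for the same moment come from the same feature.

def process_first_image(first_image, feature_predictor, image_generator, audio_generator):
    predicted_features = feature_predictor(first_image)      # one feature per second moment
    second_images = [image_generator(f) for f in predicted_features]
    target_audios = [audio_generator(f) for f in predicted_features]
    return second_images, target_audios
```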
Illustratively, the feature prediction model may include a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model, and an adjustment model.
The motion displacement field prediction model may also be referred to as a random motion displacement field prediction model, and is configured to process the first image to obtain a motion displacement field corresponding to each second moment, where the motion displacement field corresponding to each second moment represents a displacement of a plurality of pixels in the first image at the second moment relative to the first moment.
The motion feature extraction model is used for extracting features of at least one motion displacement field respectively to obtain motion features corresponding to each second moment.
The motion feature extraction model may be a convolutional neural network. The motion features may include an output of a last layer of the motion feature extraction model and may also include an output of multiple hidden layers of the motion feature extraction model.
The image feature extraction model is used for extracting features of the first image to obtain image features.
The image feature extraction model may be a convolutional neural network. The image features may include the output of the last layer of the image feature extraction model and may also include the outputs of a plurality of hidden layers of the image feature extraction model.
The adjustment model is used for adjusting the image characteristics according to the motion characteristics corresponding to each second moment so as to obtain the prediction characteristics corresponding to the second moment.
The adjustment model may be an activation function, for example, the softmax activation function.
The motion displacement field used for representing the displacement of the plurality of pixels in the first image between the first moment and the second moment is predicted, and the image characteristics of the first image are adjusted according to the motion characteristics of the motion displacement field, so that the predicted characteristics are more accurate.
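A minimal sketch of this feature prediction path is given below, assuming tensor-valued features of matching shape and using a softmax-based modulation as one possible form of the adjustment model; the shapes and model callables are assumptions.

```python
import torch

# Sketch of adjusting the image features with per-moment motion features; the
# displacement predictor, motion encoder and image encoder stand in for trained models.

def predict_features(first_image, displacement_predictor, motion_encoder, image_encoder):
    image_feat = image_encoder(first_image)                          # e.g. (C, H, W)
    predicted = []
    for displacement_field in displacement_predictor(first_image):   # one field per second moment
        motion_feat = motion_encoder(displacement_field)             # e.g. (C, H, W)
        weights = torch.softmax(motion_feat, dim=0)                  # adjustment via an activation function
        predicted.append(weights * image_feat)                       # adjusted (predicted) features
    return predicted
```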
The motion displacement field prediction model may include a first transformation module, a motion texture prediction model, and a second transformation module.
The first transformation module is used for carrying out Fourier transformation on the first image so as to obtain image frequency domain data. The image frequency domain data may be understood as a frequency domain representation of the first image.
The motion texture prediction model may also be referred to as a random motion texture prediction model for processing image frequency domain data to generate a motion texture. The motion texture prediction model may be a generation model, for example, a diffusion model or an implicit diffusion model (latent diffusion models, LDM).
The second transformation module is used for performing inverse Fourier transformation on the motion texture to obtain a motion displacement field.
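The transform-predict-inverse-transform path can be sketched as follows; the motion texture prediction model is a placeholder, and the use of NumPy FFT routines is an assumption for illustration.

```python
import numpy as np

# Sketch of the motion displacement field prediction path: Fourier-transform the first
# image, predict a motion texture in the frequency domain, then inverse-transform it
# into a per-pixel displacement field for one second moment.

def predict_displacement_field(first_image, motion_texture_model):
    image_freq = np.fft.fft2(first_image, axes=(0, 1))             # first transformation module
    motion_texture = motion_texture_model(image_freq)              # frequency-domain motion texture
    displacement = np.fft.ifft2(motion_texture, axes=(0, 1)).real  # second transformation module
    return displacement
```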
In the case where the motion texture prediction model is an LDM, the motion texture prediction model includes a compression model and a motion texture generation model.
The compression model is used for compressing the image frequency domain data to obtain image compression data. The compression model may be an encoder, for example, the encoder of a variational auto-encoder (VAE).
The motion texture generation model is used to process the image compression data to generate a motion texture. The motion texture generation model may be a diffusion model.
Illustratively, the motion texture generation model may include a diffusion model and a decompression model. The diffusion model can be used for performing denoising processing on compressed frequency domain noise data a plurality of times according to the image compression data, so as to obtain a compressed motion texture. The decompression model may be used to decompress the compressed motion texture to obtain the motion texture. Denoising can be understood as the removal of Gaussian noise.
The compressed frequency domain noise data may be obtained by compressing the frequency domain noise data by a compression model, or may be preset.
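As an illustrative sketch only (not taken from the embodiment itself), the following Python code outlines the data flow described above: Fourier transform, compression, iterative denoising conditioned on the image compression data, decompression, and inverse Fourier transform. The module definitions, channel counts, and number of denoising steps are assumptions introduced purely for illustration.

```python
import torch
import torch.nn as nn

class TinyEncoder(nn.Module):          # stands in for the compression model
    def __init__(self, in_ch, latent_ch):
        super().__init__()
        self.net = nn.Conv2d(in_ch, latent_ch, kernel_size=3, stride=2, padding=1)
    def forward(self, x):
        return self.net(x)

class TinyDecoder(nn.Module):          # stands in for the decompression model
    def __init__(self, latent_ch, out_ch):
        super().__init__()
        self.net = nn.ConvTranspose2d(latent_ch, out_ch, kernel_size=4, stride=2, padding=1)
    def forward(self, z):
        return self.net(z)

class TinyDenoiser(nn.Module):         # stands in for the diffusion model
    def __init__(self, latent_ch):
        super().__init__()
        self.net = nn.Conv2d(2 * latent_ch, latent_ch, kernel_size=3, padding=1)
    def forward(self, noisy, cond):
        # predicts a slightly less noisy latent, conditioned on the compressed image data
        return self.net(torch.cat([noisy, cond], dim=1))

def predict_displacement_field(image, encoder, denoiser, decoder, steps=4):
    """image: (B, 3, H, W) -> motion displacement field (B, 2, H, W)."""
    freq = torch.fft.fft2(image)                          # Fourier transform of the first image
    freq_ri = torch.cat([freq.real, freq.imag], dim=1)    # image frequency domain data (real/imag stacked)
    cond = encoder(freq_ri)                               # image compression data
    z = torch.randn_like(cond)                            # compressed frequency domain noise data
    for _ in range(steps):                                # multiple denoising passes
        z = denoiser(z, cond)
    texture_ri = decoder(z)                               # decompressed motion texture (real/imag stacked)
    real, imag = texture_ri.chunk(2, dim=1)
    texture = torch.complex(real, imag)
    return torch.fft.ifft2(texture).real                  # inverse Fourier transform -> displacement field

enc, den, dec = TinyEncoder(6, 8), TinyDenoiser(8), TinyDecoder(8, 4)
field = predict_displacement_field(torch.rand(1, 3, 64, 64), enc, den, dec)   # (1, 2, 64, 64)
```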
The target subject indicated by the user in the image to be processed may also be acquired before proceeding to step S320.
The target subject area information may be determined according to the target subject area indicated by the user. The target subject area information indicates a target subject area. A target subject indicated by a user is recorded in a target subject area in the image to be processed.
The image processing system is used for processing the first image and the target main body area information to obtain at least one second image corresponding to the image to be processed and target audio corresponding to each second image.
The content recorded in the other area than the target subject area in each second image is the same as the content recorded in the other area in the first image.
That is, in this other region, the pixel values of the second image are the same as those of the first image at the same pixel positions. The portion outside the target subject region where the target subject is located does not change in the second image compared with the first image. Therefore, the motion recorded in the target video follows the user's indication, the user's participation in video generation is increased, and the user experience is improved.
The pixel value of an image may be a color value, and the pixel value may be a long integer representing a color. For example, the pixel value may be a color value expressed in red, green and blue (RGB); for each color component, a smaller value indicates lower luminance and a larger value indicates higher luminance. For a gray image, the pixel value may be a gray value.
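As a small illustration (not part of the embodiment), an RGB pixel value can be treated as a single integer by packing each 8-bit color component into one value; the helper names below are assumptions.

```python
def pack_rgb(r: int, g: int, b: int) -> int:
    # pack three 8-bit components into one integer pixel value
    return (r << 16) | (g << 8) | b

def unpack_rgb(value: int) -> tuple[int, int, int]:
    # recover the red, green and blue components from the packed value
    return (value >> 16) & 0xFF, (value >> 8) & 0xFF, value & 0xFF

assert unpack_rgb(pack_rgb(255, 128, 0)) == (255, 128, 0)
```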
In the image processing system, the feature prediction model is used for processing the first image and the target main body area information to obtain the prediction feature corresponding to each second moment. Thus, the influence of the target subject indicated by the user is embodied in the predicted feature. The image generation model and the audio generation model may be subsequently processed based on predicted features that embody the target subject impact of the user indication.
Illustratively, a motion displacement field prediction model in the feature prediction model may be used to process the first image and the target subject region information to obtain a motion displacement field corresponding to each second moment. Thus, the influence of the target subject indicated by the user is embodied in the motion displacement field. The motion feature extraction model and the adjustment model may be subsequently processed based on a motion displacement field embodying the influence of the target subject indicated by the user.
In the processing process of the motion displacement field prediction model on the first image and the target main body region information, the first transformation module can perform Fourier transformation on the first image so as to obtain image frequency domain data. The motion texture prediction model is used for processing the image frequency domain data and the target main body region information to obtain motion textures corresponding to each second moment. The second transformation module is used for performing inverse Fourier transformation on the motion texture to obtain a motion displacement field.
In the motion texture prediction model, a compression model may be used to compress image frequency domain data to obtain image compression data. The motion texture generation model may be used to process the image compression data and the target subject area information to generate a motion texture.
Or the compression model may be further configured to compress the target subject area information to obtain compressed area information. The motion texture generation model may be used to process the compressed region information and the image compression data to generate a motion texture.
After acquiring the target subject indicated by the user in the image to be processed, if the image processing system processes the first image and the target subject area information for each first image, the movement of the subject in the target video may be unreasonable. For example, in practice, movement of the target subject may result in movement of other subjects, which may be located outside of the target subject area.
In order to make the movement of the subject in the target video more reasonable, the image processing system may process the first image and the target subject area information in the case where the first image processed by the image processing system is the image to be processed; in the case where the first image processed by the image processing system is not the image to be processed, the image processing system may process only the first image, and the target subject area information may no longer be input to the image processing system. In this way, the target video can reflect both the motion of the target subject and the motion of other subjects affected by the target subject, so that the motion of the subjects in the target video is more reasonable and the user experience is improved.
In the case of acquiring the target subject indicated by the user in the image to be processed, the target movement trend of the target subject indicated by the user may also be acquired before step S320.
The target motion trend of the target subject may include one or more of a motion direction, a motion amplitude, a motion speed, and the like of the target subject.
The image processing system may be configured to process the first image, the target subject area information, and motion information representing the target motion trend, so as to obtain a target image corresponding to each of at least one second time and target audio corresponding to each second time, where, in the at least one target image corresponding to the at least one second time, the target subject moves according to the target motion trend represented by the motion information.
The image processing system generates the target video according to the target subject indicated by the user and the target motion trend of the target subject, and the motion of the target subject in the target video follows the target motion trend, which increases the user's participation in video generation and improves the user experience.
The target motion trend of the target subject indicated by the user is generally given with respect to the image to be processed. Since the motion of the target subject may be affected by various factors, the motion trend of the target subject in the target video may not remain the same as the target motion trend throughout.
In the case where the first image processed by the image processing system is the image to be processed, the image processing system may process the first image, the target subject area information, and the motion information; in the case where the first image processed by the image processing system is not the image to be processed, the image processing system may process only the first image, or the image processing system may process the first image and the target subject area information, and the motion information may no longer be input to the image processing system.
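The conditional use of inputs described above can be sketched as follows. This is a hedged illustration only: `image_processing_system`, `region_info`, and `motion_info` are assumed names, not identifiers defined by the embodiment.

```python
def run_one_step(image_processing_system, first_image, image_to_process,
                 region_info=None, motion_info=None):
    if first_image is image_to_process:
        # Only the image to be processed carries the user-indicated subject
        # region and motion trend; later first images are processed alone.
        return image_processing_system(first_image, region_info, motion_info)
    return image_processing_system(first_image)
```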
Illustratively, a motion displacement field prediction model in the feature prediction model may be used to process the first image, the target subject region information, and the motion information to obtain a motion displacement field corresponding to each second moment. Thus, the influence of the target motion trend of the target main body indicated by the user is reflected in the motion displacement field. The motion feature extraction model and the adjustment model may be processed later based on the motion displacement field.
In the processing process of the motion displacement field prediction model on the first image, the target main body region information and the motion information, the first transformation module can perform fourier transformation on the first image so as to obtain image frequency domain data. The motion texture prediction model is used for processing the image frequency domain data, the target main body region information and the motion information to obtain motion textures corresponding to each second moment. The second transformation module is used for performing inverse Fourier transformation on the motion texture to obtain a motion displacement field.
In the motion texture prediction model, a compression model may be used to compress image frequency domain data to obtain image compression data. The motion texture generation model may be used to process the image compression data, the target subject area information, and the motion information to generate a motion texture.
Or the compression model may be further configured to compress the target subject area information to obtain compressed area information. The motion texture generation model may be used to process the compressed region information, the image compression data, and the motion information to generate a motion texture.
Or the compression model can respectively compress the image frequency domain data, the target main body region information and the motion information, and the motion texture generation model can be used for processing the compressed image frequency domain data, the compressed target main body region information and the compressed motion information to generate the motion texture.
In this way, the target video can reflect that the target subject starts to move according to the target motion trend and that the motion of the other subjects is influenced by the target subject, so that the motion of the subjects in the target video is more reasonable and the user experience is improved.
According to the image processing method provided by the embodiment of the application, a plurality of target images are generated according to the image to be processed, and the audio corresponding to each target image is generated, so that the coordination between the content of the played target audio and the content of the displayed target image is higher, and the user experience is improved.
It should be understood that the above description is intended to aid those skilled in the art in understanding the embodiments of the present application, and is not intended to limit the embodiments of the present application to the specific values or particular scenarios illustrated. It will be apparent to those skilled in the art from the foregoing description that various equivalent modifications or variations can be made, and such modifications or variations are intended to be within the scope of the embodiments of the present application.
An image processing system used in the method shown in fig. 3 will be described below with reference to fig. 4.
Fig. 4 is a schematic block diagram of an image processing system according to an embodiment of the present application.
Image processing system 400 includes random motion displacement field prediction model 410, motion feature extraction model 420, image feature extraction model 430, adjustment model 440, image generation model 450, audio generation model 460.
The image processing system 400 may be configured to sequentially process the plurality of first images to obtain a plurality of second images and audio corresponding to each of the second images. The output of the image processing system 400 includes sound and images, and the output of the image processing system 400 can be understood to be multi-modal.
The processing of the plurality of first images by the image processing system 400 is performed sequentially. The first image used by the image processing system 400 for the first processing among the plurality of first images may be an image to be processed. The first image used by the image processing system 400 for the second and subsequent processes may be the second image obtained by the previous process. Video can be obtained according to the plurality of second images and the audio corresponding to each second image. The video may include a plurality of second images and audio corresponding to each of the second images. Among the plurality of second images, different second images correspond to different times. The video may be understood as video generated by the image processing system 400.
The video may also include, for example, an image to be processed. The image to be processed may be understood as the first frame image of the video.
The number of second images obtained each time the image processing system 400 processes a first image may be one or more.
In the case where the number of second images obtained by the image processing system 400 processing the first image is one, the second image is an image corresponding to a time subsequent to the time corresponding to the first image. This second image may be used as the first image for the next processing by the image processing system 400.
In the case where the number of second images obtained by the image processing system 400 processing the first image is plural, the plural second images are images corresponding to plural times after the time corresponding to the first image, respectively. The image with the latest corresponding moment in time in the plurality of second images may be used as the first image processed by the image processing system 400 next time. The image having the latest corresponding time among the plurality of second images may be understood as an image having the largest time interval between the corresponding time and the time corresponding to the first image.
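A minimal sketch of this autoregressive generation loop is given below, assuming a hypothetical callable that returns a list of (time, second image, audio) tuples; none of these names come from the embodiment.

```python
def generate_video(image_processing_system, image_to_process, passes=8):
    frames, audios = [image_to_process], []
    first_image = image_to_process
    for _ in range(passes):
        results = image_processing_system(first_image)    # list of (time, second_image, audio)
        results.sort(key=lambda item: item[0])            # order by corresponding time
        for _, second_image, audio in results:
            frames.append(second_image)
            audios.append(audio)
        first_image = results[-1][1]                      # latest second image feeds the next pass
    return frames, audios
```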
In the case where the first image is an image to be processed, the input of the image processing system 400 may further include target subject area information and/or motion information, and the like.
The target subject area information is used to represent a target subject area in the image to be processed. The target subject area may be determined from a target subject in the image to be processed indicated by the user. In the image to be processed, the target subject is located in the target subject region. The shape of the target subject area may be preset or may be determined according to the shape of the target subject. The shape of the target subject area may be regular or irregular.
The target subject region may also include a region around the target subject in the image to be processed, for example, the target subject region may include all or a part of a region in the image to be processed in which a distance from the region where the target subject is located is less than or equal to a preset distance.
The target subject area information may be represented by an image or other means. For example, the target subject region information may be a region image. The region image has the same size as the image to be processed. The region image may represent whether each pixel is located in the target subject region by a value of the pixel. Illustratively, the value of the pixel located in the target subject area may be 1, and the value of the pixel located outside the target subject area may be 0; or the value of the pixel point located in the target subject area is greater than or equal to a preset value, and the value of the pixel point located outside the target subject area may be less than the preset value. Different pixel points may be understood as different points, i.e. different positions.
In the case where the shape of the target subject area is a preset shape, the target subject area information may represent the target subject area by information such as position and size.
For example, if the preset shape is a square, the target subject area information may include the positions, in the image to be processed, of two vertices located on a diagonal of the target subject area; or the target subject area information may include the position, in the image to be processed, of the center of the target subject area, together with information representing the size of the target subject area, such as a length and a width, or a length and an aspect ratio. If the preset shape is a circle, the target subject area information may include the position, in the image to be processed, of the center of the target subject area, and information representing the radius of the target subject area.
On the image to be processed, a coordinate system can be established with the center or a certain vertex of the image to be processed as the origin. The position of a point in the image to be processed can then be represented by coordinates.
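For illustration only, the region-image form of the target subject area information described above can be built as a binary mask; the array shapes and the 0/1 values are assumptions.

```python
import numpy as np

def square_region_image(height, width, top_left, bottom_right):
    # pixels inside the square target subject area are 1, all others 0
    mask = np.zeros((height, width), dtype=np.uint8)
    (r0, c0), (r1, c1) = top_left, bottom_right      # two vertices on a diagonal
    mask[r0:r1 + 1, c0:c1 + 1] = 1
    return mask

def circular_region_image(height, width, center, radius):
    # circular target subject area given by its center and radius
    rows, cols = np.ogrid[:height, :width]
    cr, cc = center
    return ((rows - cr) ** 2 + (cols - cc) ** 2 <= radius ** 2).astype(np.uint8)
```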
The motion information may be used to represent a target motion trend of the target subject indicated by the user, such as one or more of a motion amplitude, a motion direction, a motion speed, and the like.
The first image may be denoted as I 0. The random motion displacement field prediction model 410 is configured to process the first image I 0 to obtain at least one random motion displacement field of the first image I 0, where the at least one random motion displacement field of the first image I 0 corresponds to at least one time instant after a time instant corresponding to the first image I 0. The moments corresponding to different random motion displacement fields are different.
The displacement field refers to the displacement of a plurality of points in space. The displacement of each point includes a magnitude and a direction. The displacement of each point may be represented by a vector.
The random motion displacement field of the first image I 0 at a certain time represents the displacement, at that time, of the positions of a plurality of pixels relative to their positions in the first image I 0. The plurality of pixels represented by the random motion displacement field may be all or part of the pixels in the first image I 0.
The motion feature extraction model 420 is used for extracting features of the random motion displacement field D of the first image I 0, so as to obtain motion features. The motion characteristics may be understood as characteristics of the random motion displacement field D of the first image I 0.
The image feature extraction model 430 is used for extracting features of the first image I 0 to obtain first features. The first feature may be understood as a feature of the first image I 0, which may also be referred to as an image feature.
The adjustment model 440 is used to adjust the first feature according to the motion feature to obtain the second feature. The second feature may also be referred to as a predictive feature.
The image generation model 450 is used to process the second feature to generate a second image.
The audio generation model 460 is configured to process the second feature to generate audio corresponding to the second image.
The random motion displacement field of the first image I 0 may have the same size as the first image I 0.
Random motion displacement field prediction model 410, motion feature extraction model 420, image feature extraction model 430, image generation model 450, audio generation model 460 are all deep neural networks (deep neural network, DNN).
The neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes x_s and an intercept of 1 as inputs, and the output of the arithmetic unit may be:

$$h_{W,b}(x)=f\left(W^{T}x\right)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$$

where s = 1, 2, … n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many of the above single neural units together, i.e., the output of one neural unit may be the input of another. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be an area composed of several neural units.
DNNs may also be referred to as multi-layer neural networks, which may be understood as neural networks having multiple hidden layers. The DNNs are divided according to the positions of different layers, and the neural networks inside the DNNs can be divided into three types: input layer, hidden layer, output layer. Typically the first layer is the input layer, the last layer is the output layer, and the intermediate layers are all hidden layers. The layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
While DNN appears to be complex, the work of each layer is actually not complex. The operation of each layer in the deep neural network can be described by the mathematical expression $\vec{y}=a(W\vec{x}+\vec{b})$. From a physical point of view, the work of each layer in the deep neural network can be understood as completing a transformation from the input space to the output space (that is, from the row space to the column space of the matrix) through five operations on the input space (the set of input vectors): 1. raising/lowering the dimension; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". The operations 1, 2 and 3 are completed by $W\vec{x}$, the operation 4 is completed by $+\vec{b}$, and the operation 5 is implemented by $a(\cdot)$. The word "space" is used here because the object being classified is not a single thing but a class of things, and space refers to the collection of all individuals of that class of things. $\vec{W}$ is a weight vector, and each value in the vector represents the weight value of a neuron in that layer of the neural network. The vector $\vec{W}$ determines the spatial transformation from the input space to the output space described above, i.e., the weight $\vec{W}$ of each layer controls how the space is transformed. The purpose of training the deep neural network is to finally obtain the weight matrices of all layers of the trained neural network (weight matrices formed by the vectors $W$ of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrices.
Thus, the work of each layer in DNN can be expressed by the linear relational expression $\vec{y}=\alpha(W\vec{x}+\vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since the number of DNN layers is large, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W_{24}^{3}$. The superscript 3 represents the layer in which the coefficient $W$ is located, and the subscript corresponds to the output index 2 of the third layer and the input index 4 of the second layer.
In summary, the coefficient from the kth neuron of the (L−1)th layer to the jth neuron of the Lth layer is defined as $W_{jk}^{L}$.
It should be noted that the input layer is devoid of W parameters. In deep neural networks, more hidden layers make the network more capable of characterizing complex situations in the real world. Theoretically, the more parameters the higher the model complexity, the greater the "capacity", meaning that it can accomplish more complex learning tasks. The process of training the deep neural network, i.e. learning the weight matrix, has the final objective of obtaining a weight matrix (a weight matrix formed by a number of layers of vectors W) for all layers of the trained deep neural network.
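A compact NumPy sketch of the per-layer computation $\vec{y}=a(W\vec{x}+\vec{b})$ described above is given below; the layer sizes, random weights, and sigmoid activation are illustrative assumptions only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 2]                      # input layer, two hidden layers, output layer
weights = [rng.standard_normal((layer_sizes[i + 1], layer_sizes[i]))
           for i in range(len(layer_sizes) - 1)]    # weights[L-1][j, k] plays the role of W^L_{jk}
biases = [rng.standard_normal(layer_sizes[i + 1]) for i in range(len(layer_sizes) - 1)]

x = rng.standard_normal(layer_sizes[0])         # input vector (the input layer itself has no W)
for W, b in zip(weights, biases):
    x = sigmoid(W @ x + b)                      # each layer: activation of an affine transform
print(x)                                        # output vector of the output layer
```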
The random motion displacement field prediction model 410 may be a neural network model, for example, a convolutional neural network.
A convolutional neural network is a deep neural network with a convolutional structure and is a deep learning (deep learning) architecture. The deep learning architecture refers to performing multiple levels of learning at different abstraction levels through machine learning algorithms. As a deep learning architecture, the CNN is a feed-forward artificial neural network in which each neuron can respond to the data input into it.
The convolutional neural network comprises a feature extractor consisting of a convolutional layer and a sub-sampling layer. The feature extractor can be seen as a filter and the convolution process can be seen as a convolution with an input data or convolution feature plane (feature map) using a trainable filter. The convolution layer refers to a neuron layer in the convolution neural network, which performs convolution processing on an input signal. In the convolutional layer of the convolutional neural network, one neuron may be connected with only a part of adjacent layer neurons. A convolutional layer typically contains a number of feature planes, each of which may be composed of a number of neural elements arranged in a rectangular pattern. Neural elements of the same feature plane share weights, where the shared weights are convolution kernels. Taking the input data as an image as an example, the sharing weight can be understood as the way of extracting the image information is independent of the position. The underlying principle in this is: the statistics of a certain part of the image are the same as other parts. I.e. meaning that the image information learned in one part can also be used in another part. So we can use the same learned image information for all locations on the image. In the same convolution layer, a plurality of convolution kernels may be used to extract different image information, and in general, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network. In addition, the direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while reducing the risk of overfitting.
The convolutional neural network may include an input layer, a convolutional layer, and a neural network layer. The convolutional neural network may also include a pooling layer.
The convolution layer may comprise a number of convolution operators, also called kernels, which act as filters that extract specific information from the input (for example, from the input speech or semantic information in natural language processing). A convolution operator is essentially a weight matrix, which is usually predefined.
The weight values in the weight matrixes are required to be obtained through a large amount of training in practical application, and each weight matrix formed by the weight values obtained through training can extract information from input data, so that the convolutional neural network is helped to conduct correct prediction.
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolution layer. One convolution layer may be followed by one pooling layer, or multiple convolution layers may be followed by one or more pooling layers.
The only purpose of the pooling layer during data processing is to reduce the spatial size of the data. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling input data features to obtain smaller sized data features.
After the processing by the convolutional layers/pooling layers, the convolutional neural network is not yet able to output the required output information, because, as mentioned above, the convolutional layers/pooling layers only extract features and reduce the parameters brought by the input data. To generate the final output information (the required class information or other related information), the convolutional neural network needs to use the neural network layer to generate the output of one class or a group of the required number of classes. Therefore, the neural network layer may include multiple hidden layers, and the parameters included in the multiple hidden layers may be pre-trained based on training data related to a specific task type; the task type may include, for example, speech or semantic recognition, classification, or generation.
After the hidden layers in the neural network layer, the last layer of the whole convolutional neural network is the output layer. The output layer has a loss function similar to categorical cross-entropy and is specifically used for calculating a prediction error. Once the forward propagation of the whole convolutional neural network is completed, back propagation starts to update the weight values and biases of the layers, so as to reduce the loss of the convolutional neural network and the error between the result output by the convolutional neural network through the output layer and the ideal result.
It should be appreciated that the above description of a convolutional neural network is merely an example of one type of convolutional neural network, and that convolutional neural networks may exist in the form of other network models in particular applications. For example, the plurality of convolution layers and the pooling layer are parallel, and the features extracted respectively are input to the full neural network layer for processing.
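The convolutional layout described above (input, convolution, pooling, then fully connected neural network layers) can be sketched with standard PyTorch modules; the channel counts, input size, and class count are arbitrary assumptions for illustration.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),   # convolution layer (8 feature planes)
    nn.ReLU(),
    nn.MaxPool2d(2),                             # pooling layer reduces spatial size
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 32),                   # fully connected neural network layers
    nn.ReLU(),
    nn.Linear(32, 10),                           # e.g. 10 output classes
)

logits = cnn(torch.rand(1, 3, 32, 32))           # 32x32 input -> (1, 10) predictions
```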
Random motion displacement field prediction model 410 may be a generative model.
The generative model is essentially a set of probability distributions. As shown in (a) of FIG. 5, the training data in the training data set can be understood as independent and identically distributed random samples drawn from a distribution $p_{data}$. The right side is its generative model. The generative model determines a distribution $p_{\theta}$ that is closest to $p_{data}$, and then samples from the distribution $p_{\theta}$, so that new data can be obtained continuously.
The structure of the generative model may be a generative adversarial network (generative adversarial networks, GAN), a variational autoencoder (variational autoencoder, VAE), a flow-based model (flow-based model), a diffusion model, or the like.
Fig. 5 (b) shows the principle of GAN. GAN includes at least two modules: one module is a generative model (GENERATIVE MODEL or generator) and the other module is a discriminant model (DISCRIMINATIVE MODEL or discriminator) through which the two modules learn to game each other, resulting in a better output. The basic principle of GAN is as follows: taking GAN for generating a picture as an example, assuming that there are two networks, generating a model G and a discrimination model D, wherein G is a network for generating a picture, which receives a random noise z, and generates a picture through the noise, and is denoted as G (z); d is a discrimination network for discriminating whether a picture is "true". In an ideal state, G may generate enough "spurious" pictures G (z), while D has difficulty in determining whether the G generated pictures are true or not, i.e., D (G (z))=0.5. This results in an excellent generation model G that can be used to generate pictures or other types of data.
Fig. 5 (c) shows the principle of the VAE. VAEs are deep generative models based on the autoencoder structure. Autoencoders are widely used in fields such as dimensionality reduction and feature extraction. The VAE includes an encoder and a decoder; the encoding process of the encoder maps the data to latent variables in a low-dimensional space, and the decoder then performs a decoding process to restore the latent variables to reconstructed data. In order to give the decoding process a generative capability rather than a one-to-one mapping relationship, the latent variable is a random variable subject to a normal distribution.
Fig. 5 (d) shows the principle of the flow-based generative model. The flow-based generative model may also be referred to as a flow model. The flow model comprises a series of invertible transformations. The data x is processed using this series of invertible transformations, which may be denoted f(x), to obtain transformed data z. Reconstructed data can be obtained by applying the inverse transformation f −1(z) of f(x) to the transformed data z. The flow model enables the model to learn the data distribution more precisely, and its loss function is a negative log-likelihood function. The training process of the flow model can be understood as learning the invertible transformation f(x) required to transform the complex distribution of the data x into the simple distribution of the data z. The generation process of the flow model obtains data z by random sampling and converts the data z into reconstructed data using the inverse transformation f −1(z) of f(x).
Fig. 5 (e) shows the principle of the diffusion model. The diffusion model is essentially a Markov chain architecture, except that its training process uses the back-propagation algorithm of deep learning. The idea of the diffusion model is derived from non-equilibrium thermodynamics, and the theoretical basis of the algorithm is to train a parameterized Markov chain by variational inference. One key property of a Markov chain is stationarity: if a probability distribution changes over time, then under the action of the Markov chain it tends toward a certain stationary distribution, and the longer the time, the more stationary the distribution. The process also requires the "memoryless" property, i.e., the probability distribution of the next state can be determined only by the current state and, in the time series, is independent of the preceding events.
The framework of the diffusion model may employ a denoising diffusion probability model (denoising diffusion probabilistic model, DDPM), a score-based generation model (score-based Generative Models, SGM), a random differential equation (stochastic differential equation, SDE), or the like.
The model structure of the diffusion model may be a convolutional neural network, such as a U-shaped network (U-NET), a gradient-denoising-based score model (noise conditional score networks, NCSN), an upgraded version of NCSN (NCSN++), or the like.
The use of diffusion models involves a forward diffusion process (forward diffusion process) and a reverse diffusion process (reverse diffusion process).
The forward diffusion process may also be referred to as the forward process or the diffusion process, and corresponds to the diffusion process in molecular thermodynamics. The backward diffusion process may also be referred to as the reverse diffusion process, the reverse process, or the denoising process. During forward diffusion, a Markov chain of samples is generated by slowly adding noise. During backward diffusion, the Markov chain of samples is generated by slowly removing the noise.
A Markov chain comprises a series of states and a series of transition probabilities, where a state refers to data containing a certain noise level, and a transition probability refers to the probability of changing from the current state to the next state, implemented by a transition matrix.
As shown in (e) of fig. 5, for data x 0, in the t-th step of the T-step forward diffusion process, a small amount of Gaussian noise is added to data x t-1 to obtain data x t, where data x t represents the data obtained after step t, t = 1, 2, …, T, and T is a positive integer. That is, starting from data x 0, after T steps the data obtained at each step are x 1, x 2, …, x T, respectively. Data x T is pure noise.
The noise added at different step indexes t may be different, and the variation of the added noise with the step index t may be referred to as a noise schedule, for example, a linear schedule, a square schedule, a variance schedule, or the like. The schedule may be preset.
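A sketch of a linear schedule and the cumulative products it induces is shown below; the endpoint values are common DDPM defaults assumed here for illustration, not values fixed by the embodiment.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # noise level added at each step index t
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)    # cumulative products used to noise x_0 in one shot

def q_sample(x0, t, noise):
    """Forward diffusion sketch: produce x_t directly from x_0."""
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
```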
For the image data shown in fig. 5 (f), the forward diffusion process is to add noise to the image until the image becomes pure noise. From data x 0 to data x T is a Markov chain, which represents a random process that transitions from one state to another in a state space.
During forward diffusion, the step index t may represent the number of times noise is added.
Data x 0, x 1, …, x T may be images of the same size, or may be audio, frequency domain data, or other types of data.
Using the data x 0 to data x T obtained by the forward diffusion process, the diffusion model may be trained. The initial diffusion model processes the training noise data corresponding to each step index t, for t = 1, 2, …, T, to obtain training denoising data corresponding to each step index t, and the parameters of the initial diffusion model are adjusted according to the difference between the training denoising data corresponding to step index t and the label denoising data corresponding to step index t. The difference may be represented by a loss value. The diffusion model is the initial diffusion model after the parameter adjustment.
The training noise data corresponding to step index t = T may be pure noise data; in the case where t is other than T, the training noise data corresponding to step index t can be understood as data obtained by adding noise t times to noise-free data, and the label denoising data corresponding to step index t can be understood as data obtained by adding noise t−1 times to the noise-free data.
That is, the training noise data corresponding to step index t = T can be understood as data x T; in the case where t is other than T, the training noise data corresponding to step index t can be understood as data x t; and the label denoising data corresponding to step index t can be understood as data x t-1.
Alternatively, the training noise data corresponding to step index t = T may be pure noise data, and, in the case where t is other than T, the training noise data corresponding to step index t may be obtained by the initial diffusion model removing noise T−t times from the pure noise data.
The output of the initial diffusion model may be the training denoising data, i.e., the training denoising data may be the output of the initial diffusion model directly. Alternatively, the output of the initial diffusion model may be noise, i.e., the training denoising data may be obtained by removing the noise represented by the output of the initial diffusion model from the training noise data.
In the case where the training denoising data corresponding to step index t is obtained by removing the noise represented by the output of the initial diffusion model from the training noise data corresponding to step index t, adjusting the parameters of the initial diffusion model according to the difference between the training denoising data and the label denoising data corresponding to step index t may be implemented by adjusting the parameters of the initial diffusion model according to the difference between the output of the initial diffusion model and the noise that was added to the label denoising data to obtain the training noise data. The difference may be represented by a loss value.
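An illustrative training step for the second option above, in which the model output is interpreted as the added noise, is sketched below; `model`, the tensor shapes, and the schedule `alpha_bars` are assumptions, not the embodiment's definitions.

```python
import torch
import torch.nn.functional as F

def train_step(model, x0, alpha_bars, optimizer):
    t = torch.randint(0, len(alpha_bars), (x0.shape[0],))     # random step index per sample
    noise = torch.randn_like(x0)
    a_bar = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise    # training noise data
    pred_noise = model(x_t, t)                                # model output interpreted as the added noise
    loss = F.mse_loss(pred_noise, noise)                      # difference used to adjust the parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```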
The back diffusion process can be understood as the reasoning process of the diffusion model. In the back diffusion process, t can be understood as the number of remaining iterations plus 1.
During backward diffusion, starting from the noisy data x T, the noise in data x T is gradually removed, so that the data is restored to the original data x 0. Fig. 5 (e) shows an example in which the original data x 0 is an image. For the image data shown in fig. 5 (e), the backward diffusion process gradually denoises the pure noise image until a real image is obtained. The pure noise image may also be referred to as a Gaussian noise image or a noise image.
As shown in (g) of fig. 5, the trained diffusion model performs denoising on the input data based on the step index t to obtain output data. The input data can be understood as data x t, and the output data can be understood as data x t-1.
Both the forward diffusion process and the backward diffusion process can be represented by corresponding stochastic differential equations and ordinary differential equations. The optimization objective of the diffusion model (predicting the noise added at each step) can be understood as learning, for the current input, the gradient direction that is optimal with respect to the target data distribution. This matches the intuitive understanding: the noise added to the input moves the input away from its original data distribution, while learning the optimal gradient direction in data space is equivalent to walking in the direction that removes the noise. In the backward diffusion process, a deterministic sampling result can be obtained by using a deterministic ordinary differential equation; in the forward diffusion process, the final noised result can also be obtained by constructing the inverse of the ordinary differential equation of the backward diffusion process (in essence, given a deterministic path, the forward and backward diffusion processes are merely walking back and forth along that path). This conclusion makes diffusion-based generation highly controllable: there is no need to worry that the data obtained by the diffusion model from processing data x 0 will not be completely similar to data x 0, which makes a range of control possible.
The training process of the diffusion model can be simply understood as a process of gradually removing gaussian noise from pure noise data through neural network learning.
For a T-step diffusion model, each step is indexed by t. During forward diffusion, we start from a real image x 0, randomly generate some Gaussian noise at each step, and gradually add the generated noise to the input image; when T is sufficiently large, the resulting noisy image approximates a Gaussian noise image. For example, T = 1000 may be set for the DDPM model. Through the training process, the neural network learns the noise added from data x t-1 to data x t. Then, in the backward diffusion process, the final generated image is obtained by gradually removing noise starting from the noise data x T (the result of adding noise to the real image during training, and random noise during sampling). This means that each processing pass of the diffusion model is used to generate data x t-1 that is similar to the input data x t. Fundamentally, the working principle of the diffusion model is to destroy the data x 0 by continuously adding Gaussian noise, and then learn to recover the data by reversing this noising process. After training, data can be generated through the backward diffusion process by denoising randomly sampled noise multiple times with the diffusion model.
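The backward-diffusion loop for a noise-predicting model can be sketched with the standard DDPM update; the function and argument names are assumptions and the schedule must match the one used in training.

```python
import torch

@torch.no_grad()
def sample(model, shape, betas):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # start from pure Gaussian noise x_T
    for t in reversed(range(len(betas))):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = model(x, t_batch)                              # predicted noise at step t
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + betas[t].sqrt() * noise                   # remove a little noise, add fresh randomness
    return x                                                 # approximation of x_0
```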
In the backward diffusion process, in addition to the data x t, the input of the diffusion model may also include a condition. The condition may include information indicating the step index t, and may also include other information. This other information may be associated with data x 0, so that the data generated by the diffusion model is more consistent with this other information.
The diffusion model is in essence a hidden variable model (latent variable model) that is mapped to a hidden space (latent space) using a Markov chain (MC). Through the Markov chain, noise is gradually added to the data x 0 at each step t to obtain the posterior probability q(x 1:T | x 0), where x 1:T represents the data x 1 to x T, and x 0 represents the input data; the hidden space has the same dimensions as the input data.
In some embodiments, the random motion displacement field of the first image I 0 is predicted in the temporal domain.
The prediction of the random motion displacement field can be performed in the time domain in an autoregressive manner. Autoregression is a method of predicting the value to be output next based on the sequence of values at previous times.
The random motion displacement field prediction model 410 may process the first image to obtain a random motion displacement field for the first image I 0. The moment corresponding to the random motion displacement field of the first image I 0 is a moment after the moment corresponding to the first image.
In the case where the random motion displacement field prediction model 410 predicts the random motion displacement field of the first image I 0 in the time domain, the training data of the random motion displacement field prediction model 410 may include training samples and a tag random motion displacement field. Training samples in the training data of the random motion displacement field prediction model 410 include sample images.
And processing the training sample by using the initial random motion displacement field prediction model to obtain a training random motion displacement field. Based on the difference between the training random motion displacement field and the tag random motion displacement field, the parameters of the initial random motion displacement field prediction model are adjusted to obtain the random motion displacement field prediction model 410. The difference may be represented by a loss value. Random motion displacement field prediction model 410 may be understood as an initial random motion displacement field prediction model with parameters adjusted.
The sample image may be a frame of image in the training video, and the label random motion displacement field may be the random motion displacement field of the sample image at a training time, determined according to the training video. The training time is a time after the time corresponding to the sample image in the training video. The label random motion displacement field, i.e., the random motion displacement field of the sample image at the training time, represents the displacement of a plurality of pixels in the sample image at the training time relative to the time corresponding to the sample image.
Illustratively, the random motion displacement field prediction model 410 may also process the first image and the temporal information to obtain a random motion displacement field at a time represented by the temporal information. The time indicated by the time information may be a time after the time corresponding to the first image. The time corresponding to the random motion displacement field is the time indicated by the time information.
In the case where the input of the random motion displacement field prediction model 410 does not include time information, the time interval between the time corresponding to the random motion displacement field obtained by processing each first image and the time corresponding to that first image may be the same for different first images. That is, when the random motion displacement field prediction model 410 processes a first image, it may obtain the random motion displacement field corresponding to a time that is a preset time interval after the time corresponding to the first image.
Whereas in the case where the input of the random motion displacement field prediction model 410 includes temporal information, the temporal information may be understood as temporal embedding (embedding). In the case of a change in the temporal information, the random motion displacement field prediction model 410 may output a random motion displacement field corresponding to a moment represented by different temporal information. That is, by introducing the temporal information, the random motion displacement field at different times after the time of the first image can be predicted from the first image.
Where the input of the random motion displacement field prediction model 410 includes time information, the training samples in the training data of the random motion displacement field prediction model 410 may also include training time information. The training time information is used to represent training time.
Illustratively, the first image input into the random motion displacement field prediction model 410 may be the first image after adding noise. The noise added to the first image can be understood as noise added in the spatial domain. In the case where the first image in the input random motion displacement field prediction model 410 is a first image to which noise is added, in the training data of the random motion displacement field prediction model 410, the sample image may be an image obtained by adding noise to one frame of image in the training video.
Illustratively, the time information input to the random motion displacement field prediction model 410 may be time information to which noise has been added. The time information after adding noise may represent the training time after adding noise. Adding noise to the time information can be understood as adding noise in the time domain. In the case where the time information input to the random motion displacement field prediction model 410 is time information to which noise has been added, in the training data of the random motion displacement field prediction model 410, the training time information may be information obtained by adding noise to the information representing the training time.
Adding noise in the time domain or the spatial domain can be understood as adding random noise to the motion, i.e., adding randomness to the motion.
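The embodiment does not fix the form of the time embedding; as a pure assumption, a sinusoidal embedding with an optional Gaussian perturbation is sketched below to illustrate "adding noise in the time domain".

```python
import torch

def time_embedding(t, dim=16, noise_std=0.0):
    # sinusoidal embedding of a (scalar) time value, optionally perturbed with noise
    t = torch.as_tensor(t, dtype=torch.float32)
    freqs = torch.exp(torch.arange(dim // 2) * (-torch.log(torch.tensor(10000.0)) / (dim // 2)))
    emb = torch.cat([torch.sin(t * freqs), torch.cos(t * freqs)])
    if noise_std > 0:                       # randomness added to the temporal information
        emb = emb + noise_std * torch.randn_like(emb)
    return emb
```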
From the random motion displacement field at a certain time, the second image at that time can be determined. The second image may then be processed by the image processing system 400 as the first image, so that a second image at a later time can be obtained.
In the case where the sample image is determined from the first frame image in the training video, the training samples in the training data of the random motion displacement field prediction model 410 may also include training target subject region information and/or training motion information.
The training target subject region information may be used to indicate a region in which the training target subject is located in the first frame image of the training video, which may also be referred to as a training target subject region.
The training target subject is in a moving state within a period of time after the training video starts. For example, in the training video, the training target subject may be in motion throughout.
Within a period of time after the training video starts, the subjects other than the training target subject in the training video are in a static state.
The training motion information may represent one or more of a magnitude of motion of the training target subject in the training video, a direction of motion of the training target subject at the beginning of the training video, and the like.
The time length of the training video may be preset, may be determined randomly, may be determined according to a preset rule, and may be determined manually.
Related studies in computer graphics have shown that natural motion, particularly oscillatory motion, can be described as a superposition of a set of resonances, which can be represented by different frequencies, amplitudes and phases.
Adding random noise in the time domain and/or the spatial domain may result in an unrealistic or unstable picture. When the prediction of the random motion displacement field is performed in the time domain, the video generated according to the random motion displacement field cannot be guaranteed to have good temporal consistency and spatial continuity over a long period of time, and as the number of generated second images increases, the video comprising the plurality of second images may drift or diverge.
Video drift or divergence can be understood as the objects recorded in later frames exhibiting ghosting or blurring, or even becoming unrecognizable.
In other embodiments, the random motion displacement field may be predicted in the frequency domain.
As shown in fig. 6, the random motion displacement field prediction model 410 may include a first transform module 411, a random motion texture prediction model 412, and a second transform module 413.
The first transform module 411 is configured to perform fourier transform on the first image I 0, and convert the first image I 0 to a frequency domain, so as to obtain image frequency domain data. The fourier transform may be implemented by a fast fourier transform (fast Fourier transform, FFT).
The random motion texture prediction model 412 is used to process the image frequency domain data to obtain the random motion texture S of the first image I 0.
The random motion texture S may also be referred to as a random motion spectrum comprising a frequency representation of the motion trajectories of all or part of the pixels of the first image I 0, i.e. the motion trajectories in the frequency domain. The random motion texture S includes a frequency representation of the motion profile of each of the plurality of pixels. The frequency representation of the motion trajectory of each pixel may include a component at each of the K frequencies f 0 to f K.
The frequency representation of the motion trajectory of the pixel at position p in the first image I 0 may be represented as S(p). The pixel at position p may be a certain pixel of the first image I 0, and may also be referred to as pixel p.
The second transformation module 413 is configured to perform an inverse fourier transform on the random motion texture S of the first image I 0, so as to convert the random motion texture S into a random motion displacement field D of the first image I 0 corresponding to at least one time point after the time point corresponding to the first image I 0.
Illustratively, the second transformation module 413 converts the frequency representation S(p) of the motion trajectory of each pixel p in the random motion texture S of the first image I 0, so as to obtain the motion displacement D t(p) of the pixel p at at least one time t.
That is, the random motion displacement field D corresponding to the first image I 0 at a certain time t includes the motion displacement D t(p) at that time of each pixel p of the plurality of pixels represented in the random motion texture S.
In the random motion displacement field D at each preset time, the motion displacement D t(p) of pixel p indicates the change of the position of pixel p at that time relative to its position in the first image I 0, i.e., the displacement of pixel p at that time relative to the time corresponding to the first image. That is, at that time, the position of pixel p can be expressed as p + D t(p). Thus, in combination with the position of pixel p in the first image I 0, the position of pixel p in the image at that time can be determined according to the motion displacement D t(p).
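A sketch of turning the frequency representation S(p) of one pixel's motion trajectory into its displacement D t(p) over time, and the resulting positions p + D t(p), is given below for a single displacement component; the array layout, number of retained frequencies, and number of time samples are assumptions.

```python
import numpy as np

K = 16                                                 # number of retained frequency components
S_p = np.random.randn(K) + 1j * np.random.randn(K)     # frequency representation S(p) for pixel p

displacement = np.fft.ifft(S_p, n=64).real             # D_t(p) for 64 time samples (one component)
p_x = 120                                              # pixel p's x-coordinate in the first image
positions_x = p_x + displacement                       # position of pixel p at each moment
```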
The training data of the random motion texture prediction model 412 may include training samples and labeled random motion textures. The training samples of the random motion texture prediction model 412 may include training image frequency domain data. And processing the training sample by using the initial random motion texture prediction model to obtain the training random motion texture. Based on the difference between the training random motion texture and the label random motion texture, parameters of the initial random motion texture prediction model are adjusted to obtain a random motion texture prediction model 412. The random motion texture prediction model 412 is the initial random motion texture prediction model after parameter adjustment. Adjustment of parameters of the initial random motion texture prediction model may minimize or gradually converge the difference between the training random motion texture and the label random motion texture. The difference may be represented by a loss value.
The training image frequency domain data may be obtained by fourier transforming the sample images in the training video. The sample image may be a frame of image in a training video.
The label random motion texture includes a frequency representation of the motion trajectory for each of a plurality of pixels in the sample image. The random motion texture of the tag can be obtained by analyzing a section of video after a sample image in the training video.
The motion trail of each pixel in the plurality of pixels in the sample image can be obtained by processing a section of video after the sample image in the training video in a feature tracking (feature tracking), optical flow extraction (optical flow) or particle video (particle video) manner. The motion trail of each pixel point is then subjected to Fourier transform to obtain the frequency domain representation of the motion trail, thereby obtaining the label random motion texture.
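A hedged sketch of how the label random motion texture could be assembled from tracked motion trails is shown below; the array layout and the choice to keep only the K lowest frequencies are assumptions based on the surrounding description.

```python
import numpy as np

def label_random_motion_texture(trajectories: np.ndarray, K: int) -> np.ndarray:
    """Build a label random motion texture from per-pixel motion trails.

    trajectories: (P, T, 2) array, the (x, y) positions of P tracked pixels
    over T frames of the training video following the sample image (e.g.
    obtained with particle video tracking).
    Returns an array of shape (P, K, 2) with the first K complex Fourier
    coefficients of each pixel's displacement trail in x and y.
    """
    displacements = trajectories - trajectories[:, :1, :]  # motion relative to the sample image
    spectrum = np.fft.fft(displacements, axis=1)            # FFT along the time axis
    return spectrum[:, :K, :]                                # keep the K lowest frequencies

# Hypothetical usage: 100 tracked pixels over 150 frames, K = 16.
trails = np.random.rand(100, 150, 2).astype(np.float32)
S_label = label_random_motion_texture(trails, K=16)
print(S_label.shape)  # (100, 16, 2)
```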
Feature tracking is a critical task in computer vision. In feature tracking, by tracking feature points, the change in the position of the feature points in the video can be determined. Image feature points with common properties are detected in a plurality of images of the video, and the same feature points in different images are matched, so that comparison and tracking between images can be realized, the positions of the feature points in different images of the video are determined, tracking of the feature points is realized, and motion estimation of the feature points is performed.
And determining the motion trail of the characteristic points in the first frame image in the video according to the motion estimation result of the characteristic points. The motion trail of the feature point can be understood as the motion trail of the pixel point.
Optical flow (optical flow) is a concept in the detection of object motion in the field of view, used to describe the movement of an observed object, surface or edge caused by motion relative to an observer. The optical flow is the instantaneous speed of the pixel motion of a spatially moving object on the observation imaging plane; it is a method that uses the change of pixels in an image sequence over the time domain and the correlation between adjacent frames to find the correspondence between the previous frame and the current frame, so as to calculate the motion information of the object between adjacent frames.
Particle video (particle video) is a video and corresponding particle set. That is, particle video uses a set of particles to represent video motion. The particle i has a time-varying position (xi (t); yi (t)), which is defined between a start frame and an end frame of the particle. By sweeping the video forward and then backward, a particle video can be constructed. Each particle is an image sample with a long duration trajectory and other characteristics.
Particle video improves the frame-to-frame optical flow by enhancing long-range appearance consistency and motion consistency.
Fig. 7 (a), (b), and (c) show the time-dependent changes in the positions of pixels in a video determined based on feature point tracking, optical flow estimation, and particle video, respectively. In fig. 7, the horizontal axis represents time, and the vertical axis represents the abscissa of pixels in a frame of the video. As shown in fig. 7 (b), optical flow estimation tends to produce an optical flow field that is overly smooth but has short trajectories, unrealistic motion textures, or speckles. As shown in fig. 7 (a), the motion trajectories extracted by feature point tracking are long but sparse. As shown in fig. 7 (c), the motion trajectories extracted by means of particle video are dense and long. Therefore, the motion trails of pixels can be extracted from the training video in a particle video manner.
The embodiment of the application represents the motion texture of each pixel in the image in the frequency domain, that is, converts the first image I 0 into image frequency domain data and determines the random motion texture according to the image frequency domain data, thereby avoiding drift or divergence of the generated video as time increases.
In specific processing, the 10th frame and the following 149 frames of the acquired real video can be used as the training video to generate the motion trails of a plurality of pixel points. Tracks with motion estimation errors, tracks whose motion amplitude is excessive due to camera motion, and tracks of pixel points that move at all moments are removed to obtain the filtered motion tracks of the plurality of pixel points. The label random motion texture is obtained according to the filtered motion tracks of the plurality of pixel points. The label random motion texture can be understood as the true value (ground truth) in the training process.
In general, motion textures have a specific distribution characteristic over frequency. The amplitude of the motion texture may range from 0 to 100 and decays approximately exponentially with increasing frequency. Since the output of the diffusion model is between 0 and 1, to train and denoise the model stably, the label random motion texture extracted from the real video may be normalized.
If the amplitude of the label random motion texture is scaled to [0, 1] depending on the width and height of the image, then almost all coefficients at higher frequencies will be close to zero. A model trained on such training data would produce inaccurate motion. When the amplitude of the normalized label random motion texture is very close to zero, even small prediction errors in the inference process can result in large relative errors after inverse normalization. To solve this problem, a simple but effective frequency-adaptive normalization technique can be employed. Specifically, statistics are calculated based on the motion textures before normalization, so that the Fourier coefficients at each frequency are normalized independently.
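A minimal sketch of such a frequency-adaptive normalization is given below; the 95th-percentile statistic, the array layout, and the epsilon guard are assumptions, since the text only states that per-frequency statistics are computed on the unnormalized textures.

```python
import numpy as np

def frequency_adaptive_normalize(textures: np.ndarray, eps: float = 1e-6):
    """Normalize label motion textures independently per frequency.

    textures: (N, K, C) array of N training motion textures with K
    frequencies and C real-valued channels. Each frequency is scaled by a
    statistic computed over the unnormalized training set (here the 95th
    percentile of the amplitude, an assumed choice), so that the weaker high
    frequencies are not crushed toward zero by a single global scale.
    Returns the normalized textures and the per-frequency scale needed to
    invert the normalization at inference time.
    """
    amplitude = np.abs(textures)                              # (N, K, C)
    scale = np.percentile(amplitude, 95, axis=(0, 2)) + eps   # one scale per frequency, (K,)
    normalized = textures / scale[None, :, None]
    return normalized, scale

# Hypothetical usage.
S_all = np.random.randn(32, 16, 4).astype(np.float32)
S_norm, freq_scale = frequency_adaptive_normalize(S_all)
print(S_norm.shape, freq_scale.shape)  # (32, 16, 4) (16,)
```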
The random motion texture S output by the random motion texture prediction model 412 may be a 2-dimensional motion spectrogram comprising 4K channels, where K is the number of frequencies in the random motion texture S and may also be understood as the number of fourier coefficients. At each frequency in the random motion texture S, the complex fourier transform coefficients in the x and y dimensions can be represented by four scalar quantities. The image frequency domain data may be represented in the form of an image, and x and y may represent coordinates in the image frequency domain data represented in the form of an image, respectively. In the image frequency domain data represented in the form of an image, the coordinates x and y can be represented by complex fourier transform coefficients. The complex fourier transform coefficient z can be expressed as z=x+iy, i being an imaginary unit.
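As an illustration of the 4K-channel layout described above, the following sketch packs the complex coefficients of the x and y motion dimensions into four real scalars per frequency; the array shapes are assumptions.

```python
import numpy as np

def pack_motion_spectrum(spectrum_x: np.ndarray, spectrum_y: np.ndarray) -> np.ndarray:
    """Pack complex Fourier coefficients into a 4K-channel real tensor.

    spectrum_x, spectrum_y: complex arrays of shape (H, W, K), the per-pixel
    coefficients of the motion in the x and y dimensions at K frequencies.
    Each frequency contributes four scalars (real/imaginary parts for x and
    for y), giving a 2D motion spectrogram with 4K channels.
    """
    parts = [spectrum_x.real, spectrum_x.imag, spectrum_y.real, spectrum_y.imag]
    return np.concatenate(parts, axis=-1).astype(np.float32)  # (H, W, 4K)

# Hypothetical usage with K = 16 frequencies on a 64x64 grid.
sx = np.random.randn(64, 64, 16) + 1j * np.random.randn(64, 64, 16)
sy = np.random.randn(64, 64, 16) + 1j * np.random.randn(64, 64, 16)
print(pack_motion_spectrum(sx, sy).shape)  # (64, 64, 64)
```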
Most natural oscillation-like motions are mainly composed of low-frequency components, and the frequency spectrum of the motion decreases exponentially as the frequency increases, which indicates that most natural oscillation-like motions can be well represented by low frequencies. When K = 16, the number of Fourier coefficients is sufficient to faithfully reproduce the original natural motion in a range of videos and scenes. Thus, K may be greater than or equal to 16 and less than a preset threshold.
The random motion texture prediction model 412 may be a neural network model, for example, a generative model, such as a diffusion model or a latent diffusion model (latent diffusion model, LDM).
In the training process of the random motion texture prediction model 412, the parameters of the initial random motion texture prediction model are adjusted according to the difference between the training random motion texture and the label random motion texture, which can be understood as adjusting the parameters of the initial random motion texture prediction model according to the difference between the training denoising data obtained by removing the gaussian noise each time and the label denoising data corresponding to the denoising. The label denoising data corresponding to each denoising is determined according to the random motion textures of the labels. The label denoising data corresponding to the last denoising is label random motion texture, and the label denoising data corresponding to other denoising is obtained by adding Gaussian noise on the basis of the label random motion texture. The degree of gaussian noise added to the tag denoising data for each denoising is different.
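The per-step supervision described above corresponds to standard diffusion-model training. A minimal sketch of one training step is given below, assuming a PyTorch implementation, an epsilon-prediction parameterization, a noise schedule alphas_cumprod, and a hypothetical model(noisy, t, condition) call signature; none of these choices are fixed by the patent.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, label_texture, condition, alphas_cumprod, optimizer):
    """One hedged training-step sketch for the initial random motion texture
    prediction model when it is a diffusion model.

    label_texture:  (B, 4K, H, W) label random motion texture.
    condition:      (B, C, H, W) conditioning input (image frequency domain data).
    alphas_cumprod: (T,) cumulative noise schedule.
    """
    B = label_texture.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=label_texture.device)
    noise = torch.randn_like(label_texture)                    # Gaussian noise to add
    a = alphas_cumprod[t].view(B, 1, 1, 1)
    noisy = a.sqrt() * label_texture + (1 - a).sqrt() * noise  # label denoising data for step t
    pred_noise = model(noisy, t, condition)                    # model predicts the added noise
    loss = F.mse_loss(pred_noise, noise)                       # the difference drives parameter updates
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```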
In the case where the sample image is the first frame image in the training video, the training samples in the training data of the random motion texture prediction model 412 may also include training target subject region information and/or training motion information.
In some embodiments, the random motion texture prediction model 412 may be a diffusion model. In the case where the random motion texture prediction model 412 is a diffusion model, the conditions may include image frequency domain data. The condition may further include target subject area information and/or motion information if the first image is an image to be processed.
In the training process of the random motion texture prediction model 412, the training samples are processed by using the initial random motion texture prediction model, which can be understood as that the noise frequency domain data is subjected to multiple gaussian noise removal based on the training samples by using the initial random motion texture prediction model, so as to obtain the training random motion texture.
Noise frequency domain data may be understood as noise of a frequency domain representation, i.e. a frequency domain representation of pure gaussian noise data. Training samples can be understood as conditions in the diffusion model training process.
In the random motion texture S, each frequency is represented with four channels. A straightforward method of predicting a random motion texture S with K frequencies is to use a diffusion model as the random motion texture prediction model 412, whose output is a tensor with 4K channels. However, training a model with a large number of output channels tends to result in an overly smooth and less accurate output. Another approach is to inject additional frequency parameters into the diffusion model to predict the motion spectrum of each frequency independently, but this can lead to uncorrelated predictions in the frequency domain and thus unrealistic motion. To avoid the above problems, the random motion texture prediction model 412 may employ an LDM.
In other embodiments, the random motion texture prediction model 412 may be an LDM. The LDM is more efficient than the diffusion model while maintaining the quality of the generated result.
As shown in fig. 8 (a), the LDM 500 may include three parts, namely an encoder 510, a diffusion model (diffusion model, DM) 520, and a decoder 530. The encoder 510 and the decoder 530 may be the encoder and the decoder, respectively, in a variational autoencoder (VAE).
The encoder 510 is used to compress, i.e., map, the input data to the hidden space. The hidden space may also be referred to as a latent space. Based on the result of compressing the data by the encoder 510, a plurality of iterations are performed by using the diffusion model 520, and Gaussian noise is removed in each iteration, so that the result of the plurality of iterations is denoised compressed data. The decoder 530 is configured to decode the denoised compressed data to obtain denoised data.
In the case where the random motion texture prediction model 412 employs the LDM 500, as shown in (a) of fig. 8, the encoder 510 in the LDM 500 is used to compress image frequency domain data to obtain image compression data. Compression of the image frequency domain data by the encoder 510 may also be understood as mapping the image frequency domain data to hidden space. The image compression data may also be referred to as frequency embedding (embedding). The data amount of the image compression data is smaller than the data amount of the image frequency domain data.
Encoder 510 in LDM 500 may also be configured to compress the noise frequency domain data to obtain compressed noise frequency domain data.
The diffusion model 520 in the LDM 500 is configured to perform denoising processing on the compressed noise frequency domain data for multiple times according to conditions, so as to obtain compressed data. The decoder 530 in LDM 500 is used to decode the compressed data to obtain the random motion texture S of the first image I 0. The conditions of the input diffusion model 520 include image compression data. The conditions of the input diffusion model 520 may also include a step index t.
That is, each denoising process of the diffusion model 520 is performed, according to the conditions, on the data obtained by the previous denoising process. The data obtained by the previous denoising process can be understood as data z t, and the data obtained by this denoising process can be understood as z t-1. The data z t can be obtained by denoising the compressed noise frequency domain data T−t times through the diffusion model 520, where T is the total number of denoising steps of the diffusion model 520.
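A minimal inference sketch of the LDM 500 pipeline described above is given below; the encoder/diffusion/decoder call signatures and the assumption that the latent noise has the same shape as the condition are illustrative only.

```python
import torch

@torch.no_grad()
def ldm_generate_motion_texture(encoder, diffusion, decoder, image_freq, T: int):
    """Hedged inference sketch for the LDM 500: the encoder 510 compresses the
    image frequency domain data into image compression data used as the
    condition, the diffusion model 520 removes Gaussian noise from compressed
    noise frequency domain data over T steps, and the decoder 530 decodes the
    result into the random motion texture S.
    """
    cond = encoder(image_freq)                   # image compression data (latent condition)
    z = torch.randn_like(cond)                   # compressed noise frequency domain data z_T
    for t in reversed(range(1, T + 1)):
        step = torch.full((z.shape[0],), t, device=z.device, dtype=torch.long)
        z = diffusion(z, step, cond)             # one denoising step: z_t -> z_{t-1}
    return decoder(z)                            # random motion texture S
```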
The noise frequency domain data may be derived by adding gaussian noise in the frequency domain to the motion texture.
In the case where the random motion texture prediction model 412 is an LDM, the encoder and the decoder in the random motion texture prediction model 412 may be pre-trained. For the pre-training of the encoder and the decoder, the training data may include tag data. The training data used in the pre-training process of the encoder and the decoder may also be referred to as pre-training data.
In the pre-training process, the initial encoder can be utilized to compress the label data to obtain training compressed data, and the initial decoder is utilized to decompress the training compressed data to obtain training decompressed data. Based on the difference between the tag data and the training decompressed data, parameters of the initial encoder and the initial decoder are adjusted, the adjusted initial encoder may be used as the encoder 510 in the random motion texture prediction model 412, and the adjusted initial decoder may be used as the decoder 530 in the random motion texture prediction model 412. The difference may be represented by a loss value.
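A minimal sketch of one such pre-training step is given below, assuming PyTorch modules and a plain reconstruction loss; a VAE would additionally regularize the latent, which is omitted here.

```python
import torch
import torch.nn.functional as F

def autoencoder_pretrain_step(encoder, decoder, tag_data, optimizer):
    """Hedged sketch of one pre-training step for the encoder 510 and decoder
    530: compress the tag data, decompress it, and adjust both models so the
    training decompressed data matches the tag data.
    """
    compressed = encoder(tag_data)                # training compressed data
    reconstructed = decoder(compressed)           # training decompressed data
    loss = F.mse_loss(reconstructed, tag_data)    # the difference, represented by a loss value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```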
As shown in (b) of fig. 8, in order to obtain training data required for diffusion model training in LDM 500, tag data may be input to encoder 510 to obtain tag hidden space features without added noise. And then gradually adding Gaussian noise into the hidden space features to generate tag hidden variable features with different noise degrees. The training data includes label hidden space features without added noise and label hidden variable features with different noise degrees. The tag hidden variable feature with the highest noise degree can be understood as pure noise.
For LDM 500 applied to the image processing system of the present application, the label data may be label motion textures.
In the process of training the diffusion model in the LDM 500, the label hidden variable features with the highest noise degree are subjected to denoising treatment for multiple times by using the initial diffusion model, and a plurality of training hidden variable features are obtained in sequence.
The diffusion model performs denoising processing the same number of times as that of adding noise to the hidden space features. The plurality of training hidden variable features are in one-to-one correspondence with a plurality of tag hidden variable features other than the tag hidden variable feature having the highest noise level. The first obtained training hidden variable feature corresponds to the tag hidden variable feature which is the tag hidden variable feature with the highest noise degree in a plurality of tag hidden variable features except the tag hidden variable feature with the highest noise degree, and the last obtained training hidden variable feature corresponds to the tag hidden variable feature with the lowest noise degree. The noise level in the tag hidden variable feature corresponding to each training hidden variable feature decreases as the resulting sequence number of the plurality of training hidden variable features increases.
And adjusting parameters of the initial diffusion model according to the difference between each training hidden variable characteristic and the tag hidden variable characteristic corresponding to the training hidden variable characteristic, so as to obtain the diffusion model. The diffusion model is an initial diffusion model after parameter adjustment. The difference between each training hidden variable feature and the tag hidden variable feature corresponding to the training hidden variable feature can be represented by a loss value.
The condition may further include target subject area information and/or motion information if the first image is an image to be processed.
In the case where the random motion texture prediction model 412 is LDM, the process of generating a motion texture may also be understood as a process of frequency domain coordinated denoising.
In still other embodiments, the random motion texture prediction model 412 may include a compression model and a diffusion model.
The compression model may be used to compress the image frequency domain data to obtain image compressed data. The diffusion model may perform denoising processing on the noise frequency domain data multiple times according to conditions, so as to obtain a random motion texture S of the first image I 0. The conditions of the input diffusion model 520 may include image compression data and may also include a step index t.
In the multiple denoising processes, each denoising process of the diffusion model can be understood as removing Gaussian noise from input data according to conditions, and the obtained data can be used as input data of the next denoising process of the diffusion model. The input data of the first denoising process is noise frequency domain data, and the input data of the diffusion model in other denoising processes after the first denoising process is the output of the diffusion model of the last denoising process. The last denoising processing of the diffusion model obtains data which is the random motion texture S of the first image I 0.
The number of times of the multiple denoising processes is T, and the noise frequency domain data can be understood as noise represented in the frequency domain. That is, the input of the first denoising process of the diffusion model, i.e., the noise frequency domain data z T, can be understood as a frequency domain representation of pure Gaussian noise data.
Denoting the data in the form z t and taking the noise frequency domain data as z T, the i-th denoising process of the diffusion model can be understood as removing Gaussian noise from the input data z t according to the conditions to obtain data z t-1, where t may be denoted as T−i+1. The data z 0 obtained by the last denoising process of the diffusion model is the random motion texture S of the first image I 0.
In the case where the random motion texture prediction model 412 includes a compression model and a diffusion model, the compression model may be pre-trained. For the pre-training of the compression model, reference may be made to the pre-training of the encoder 510. The compression model may be an encoder.
In the case where the random motion texture prediction model 412 includes a compression model and a diffusion model, training of the random motion texture prediction model 412 may be understood as training of the diffusion model.
In the training process of the random motion texture prediction model 412, processing the training samples by using the initial random motion texture prediction model can be understood as follows: the training samples are compressed by using the VAE to obtain training image compression data, and Gaussian noise is removed from the noise frequency domain data multiple times based on the training samples by using the initial diffusion model, so as to obtain the training random motion texture. The training samples include the training image compression data.
Adjusting the parameters of the initial random motion texture prediction model according to the difference between the training random motion texture and the label random motion texture can be understood as adjusting the parameters of the initial random motion texture prediction model according to the difference between the training denoising data obtained by each removal of Gaussian noise and the label denoising data corresponding to that denoising. The label denoising data corresponding to each denoising is determined according to the label random motion texture. The label denoising data corresponding to the last denoising is the label random motion texture, and the label denoising data corresponding to the other denoising steps is obtained by adding Gaussian noise on the basis of the label random motion texture. The degree of Gaussian noise added in the label denoising data differs for each denoising step.
Training denoising data obtained by removing Gaussian noise each time can be used as a basis for removing Gaussian noise next time. That is, the first gaussian noise removal may be performed on the noise frequency domain data, and the other gaussian noise removal may be performed on the training denoising data obtained by the previous gaussian noise removal.
The second transformation module 413 processes the random motion texture S of the first image I 0. The random motion texture S of the first image I 0 is subjected to inverse fourier transform to obtain a random motion displacement field at least at one preset time after the first image I 0. In the random motion displacement field at each preset moment, the motion displacement field of each pixel p in the first image I 0 is obtained by performing inverse fourier transform on the frequency representation S p of the motion trail of the pixel.
The inverse Fourier transform may be an inverse fast Fourier transform. That is, the motion displacement field of each pixel p is D t (p) = FFT −1 (S p ), where FFT −1 represents the inverse fast Fourier transform. The random motion displacement field at a preset moment t can be expressed as D t = {D t (p), p = 1, …, P}, where P is the number of pixels in the first image I 0.
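A minimal sketch of this conversion for one pixel is shown below; zero-padding the K retained coefficients to the frame count and taking the real part of the inverse FFT are assumed simplifications.

```python
import numpy as np

def spectrum_to_displacements(S_p: np.ndarray, num_frames: int) -> np.ndarray:
    """Convert the frequency representation S_p of one pixel's motion trail
    (shape (K, 2), complex coefficients for x and y) into per-frame motion
    displacements D_t(p) via an inverse fast Fourier transform.
    """
    K = S_p.shape[0]
    full = np.zeros((num_frames, 2), dtype=complex)
    full[:K] = S_p                                  # keep only the K low-frequency terms
    return np.fft.ifft(full, axis=0).real           # (num_frames, 2) displacements over time

# Hypothetical usage: K = 16 coefficients expanded to 150 frames.
S_p = np.random.randn(16, 2) + 1j * np.random.randn(16, 2)
D_p = spectrum_to_displacements(S_p, num_frames=150)
print(D_p.shape)  # (150, 2)
```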
The position of each pixel in the random motion displacement field at the preset moment can be determined according to the motion displacement field of the pixel in the random motion displacement field at the preset moment.
Compared with the diffusion model for processing the image frequency domain data, the LDM 500 compresses the image frequency domain data and processes the compressed image data obtained by compression, so that the overall data processing amount can be reduced, and the processing efficiency can be improved.
The motion displacement field at a future moment is predicted by using a generative diffusion model, which enables fine-grained control of the moving subject. The motion texture is obtained by processing in the frequency domain rather than by directly processing the original RGB pixels, which improves calculation efficiency and keeps the generated motion representation consistent over a long time.
Among the positions of the respective pixels at the next moment determined from the random motion displacement field of the first image I 0, the positions of multiple pixels at the next moment may coincide. In the image represented by the positions of the pixels at the next moment determined from the random motion displacement field of the first image I 0, there may be positions that are not covered by any pixel p, that is, there may be holes in the image determined from the positions of the pixels at the next moment determined from the random motion displacement field D of the first image I 0.
To avoid this, the features of the first image I 0 may be adjusted according to the features of the random motion displacement field D of the first image I 0, and a second image at the next time and audio at the next time may be generated according to the adjusted features. The audio at the next time may be understood as the audio corresponding to the second image at the next time.
The image feature extraction model 430 may be a convolutional neural network, including a plurality of convolutional layers. The first feature may include only the first sub-feature of the last convolutional layer output of the image feature extraction model 430 in the feature extraction of the first image I 0. Or the first features may include a plurality of first sub-features respectively output by a plurality of convolution layers in the image feature extraction model 430. The plurality of convolution layers outputting the plurality of first sub-features includes a last convolution layer of the image feature extraction model 430.
When a convolutional neural network has a plurality of convolution layers, the initial convolution layers tend to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolution layers become more complex, such as features with high-level semantics, and features with higher-level semantics are more suitable for the problem to be solved.
In the case where the first feature includes a plurality of first sub-features respectively output from a plurality of convolution layers in the image feature extraction model 430, the first feature may be expressed as {F j , j = 1, …, J}, where J is the number of the plurality of first sub-features and J is a positive integer greater than 1.
Therefore, the second image and the audio corresponding to the second image are generated according to the first characteristic, and the influence of more information is considered, so that the second image and the audio corresponding to the second image are more accurate.
The sizes of the plurality of first sub-features respectively output by the plurality of convolution layers in the image feature extraction model 430 are different. The plurality of first sub-features may also be understood as multi-scale features of the first image I 0 obtained by the image feature extraction model 430 multi-scale encoding the first image I 0. The image feature extraction model 430 may be understood as an encoder.
Among the plurality of first sub-features, the further back the convolution layer in the image feature extraction model 430, the smaller the size of the first sub-feature it outputs. Thus, the image feature extraction model 430 may also be referred to as a feature pyramid extractor.
The motion feature extraction model 420 may be a convolutional neural network, including a plurality of convolutional layers. The motion features may include only the motion sub-features output by the last convolution layer in the feature extraction of the random motion displacement field D of the first image I 0 by the motion feature extraction model 420. Or the motion features may include a plurality of motion sub-features that are output by a plurality of convolution layers, respectively, in the motion feature extraction model 420. The plurality of convolution layers outputting the plurality of motion sub-features includes a last convolution layer of the motion feature extraction model 420.
The plurality of motion sub-features output by the plurality of convolution layers in the motion feature extraction model 420 differ in size. Among the plurality of motion sub-features, the further back the convolution layer in the motion feature extraction model 420, the smaller the size of the motion sub-feature it outputs. Thus, the motion feature extraction model 420 may also be referred to as a feature pyramid extractor.
In the case where the first feature comprises a plurality of first sub-features, different first sub-features correspond to different movement sub-features. The size of each first sub-feature is the same as the size of the corresponding motion sub-feature of the first sub-feature.
The adjustment model 440 may adjust each first sub-feature according to the motion sub-feature corresponding to the first sub-feature to obtain a second sub-feature corresponding to the first sub-feature. The second feature includes a second sub-feature corresponding to each of the plurality of first sub-features.
Or the first feature may be obtained by stitching a plurality of first sub-features, and the motion sub-feature may be obtained by stitching a plurality of motion sub-features. That is, the result of stitching the plurality of motion sub-features, and the result of stitching the plurality of first sub-features may be input to the adjustment model 440, such that the adjustment model 440 may output the second feature.
From the plurality of motion sub-features, a weight corresponding to each pixel p in the first image I 0 may be determined. The weight for each pixel p may be positively correlated with the magnitude of the pixel p in the plurality of motion sub-features.
The weight corresponding to the pixel p can be expressed in terms of the amplitudes of the pixel p in the G motion sub-features, where G is the number of the plurality of motion sub-features, A g (p) represents the amplitude of the pixel p in the g-th motion sub-feature, and g is a positive integer less than or equal to G. The sizes of the motion sub-features may be different. The amplitude of the pixel p in the g-th motion sub-feature may be obtained by scaling the g-th motion sub-feature to the same size as the first image. The scaling of the motion sub-features may be achieved by interpolation.
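A hedged sketch of computing these weights is shown below, assuming PyTorch feature maps; the mean aggregation over scales is an assumption, since the text only requires the weight to be positively correlated with the amplitudes.

```python
import torch
import torch.nn.functional as F

def pixel_weights_from_motion_features(motion_subfeatures, image_size):
    """Compute a per-pixel weight map from the G motion sub-features.

    Each motion sub-feature of shape (B, C, h_g, w_g) is rescaled by
    interpolation to the first image's size (H, W) and its amplitude is
    accumulated, so the weight of a pixel p grows with its amplitude across
    the G motion sub-features. Returns a (B, H, W) weight map.
    """
    H, W = image_size
    weights = 0.0
    for feat in motion_subfeatures:                              # G sub-features
        up = F.interpolate(feat, size=(H, W), mode="bilinear", align_corners=False)
        weights = weights + up.norm(dim=1)                       # amplitude of pixel p at this scale
    return weights / len(motion_subfeatures)                     # assumed mean over scales
```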
The adjustment model 440 may obtain the second sub-feature according to the plurality of first sub-features, the plurality of motion sub-features, and the weights corresponding to the plurality of pixels in the first image I 0.
The adjustment model 440 may adjust the first feature using an activation function (activation function). The activation function, which may also be referred to as an excitation function, introduces nonlinearity into the overall structure of the model. The activation function is often a fixed, parameter-free nonlinear transformation that determines the range of values of the neuron output.
The adjustment of the first feature by the adjustment model 440 using the activation function may also be understood as the adjustment model 440 performing a warping calculation. Warping (Warping) is used to describe the process of mapping data into another data according to certain transformation rules. The transformed data is aligned or matched with the reference data by geometrically transforming the data. Common warping calculations include affine transformations, perspective transformations, and general nonlinear transformations.
The adjustment model 440 may employ direct average warping, baseline depth warping, or the like.
Or the adjustment model 440 may process the weights corresponding to the plurality of first sub-features, the plurality of motion sub-features, and the plurality of pixels p in the first image I 0, respectively, using the activation function softmax to obtain the second feature. That is, the adjustment model 440 may utilize the activation function softmax to process the motion feature, weights corresponding to the plurality of pixels p in the first image I 0, and the first feature, to obtain the second feature. The second feature may also be referred to as a warp feature.
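A minimal sketch of such a softmax-weighted forward warp is shown below, assuming PyTorch tensors; the nearest-target splatting and the exponential weighting are illustrative simplifications, not the patent's exact operator.

```python
import torch

def softmax_weighted_warp(first_subfeature, displacement, weight):
    """Forward-warp a first sub-feature (B, C, H, W) by the motion displacement
    (B, 2, H, W) and resolve pixels that land on the same target location with
    softmax-style weights derived from the per-pixel weight map (B, H, W).
    """
    B, C, H, W = first_subfeature.shape
    out_num = torch.zeros_like(first_subfeature)
    out_den = torch.zeros(B, 1, H, W, device=first_subfeature.device)
    ys, xs = torch.meshgrid(
        torch.arange(H, device=first_subfeature.device),
        torch.arange(W, device=first_subfeature.device),
        indexing="ij",
    )
    tx = (xs[None] + displacement[:, 0].round().long()).clamp(0, W - 1)  # target x
    ty = (ys[None] + displacement[:, 1].round().long()).clamp(0, H - 1)  # target y
    w = weight.exp()[:, None]                                            # softmax-style weighting
    for b in range(B):
        flat = (ty[b] * W + tx[b]).reshape(-1)
        out_num[b].reshape(C, -1).index_add_(1, flat, (first_subfeature[b] * w[b]).reshape(C, -1))
        out_den[b].reshape(1, -1).index_add_(1, flat, w[b].reshape(1, -1))
    return out_num / out_den.clamp_min(1e-6)                             # second (warped) sub-feature
```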
By adopting a distortion strategy in the image generation process, the occurrence of holes in the generated image is avoided, and the mapping of a plurality of original pixels to the same position is also avoided. The multi-scale technology is adopted, namely, the image is generated according to the output of a plurality of layers in the feature extraction model, the generated image is finer and more accurate, accurate prediction of the motion mode is facilitated, and a more real video is obtained.
The image generation model 450 and the audio generation model 460 may be generation models. The structure of the image generation model 450 and the audio generation model 460 may be the same or different.
The image generation model 450 and the audio generation model 460 respectively process the same second features to obtain a second image and audio corresponding to the second image. The same identification may be set for the resulting second image and audio generated from the same second feature, and different identifications may be set for different second images. That is, the second image and the audio corresponding to the second image may be provided with the same identification to represent the correspondence of the images and the audio. Therefore, in the process of playing the target video comprising a plurality of second images and the audio corresponding to each second image, under the condition that any second image is displayed, the audio corresponding to the second image can be determined according to the identification of the second image, and the large difference between the displayed image and the content of the played audio caused by inconsistent playing speeds of the image and the audio in the process of playing the target video is avoided.
The identification of the second image and the audio setting corresponding to the second image may also be understood as time embedding.
In image processing system 400, motion feature extraction model 420, image feature extraction model 430, image generation model 450, and audio generation model 460 are co-trained.
For training of the encoder 510, the initial encoder and the initial decoder may be trained using image training data. Encoder 510 may be a trained parameter-adjusted initial encoder.
For the joint training of the motion feature extraction model 420, the image feature extraction model 430, the image generation model 450, and the audio generation model 460, the initial motion feature extraction model, the initial image feature extraction model, the initial image generation model, and the initial audio generation model may be trained using joint training data. The motion feature extraction model 420 may be the trained, parameter-adjusted initial motion feature extraction model, the image feature extraction model 430 may be the trained, parameter-adjusted initial image feature extraction model, the image generation model 450 may be the trained, parameter-adjusted initial image generation model, and the audio generation model 460 may be the trained, parameter-adjusted initial audio generation model.
The joint training data may include training samples and tag data, where the training samples include sample images and training random motion displacement fields, and the tag data includes tag images and tag audio. The initial motion feature extraction model is used for extracting features of the training random motion displacement field to obtain the training motion feature. The initial image feature extraction model is used for extracting features of the sample image to obtain the first training feature. The adjustment model 440 is used for adjusting the first training feature according to the training motion feature to obtain the second training feature. The initial image generation model is used for generating a training image according to the second training feature. The initial audio generation model is used for generating training audio according to the second training feature. Parameters of the initial motion feature extraction model, the initial image feature extraction model, the initial image generation model, and the initial audio generation model are adjusted according to the difference between the training image and the tag image and the difference between the tag audio and the training audio, to obtain the motion feature extraction model 420, the image feature extraction model 430, the image generation model 450, and the audio generation model 460. The differences may be represented by loss values.
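A compact sketch of one joint training step over the four initial models is given below, assuming PyTorch modules with hypothetical call signatures; the specific loss functions are assumptions.

```python
def joint_training_step(motion_enc, image_enc, adjust, image_gen, audio_gen,
                        sample_image, train_disp_field, tag_image, tag_audio,
                        image_loss, audio_loss, optimizer):
    """One hedged joint training step: extract the training motion feature and
    the first training feature, adjust them into the second training feature,
    generate the training image and training audio, and update all trainable
    parameters from the two differences.
    """
    motion_feat = motion_enc(train_disp_field)     # training motion feature
    first_feat = image_enc(sample_image)           # first training feature
    second_feat = adjust(first_feat, motion_feat)  # second training feature
    train_image = image_gen(second_feat)           # training image
    train_audio = audio_gen(second_feat)           # training audio
    loss = image_loss(train_image, tag_image) + audio_loss(train_audio, tag_audio)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```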
The image processing system provided by the embodiment of the application can generate, from an image, a plurality of images and the audio in a video. By processing a single image to be processed, the generation of a dynamic video with audio can be achieved. The video may be wallpaper for the electronic device. In some cases, the generated video may be played in a seamless loop.
The image processing system can generate videos according to the indicated target main body of the user and the target motion trend of the target main body, so that the user can edit the motion, the participation of the user is improved, and the user experience is improved. The video generated according to the user's instruction can be understood as an interactive animation.
The training video used in the training process can be acquired by a camera. Therefore, according to the image processing system obtained through training of the training video, the generated video is high in reality, and dynamic simulation can be achieved. It should be understood that the user indicated trend of the target movement may be understood as an external force exerted on the target body.
Next, a training method of the image processing system used in the image processing method shown in fig. 3 will be described with reference to fig. 9.
Fig. 9 is a schematic flowchart of a training method of an image processing system according to an embodiment of the present application. The method shown in fig. 9 includes steps S910 to S930.
Step S910, a training sample, a tag image, and a tag audio are acquired.
The training samples may include sample images in the training video, the tag images may be images subsequent to the sample images in the training video, and the tag audio may be audio in the training video when the tag images in the training video are displayed.
In step S920, the training samples are processed by using the initial image processing system to obtain a training image and training audio.
Step S930, adjusting parameters of the initial image processing system according to the first difference between the training image and the tag image and the second difference between the training audio and the tag audio, where the adjusted initial image processing system is the image processing system obtained through training.
Both the first difference and the second difference may be represented by loss values. The first difference may be represented by a perceptual loss, for example. The perceptual loss may be a visual geometry group (visual geometry group, VGG) perceptual loss, or a learned perceptual image patch similarity (learned perceptual image patch similarity, LPIPS), or the like.
In training a neural network model, because the output of the neural network model is expected to be as close as possible to the value actually desired, the predicted value of the current network can be compared with the actually desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer in the neural network model). For example, if the predicted value of the model is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the neural network model can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is done by the loss function (loss function) or the objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value of the loss function, i.e. the loss value (loss), the larger the difference, and the training of the neural network model becomes a process of reducing the loss as much as possible.
The size of the parameters in the initial neural network model can be corrected in the training process by adopting an error back propagation (back propagation, BP) algorithm, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation process dominated by the error loss and is intended to obtain the parameters of an optimal neural network model, for example, a weight matrix.
Optionally, the training sample may further include training time information, the training time information being used to represent a training time interval, and the tag image and the tag audio may be images that are in the training video after the sample image and at a training time interval from the sample image.
Optionally, the training sample may further include training subject area information, where the training subject area information indicates a training subject area, and a training target subject is recorded in the training subject area in the sample image, and content recorded in other areas in the label image outside the training subject area is the same as content recorded in the other areas in the sample image.
The training target subject may be a subject moving in a training video. For example, the training target subject may be a subject that moves at the beginning of the training video.
Optionally, in the case where the sample image is a first frame image in the training video, the training sample includes training subject region information.
It should be appreciated that the number of training samples for the image processing system is plural. In step S910, each training sample, the label image corresponding to the training sample, and the label audio corresponding to the training sample may be acquired.
Optionally, the training samples may also include training exercise information. In the case where the training sample includes training motion information, the training target subject in the tag image is moved according to the training motion tendency represented by the training motion information, compared to the sample image. That is, the movement of the training target body conforms to the training movement tendency indicated by the training movement information.
Optionally, in the case where the sample image is a first frame image in the training video, the training sample includes training motion information.
Optionally, the initial image processing system comprises an initial feature prediction model, an initial image generation model, and an initial audio generation model.
The initial feature prediction model is used for processing the training samples to obtain training prediction features.
The initial image generation model is used for processing the training prediction features to obtain a training image.
The initial audio generation model is used to process the predicted features to obtain training audio.
The initial characteristic prediction model after parameter adjustment is a characteristic prediction model in the image processing system, the initial image generation model after parameter adjustment is an image generation model in the image processing system, and the initial audio generation model after parameter adjustment is an audio generation model in the image processing system.
Optionally, the initial feature prediction model includes a motion displacement field prediction model, an initial motion feature extraction model, an initial image feature extraction model, and an adjustment model.
The motion displacement field prediction model is used for processing training samples to obtain a training motion displacement field, and the training motion displacement field represents the displacement of a plurality of training pixels in a sample image at a second training moment corresponding to a label image relative to a first training moment corresponding to the sample image. The second training time corresponding to the label image and the first training time corresponding to the sample image can be respectively understood as the time of the label image and the sample image in the training video.
The initial motion feature extraction model is used for extracting features of the motion displacement field to obtain training motion features.
The initial image feature extraction model is used for extracting features of the sample image to obtain training image features.
The adjustment model is used for adjusting the training image characteristics according to the training motion characteristics so as to obtain training prediction characteristics.
The initial motion feature extraction model after parameter adjustment is a motion feature extraction model in an image processing system, and the initial image feature extraction model after parameter adjustment is an image feature extraction model in the image processing system.
The motion displacement field prediction model may be a pre-trained neural network model.
Or the initial feature prediction model may include an initial motion feature extraction model, an initial image feature extraction model, and an adjustment model.
The training samples are processed in a mode of feature tracking, optical flow extraction or particle video, and a training motion displacement field can be determined. The training motion displacement field obtained by processing the feature tracking, the optical flow extraction or the particle video can also be called an optical flow field.
Alternatively, the motion displacement field prediction model may include a first transform module, a motion texture prediction model, and a second transform module.
The first transformation module is used for carrying out Fourier transformation on the sample image so as to obtain training image frequency domain data.
The motion texture prediction model is used for processing the training image frequency domain data to generate a training motion texture.
The second transformation module is used for performing inverse Fourier transformation on the training motion textures to obtain a training motion displacement field.
Training motion texture may be understood as a frequency domain representation of the training motion displacement field. The frequency domain data may be converted into time domain data by inverse fourier transform.
In the motion displacement field prediction model, the motion displacement field prediction model may be pre-trained.
Alternatively, the motion texture prediction model may include a compression model and a motion texture generation model. That is, the motion texture prediction model may be an LDM model.
The compression model is used for compressing the training image frequency domain data to obtain training image compression data.
The motion texture generation model is used to process the training image compression data to generate a training motion texture.
In the case where the training sample includes training subject region information, the motion texture generation model is used to process the training image compression data and the training subject region information to generate a training motion texture.
In the case where the training sample includes training subject area information and training motion information, the motion texture generation model is used to process the training image compression data, the training subject area information, and the training motion information to generate a training motion texture.
The pre-training of the motion displacement field prediction model may include a first pre-training of the compression model and a second pre-training of the motion texture generation model.
The first pre-training may be trained using the first pre-training data to obtain a compressed model. In the first pre-training process, the first pre-training data may be compressed using an initial compression model to obtain pre-training compressed data, and the pre-training compressed data may be decompressed using an initial decompression model to obtain pre-training decompressed data. The data amount of the pre-training compressed data is smaller than the data amount of the first pre-training data. In the first pre-training process, parameters of the initial compression model and the initial decompression model can be adjusted according to the difference between the pre-training decompression data and the first pre-training data, and the initial compression model after parameter adjustment can be a compression model in the motion texture prediction model.
The first pre-training data may be an image, and the first pre-training data may also be frequency domain data obtained by performing fourier transform on the image.
It should be appreciated that in the first pre-training process, the amount of first pre-training data may be one or more.
The second pre-trained pre-training data may include pre-training image frequency domain data and label motion textures. The pre-training image frequency domain data may be obtained by fourier transforming the pre-training image in the pre-training video. The label motion texture may be fourier transformed from the pre-training displacement field, that is, the label motion texture may be understood as a frequency domain representation of the pre-training displacement field. The pretraining displacement field represents a displacement of a plurality of pixels in a pretraining image in the pretraining video in at least one image subsequent to the pretraining image in the pretraining video. The pre-training displacement field can be obtained by processing the pre-training video in a particle-point video mode.
It should be appreciated that the amount of pre-training data used in the second pre-training process may be one or more.
In the second pre-training process, the pre-training image frequency domain data can be compressed by using the compression model to obtain pre-training image compression data, and the pre-training image compression data is processed by using the initial motion texture generation model to obtain the pre-training motion textures. According to the difference between the pre-training motion texture and the label motion texture, parameters of an initial motion texture generation model are adjusted, and the initial motion texture generation model after parameter adjustment can be a motion texture generation model in a motion texture prediction model.
The training data may or may not include training subject area information. In a second pre-training of the motion texture generation model, the subject area, or the influence of the subject area and the motion trend, is considered.
The second pre-trained pre-training data may include pre-training subject area information. The pre-training subject area information represents a pre-training subject area in which the pre-training subject is located in the pre-training image. In the pretraining displacement field represented by the label motion texture, the displacement of the pixels outside the pretraining body region is 0.
The second pre-trained pre-training data may also include pre-training subject region information and pre-training motion information. The pre-training motion information represents a pre-training motion trend of the pre-training subject. In the label motion texture, the positions of the pixels in the pre-training subject region indicate that the motion of the pre-training subject conforms to the pre-training motion trend.
Alternatively, the motion texture generation model may include a diffusion model and a decompression model. The decompression model may be an initial decompression model adjusted by the first pre-training.
The motion texture generation model is used for denoising the compressed frequency domain noise data a plurality of times according to the training image compression data, so as to obtain the training compressed motion texture.
The compressed frequency domain noise data may be obtained by compressing the frequency domain noise data by a compression model, or may be preset.
The decompression model may be used to decompress the training compressed motion texture to obtain the training motion texture.
The initial motion texture generation model may include the decompression model and an initial diffusion model. In the second pre-training process of the motion texture generation model, the frequency domain noise data can be compressed by using the compression model to obtain compressed frequency domain noise data. The initial diffusion model may be configured to perform multiple denoising processes on the compressed frequency domain noise data according to the pre-training image compression data, to obtain a pre-trained compressed motion texture. The decompression model may be used to decompress the pre-trained compressed motion texture to obtain the pre-training motion texture.
The image processing system trained by the method provided by the embodiment of the application can be applied to the image processing method shown in fig. 3.
The image processing method according to the embodiment of the present application is described in detail above with reference to fig. 3 to 9, and the apparatus embodiment of the present application will be described in detail below with reference to fig. 10 and 11. It should be understood that the image processing apparatus in the embodiment of the present application may perform the various image processing methods in the foregoing embodiment of the present application, that is, specific working procedures of the following various products may refer to corresponding procedures in the foregoing method embodiment.
Fig. 10 is a schematic block diagram of a system architecture according to an embodiment of the present application.
As shown in the system architecture 1100, the data acquisition device 1160 is configured to acquire training data, where in an embodiment of the present application, the training data includes: training samples and tag images, tag audio. The training samples may include sample images in a training video. The label image may be an image subsequent to the sample image in the training video and the label audio may be audio in the training video when the label image in the training video is displayed.
The data acquisition device 1160 may also be used to acquire the first pre-training data and the second pre-training data.
In the embodiment of the application, the first pre-training data is an image, and may also be frequency domain data obtained by performing fourier transform on the image.
The second pre-training data in the embodiment of the application may include pre-training image frequency domain data and label motion textures. The second pre-training data may further include pre-training subject area information. The second pre-training data may also include pre-training motion information.
The training data is stored in a database 1130 and the training device 1120 trains the target model/rule 1101 based on the training data maintained in the database 1130.
In the embodiment provided by the application, the image processing system is obtained through training. A detailed description of how the training device 1120 obtains the target model/rule 1101 based on training data may be found in the description of fig. 9. That is, the training apparatus 1120 may be used to perform the training method of the image processing system shown in fig. 9.
The object model/rule 1101 may be the image processing system 400 shown in fig. 4. The target model/rule 1101 can be used to implement the image processing method shown in fig. 3 according to the embodiment of the present application, that is, the image to be processed is input into the target model/rule 1101, and a target video can be obtained. The object model/rule 1101 in the embodiment of the present application may be specifically an image processing system.
In practical applications, the training data maintained in the database 1130 is not necessarily collected by the data collection device 1160, but may be received from other devices. It should be noted that the training device 1120 is not necessarily completely based on the training data maintained by the database 1130 to perform training of the target model/rule 1101, and it is also possible to obtain the training data from the cloud or other places to perform model training, and the above description should not be taken as limitation of the embodiment of the present application.
The target model/rule 1101 obtained through training by the training device 1120 may be applied to different systems or devices, for example, the execution device 1110 shown in fig. 10. The execution device 1110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an AR/VR device, or a vehicle-mounted terminal, and may also be a server, a cloud, or the like. In fig. 10, the execution device 1110 is configured with an I/O interface 1112 for data interaction with external devices. A user may input data to the I/O interface 1112 through the client device 1140. In the embodiment of the present application, the input data may include the image to be processed, and may further include a target subject indicated by the user in the image to be processed, a target motion trend of the target subject indicated by the user, and so on.
The preprocessing module 1113 and the preprocessing module 1114 are configured to perform preprocessing according to the input data (such as the image to be processed) received by the I/O interface 1112.
In the embodiment of the present application, the preprocessing module 1113 and the preprocessing module 1114 may be absent (or there may be only one preprocessing module), and the computation module 1111 may be used directly to process the input data.
When the execution device 1110 preprocesses the input data, or when the computation module 1111 of the execution device 1110 performs computation or other related processing, the execution device 1110 may call data, code, and the like in the data storage system 1150 for the corresponding processing, and may also store the data, instructions, and the like obtained through the corresponding processing in the data storage system 1150.
Finally, the I/O interface 1112 returns the processing result, such as the target video obtained as described above, to the client device 1140 to be provided to the user.
It should be noted that the training device 1120 may generate, based on different training data, a corresponding target model/rule 1101 for different targets or different tasks, and the corresponding target model/rule 1101 may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result.
In the case shown in fig. 10, the user may manually give the input data, and this manual input may be operated through an interface provided by the I/O interface 1112. In another case, the client device 1140 may automatically send the input data to the I/O interface 1112; if the user's authorization is required for the client device 1140 to automatically send the input data, the user may set the corresponding permission in the client device 1140. The user may view, at the client device 1140, the result output by the execution device 1110, and the result may be presented in the form of display, sound, action, or the like. The client device 1140 may also serve as a data collection terminal that collects, as new sample data, the input data fed into the I/O interface 1112 and the output result of the I/O interface 1112 shown in the figure, and stores the new sample data in the database 1130. Certainly, the client device 1140 may not perform the collection; instead, the I/O interface 1112 may directly store, as new sample data, the input data fed into the I/O interface 1112 and the output result of the I/O interface 1112 into the database 1130.
Fig. 11 to fig. 14 show application scenarios of the image processing method provided by the embodiment of the present application.
Fig. 11 (a) shows a graphical user interface (GUI) of an electronic device, which may be the client device 1140; the GUI is the desktop 1010 of the electronic device. Upon detecting an album start operation in which the user clicks an icon 1011 of an album application (APP) on the desktop 1010, the electronic device may start the album application and display another GUI shown in (b) in fig. 11. The GUI shown in (b) in fig. 11 may be referred to as a first album interface 1020. The first album interface 1020 may include a plurality of thumbnails.
When it is detected that the user clicks any one of the thumbnails on the first album interface 1020, the electronic device may display the GUI shown in (a) in fig. 12, which may be referred to as a second album interface 1030. The second album interface 1030 may include a video generation icon 1031 and the image 1032 to be processed corresponding to the thumbnail clicked by the user. When it is detected that the user clicks the video generation icon 1031, a video editing interface 1040 shown in (b) in fig. 12 may be displayed. The video editing interface 1040 includes the image 1032 to be processed and the text prompt "please select a target and indicate the movement of the target".
The user's finger may tap the target subject and slide on the screen. The subject that the user taps is the target subject.
The direction in which the user's finger slides may be the initial motion direction of the target subject; the position where the user's finger stops sliding may be the position where the motion amplitude of the target subject is maximum; and the sliding speed of the user's finger may be the initial speed or the maximum speed of the target subject. The target motion trend of the target subject may include one or more of the initial motion direction of the target subject, the position at which the motion amplitude is maximum, the initial speed or the maximum speed, and so on.
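As a minimal, non-limiting sketch, the tap-and-slide gesture described above could be mapped to a target motion trend as follows; the function name and the returned fields are assumptions introduced for illustration, and the embodiment of the present application does not prescribe a concrete interface.

```python
import math

def gesture_to_motion_trend(touch_points, timestamps):
    """Derive a target motion trend from a tap-and-slide gesture.

    touch_points: list of (x, y) screen coordinates along the slide.
    timestamps:   per-point times in seconds.
    """
    x0, y0 = touch_points[0]       # tap position, i.e. where the target subject is
    x1, y1 = touch_points[-1]      # end of slide, i.e. position of maximum motion amplitude
    dx, dy = x1 - x0, y1 - y0
    duration = max(timestamps[-1] - timestamps[0], 1e-6)
    return {
        "subject_position": (x0, y0),
        "start_direction": math.atan2(dy, dx),    # initial motion direction
        "max_amplitude_position": (x1, y1),
        "speed": math.hypot(dx, dy) / duration,   # initial or maximum speed
    }
```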
When it is detected that the user's finger stops sliding, the electronic device may execute the method shown in fig. 3 to generate the target video. After the target video is generated, the electronic device may display the video interface 1050 shown in fig. 13. The video interface 1050 includes a play icon 1051 of the target video. When it is detected that the user clicks the play icon 1051, the electronic device may play the target video.
In other embodiments, the user may also input the target movement trend of the target subject by voice or writing.
Fig. 14 shows a graphical user interface (GUI) of an electronic device; the GUI is a lock screen interface 1410 of the electronic device. The wallpaper image displayed on the lock screen interface 1410 may be understood as the image to be processed.
When it is detected that the user clicks any subject in the image to be processed, the electronic device may execute the method shown in fig. 3 with the subject clicked by the user as the target subject, generate the target video, and display the target video.
Alternatively, when it is detected that the user's sliding on the display screen has ended, the electronic device may take the subject at the sliding start position as the target subject, generate the target video according to the target motion trend expressed by the sliding operation, and display the target video.
Therefore, with the image processing method provided by the present application, personalized dynamic wallpaper can be generated, and the audio corresponding to each image in the dynamic wallpaper can be generated and played at the same time.
The moving target subject in the dynamic wallpaper may be indicated by the user, and the target subject may move according to the user's indication.
That is, the user does not need much expertise: by simply selecting an exquisite picture from the gallery as wallpaper, clicking the moving subject of interest in the picture, and dragging the subject appropriately, the user can see the desired dynamic wallpaper with sound.
It should be noted that fig. 10 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and does not constitute any limitation on the positional relationship among the devices, apparatuses, modules, and the like shown in the figure. For example, in fig. 10, the data storage system 1150 is an external memory relative to the execution device 1110, whereas in other cases the data storage system 1150 may also be disposed in the execution device 1110.
As shown in fig. 10, the target model/rule 1101 is obtained through training by the training device 1120, and the target model/rule 1101 may be the image processing system in the embodiment of the present application. Specifically, the image processing system provided by the embodiment of the present application may include a feature prediction model, an image generation model, and an audio generation model. The feature prediction model includes a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model, and an adjustment model. The image generation model, the audio generation model, and the motion displacement field prediction model, the motion feature extraction model, and the image feature extraction model in the feature prediction model may each be a convolutional neural network.
As described in the foregoing description of the basic concepts, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. The deep learning architecture refers to performing multiple levels of learning at different abstraction levels by using a machine learning algorithm. As a deep learning architecture, the CNN is a feed-forward artificial neural network in which individual neurons can respond to the data input thereto.
The motion displacement field prediction model may be pre-trained. The training device 1120 may train to obtain a motion displacement field prediction model based on the first pre-training data and the second pre-training data maintained in the database 1130.
Fig. 15 is a schematic diagram of an image processing apparatus according to an embodiment of the present application.
The image processing apparatus 1500 includes an acquisition unit 1510 and a processing unit 1520.
In some embodiments, the image processing apparatus 1500 may perform the image processing method shown in fig. 3. The image processing apparatus 1500 may be an execution device 1110.
The acquisition unit 1510 is configured to acquire an image to be processed.
The processing unit 1520 is configured to generate N target images in a target video and N target audios in the target video according to the image to be processed, where the N target images are in one-to-one correspondence with the N target audios, N is an integer greater than 1, and each target audio is used for playing when the target image corresponding to the target audio is displayed.
Optionally, the processing unit 1520 is configured to sequentially process, by using an image processing system, a plurality of first images to obtain at least one second image corresponding to each first image and the target audio corresponding to each second image, where the N target images include the at least one second image corresponding to each first image, the at least one second image corresponding to each first image is an image after the first image in the target video, the 1st first image processed by the image processing system is the image to be processed, the i-th first image processed by the image processing system is an image in the at least one second image corresponding to the (i-1)-th first image processed by the image processing system, i is a positive integer greater than 1, and the image processing system includes a neural network model obtained through training.
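The sequential processing described above can be pictured with the following sketch, in which the image_processing_system callable and its return values are assumptions used only to show the iteration over first images; it is not the actual implementation of the embodiment.

```python
def generate_target_video(image_to_process, image_processing_system, num_rounds):
    """Autoregressively generate target images and target audio.

    The 1st first image is the image to be processed; the i-th first image
    (i > 1) is taken from the second images produced in round i-1.
    """
    target_images, target_audios = [], []
    first_image = image_to_process
    for _ in range(num_rounds):
        # Each round yields at least one second image and the audio clip of each second image.
        second_images, audios = image_processing_system(first_image)
        target_images.extend(second_images)
        target_audios.extend(audios)
        # The next first image is chosen from the second images of this round.
        first_image = second_images[-1]
    return target_images, target_audios
```

In this way, each round extends the target video, and the output of one round supplies the input of the next.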
Optionally, the acquisition unit 1510 is further configured to acquire a target subject indicated by the user in the image to be processed.
When the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image and target subject area information to obtain the at least one second image corresponding to the first image and the target audio corresponding to each second image, where the target subject area information represents a target subject area, the target subject is recorded in the target subject area in the first image, and the content recorded in the areas other than the target subject area in each second image is the same as the content recorded in those areas in the first image.
Optionally, in the case that the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image and target subject area information to obtain at least one second image corresponding to the first image and the target audio corresponding to each second image.
Optionally, the acquisition unit 1510 is further configured to acquire a target motion trend of the target subject indicated by the user.
When the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image, the target subject area information, and motion information to obtain the target image corresponding to each second moment in the at least one second moment and the target audio corresponding to each second moment, where the target subject in the at least one target image corresponding to the at least one second moment moves according to the target motion trend represented by the motion information.
Optionally, the image processing system comprises a feature prediction model, an image generation model and an audio generation model.
The feature prediction model is used for processing the first image to obtain at least one prediction feature, and second moments corresponding to different prediction features are different, wherein each second moment is a moment after a first moment corresponding to the first image in the target video.
The image generation model is used for respectively processing the at least one prediction feature to obtain a second image corresponding to each second moment.
The audio generation model is used for respectively processing the at least one prediction feature to obtain target audio corresponding to each second moment.
Optionally, the feature prediction model includes a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model, and an adjustment model.
The motion displacement field prediction model is used for processing the first image to obtain a motion displacement field corresponding to each second moment, and the motion displacement field corresponding to each second moment represents the displacement of a plurality of pixels in the first image at the second moment relative to the first moment.
The motion feature extraction model is used for extracting features of the at least one motion displacement field respectively to obtain motion features corresponding to each second moment.
The image feature extraction model is used for extracting features of the first image to obtain image features.
The adjustment model is used for adjusting the image characteristics according to the motion characteristics corresponding to each second moment so as to obtain the prediction characteristics corresponding to the second moment.
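Taken together, the models described above could be chained as in the following sketch. All model objects, their inputs and outputs, and the conditioning step are assumptions for illustration; the embodiment of the present application does not limit the concrete network interfaces.

```python
def predict_frames_and_audio(first_image, models, second_times):
    """Sketch of the feature prediction and generation pipeline.

    models: dict with 'motion_field', 'motion_feat', 'image_feat',
            'adjust', 'image_gen', and 'audio_gen' callables.
    """
    # One motion displacement field per second moment: the displacement of
    # pixels of the first image at that moment relative to the first moment.
    motion_fields = models["motion_field"](first_image, second_times)

    # Image features are extracted once from the first image.
    image_features = models["image_feat"](first_image)

    second_images, target_audios = [], []
    for field in motion_fields:
        motion_features = models["motion_feat"](field)
        # Adjust the image features with the motion features of this moment
        # to obtain the prediction features of this moment.
        pred_features = models["adjust"](image_features, motion_features)
        second_images.append(models["image_gen"](pred_features))
        target_audios.append(models["audio_gen"](pred_features))
    return second_images, target_audios
```

In this sketch the image features are extracted once and then conditioned on the motion features of each second moment, which matches the division of labor among the models described above.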
Optionally, the acquisition unit 1510 is further configured to acquire a target subject indicated by the user in the image to be processed.
The motion displacement field prediction model is used for processing the first image and target subject area information to obtain the motion displacement field corresponding to each second moment, where the target subject area information represents a target subject area, and the target subject is recorded in the target subject area in the image to be processed.
The motion displacement field corresponding to each second moment represents that the displacement of out-of-area pixels is 0, where the out-of-area pixels are pixels located outside the target subject area in the first image.
Optionally, the acquisition unit 1510 is further configured to acquire a target motion trend of the target subject indicated by the user.
When the first image processed by the image processing system is the image to be processed, the motion displacement field prediction model is used for processing the first image, the target subject area information, and the motion information to obtain the motion displacement field corresponding to each second moment, where the displacement of the target subject pixels represented by the motion displacement field corresponding to each second moment conforms to the target motion trend, and the target subject pixels are the pixels located on the target subject in the first image.
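A minimal sketch of constraining the motion displacement field with the target subject area, as described above, is given below; the array layout and names are assumptions introduced for illustration.

```python
import numpy as np

def constrain_motion_field(motion_field, subject_mask):
    """Zero the displacement of pixels outside the target subject area.

    motion_field: H x W x 2 array of per-pixel (dx, dy) displacements.
    subject_mask: H x W boolean array, True inside the target subject area.
    """
    constrained = motion_field.copy()
    constrained[~subject_mask] = 0.0   # out-of-area pixels have zero displacement
    return constrained
```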
In other embodiments, image processing apparatus 1500 may perform the training method of the image processing system shown in FIG. 9. The image processing apparatus 1500 may be a training device 1120.
The acquisition unit 1510 is configured to acquire a training sample, a label image, and label audio.
The processing unit 1520 is configured to process the training sample by using the initial image processing system to obtain a training image and training audio.
The processing unit 1520 is further configured to adjust parameters of the initial image processing system according to a first difference between the training image and the label image and a second difference between the training audio and the label audio, where the adjusted initial image processing system is the trained image processing system.
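One training step combining the first difference (between the training image and the label image) and the second difference (between the training audio and the label audio) could look like the following sketch; the specific loss functions, the weighting, and the system and optimizer objects are assumptions and are not limited by the embodiment of the present application.

```python
import torch.nn.functional as F

def training_step(system, optimizer, sample, label_image, label_audio, audio_weight=1.0):
    """One parameter update of the initial image processing system."""
    train_image, train_audio = system(sample)
    # First difference: between the training image and the label image.
    image_loss = F.l1_loss(train_image, label_image)
    # Second difference: between the training audio and the label audio.
    audio_loss = F.l1_loss(train_audio, label_audio)
    loss = image_loss + audio_weight * audio_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice, the sum of the two differences drives the parameters of the feature prediction model, the image generation model, and the audio generation model to be adjusted jointly.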
The image processing apparatus 1500 is embodied in the form of functional units. The term "unit" herein may be implemented in the form of software and/or hardware, which is not specifically limited.
For example, a "unit" may be a software program, a hardware circuit or a combination of both that implements the functions described above. The hardware circuitry may include Application Specific Integrated Circuits (ASICs), electronic circuits, processors (e.g., shared, proprietary, or group processors, etc.) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components that support the described functions.
Thus, the elements of the examples described in the embodiments of the present application can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The application also provides a chip, where the chip includes a data interface and one or more processors. The one or more processors read, through the data interface, instructions stored in a memory and execute the instructions to implement the image processing method and/or the training method of the image processing system described in the foregoing method embodiments.
The one or more processors may be general-purpose processors or special-purpose processors. For example, the one or more processors may be a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
The chip may be part of a terminal device or other electronic device. For example, the chip may be located in the electronic device 100.
The processor and the memory may be provided separately or may be integrated. For example, the processor and memory may be integrated on a System On Chip (SOC) of the terminal device. That is, the chip may also include a memory.
The memory may have a program stored thereon, the program being executable by the processor to generate instructions such that the processor performs the image processing method and/or the training method of the image processing system described in the above method embodiments according to the instructions.
Optionally, the memory may also have data stored therein. Alternatively, the processor may also read data stored in the memory, which may be stored at the same memory address as the program, or which may be stored at a different memory address than the program.
The memory may be used to store a related program of the image processing method provided in the embodiment of the present application, and the processor may be used to call the related program of the image processing method stored in the memory to implement the image processing method of the embodiment of the present application.
For example, the processor may be used to: acquire an image to be processed; and generate N target images in a target video and N target audios in the target video according to the image to be processed, where the N target images are in one-to-one correspondence with the N target audios, N is an integer greater than 1, and each target audio is used for playing when the target image corresponding to the target audio is displayed.
The memory may be used to store a program related to a training method of the image processing system provided in the embodiment of the present application, and the processor may be used to call the program related to the training method of the image processing system stored in the memory, so as to implement the training method of the image processing system of the embodiment of the present application.
For example, the processor may be used to: obtain a training sample, a label image, and label audio; process the training sample by using an initial image processing system to obtain a training image and training audio; and adjust parameters of the initial image processing system according to a first difference between the training image and the label image and a second difference between the training audio and the label audio, where the adjusted initial image processing system is the image processing system obtained through training.
The chip may be provided in an electronic device.
The application also provides a computer program product which, when executed by a processor, implements the image processing method and/or the training method of the image processing system according to any of the method embodiments of the application.
The computer program product may be stored in a memory, for example, as a program that is ultimately converted into an executable object file that can be executed by a processor through preprocessing, compiling, assembling, and linking processes.
The application also provides a computer readable storage medium having stored thereon a computer program which when executed by a computer implements the image processing method and/or the training method of the image processing system according to any of the method embodiments of the application. The computer program may be a high-level language program or an executable object program.
The computer readable storage medium is, for example, a memory. The memory may be a volatile memory or a nonvolatile memory, or may include both a volatile memory and a nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM), and a direct rambus random access memory (DR RAM).
The embodiments of the present application may involve the use of user data. In practical applications, user-specific personal data may be used in the solutions described herein within the scope permitted by applicable laws and regulations, provided that the applicable legal and regulatory requirements of the relevant country are met and the user's explicit consent is obtained (for example, the user is effectively notified).
In the description of the present application, the terms "first", "second", and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or a particular order or sequence. The specific meanings of the above terms in the present application can be understood by those of ordinary skill in the art according to specific situations.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative; for example, the division of the units is only one logic function division, and other division modes can be adopted in actual implementation; for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (12)
1. An image processing method, the method comprising:
Acquiring an image to be processed;
Generating N target images in a target video and N target audios in the target video according to the image to be processed, wherein the N target images are in one-to-one correspondence with the N target audios, N is an integer greater than 1, and each target audio is used for playing under the condition that the target image corresponding to the target audio is displayed.
2. The method of claim 1, wherein generating N target images in a target video and N target audio in the target video from the image to be processed comprises:
Processing a plurality of first images sequentially by using an image processing system to obtain at least one second image corresponding to each first image and the target audio corresponding to each second image, wherein the N target images comprise the at least one second image corresponding to each first image, the at least one second image corresponding to each first image is an image after the first image in the target video, the 1st first image processed by the image processing system is the image to be processed, the i-th first image processed by the image processing system is an image in the at least one second image corresponding to the (i-1)-th first image processed by the image processing system, i is a positive integer greater than 1, and the image processing system comprises a neural network model obtained through training.
3. The method according to claim 2, wherein the method further comprises: acquiring a target subject indicated by a user in the image to be processed;
When the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image and target subject area information to obtain at least one second image corresponding to the first image and the target audio corresponding to each second image, the target subject area information represents a target subject area, the target subject is recorded in the target subject area in the first image, and the content recorded in areas other than the target subject area in each second image is the same as the content recorded in those areas in the first image.
4. The method according to claim 3, wherein, in the case where the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image and the target subject area information to obtain the at least one second image corresponding to the first image and the target audio corresponding to each second image.
5. The method according to claim 3 or 4, wherein the method further comprises: acquiring a target motion trend of the target subject indicated by the user;
When the first image processed by the image processing system is the image to be processed, the image processing system is configured to process the first image, the target subject area information, and motion information to obtain the target image corresponding to each second moment in the at least one second moment and the target audio corresponding to each second moment, and the target subject in at least one target image corresponding to the at least one second moment moves according to the target motion trend represented by the motion information.
6. The method of claim 2, wherein the image processing system comprises a feature prediction model, an image generation model, and an audio generation model;
The feature prediction model is used for processing the first image to obtain at least one prediction feature, and second moments corresponding to different prediction features are different, wherein each second moment is a moment after a first moment corresponding to the first image in the target video;
The image generation model is used for respectively processing the at least one prediction feature to obtain a second image corresponding to each second moment;
the audio generation model is used for respectively processing the at least one prediction feature to obtain target audio corresponding to each second moment.
7. The method of claim 6, wherein the feature prediction model comprises a motion displacement field prediction model, a motion feature extraction model, an image feature extraction model, and an adjustment model;
The motion displacement field prediction model is used for processing the first image to obtain a motion displacement field corresponding to each second moment, and the motion displacement field corresponding to each second moment represents the displacement of a plurality of pixels in the first image relative to the first moment at the second moment;
The motion feature extraction model is used for extracting features of the at least one motion displacement field respectively to obtain motion features corresponding to each second moment;
The image feature extraction model is used for extracting features of the first image to obtain image features;
The adjustment model is used for adjusting the image characteristics according to the motion characteristics corresponding to each second moment so as to obtain the prediction characteristics corresponding to the second moment.
8. The method of claim 7, wherein the method further comprises: acquiring a target subject indicated by a user in the image to be processed;
The motion displacement field prediction model is used for processing the first image and target subject area information to obtain the motion displacement field corresponding to each second moment, the target subject area information represents a target subject area, and the target subject is recorded in the target subject area in the image to be processed;
The motion displacement field corresponding to each second moment represents that the displacement of out-of-area pixels is 0, and the out-of-area pixels are pixels located outside the target subject area in the first image.
9. The method of claim 8, wherein the method further comprises: acquiring a target motion trend of the target subject indicated by a user;
When the first image processed by the image processing system is the image to be processed, the motion displacement field prediction model is used for processing the first image, the target subject area information, and the motion information to obtain the motion displacement field corresponding to each second moment, the displacement of the target subject pixels represented by the motion displacement field corresponding to each second moment conforms to the target motion trend, and the target subject pixels are the pixels located on the target subject in the first image.
10. An electronic device, the electronic device comprising: one or more processors, and memory;
The memory being coupled to the one or more processors, the memory being for storing computer program code comprising computer instructions that the one or more processors invoke to cause the electronic device to perform the method of any of claims 1-9.
11. A chip system for application to an electronic device, the chip system comprising one or more processors to invoke computer instructions to cause the electronic device to perform the method of any of claims 1 to 9.
12. A computer readable storage medium comprising instructions that, when run on an electronic device, cause the electronic device to perform the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410339775.5A CN118101856A (en) | 2024-03-25 | 2024-03-25 | Image processing method and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410339775.5A CN118101856A (en) | 2024-03-25 | 2024-03-25 | Image processing method and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118101856A true CN118101856A (en) | 2024-05-28 |
Family
ID=91154833
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410339775.5A Pending CN118101856A (en) | 2024-03-25 | 2024-03-25 | Image processing method and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118101856A (en) |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160148057A1 (en) * | 2014-11-26 | 2016-05-26 | Hanwha Techwin Co., Ltd. | Camera system and operating method of the same |
CN116071248A (en) * | 2021-11-02 | 2023-05-05 | 华为技术有限公司 | Image processing method and related equipment |
CN115359156A (en) * | 2022-07-31 | 2022-11-18 | 荣耀终端有限公司 | Audio playing method, device, equipment and storage medium |
CN115061770A (en) * | 2022-08-10 | 2022-09-16 | 荣耀终端有限公司 | Method and electronic device for displaying dynamic wallpaper |
CN116781992A (en) * | 2023-06-27 | 2023-09-19 | 北京爱奇艺科技有限公司 | Video generation method, device, electronic equipment and storage medium |
CN117177025A (en) * | 2023-08-14 | 2023-12-05 | 科大讯飞股份有限公司 | Video generation method, device, equipment and storage medium |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118651557A (en) * | 2024-08-19 | 2024-09-17 | 苏州市伏泰信息科技股份有限公司 | Dustbin classification supervision method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||