CN115601612A - Image prediction model training method and device, electronic equipment and storage medium - Google Patents
Image prediction model training method and device, electronic equipment and storage medium
- Publication number: CN115601612A
- Application number: CN202211318456.3A
- Authority: CN (China)
- Prior art keywords: image, image sequence, training, prediction model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V 10/774 — Image or video recognition or understanding using pattern recognition or machine learning: generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
- G06N 3/08 — Neural networks: learning methods
- G06V 10/82 — Image or video recognition or understanding using neural networks
Abstract
The invention discloses an image prediction model training method and device, an electronic device, and a storage medium. The method comprises the following steps: acquiring an image sequence sample and the image sequence label corresponding to the image sequence sample; inputting the image sequence sample into the generator of a generative adversarial network to obtain a predicted image sequence; inputting the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and training the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model. The generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, which are used to extract image temporal information, including frame information of the images, from the input. By using three-dimensional convolution and three-dimensional deconvolution, this technical scheme improves the image prediction model's ability to capture image temporal information, so that the semantic information of the predicted images is clearer.
Description
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for training an image prediction model, an electronic device, and a storage medium.
Background
In recent years, deep learning has developed rapidly in both academia and industry, achieving remarkable results in fields such as computer vision, speech recognition, and natural language processing.
Predicting future video frames from the feature information of observed frames is a research task that attracts considerable attention in computer vision. The overall objective of the task is: given a series of input video frames, output predicted future video frames based on the information they contain.
In the process of implementing the present invention, the inventors found that the prior art suffers from at least the following technical problem: the semantic information of the predicted frames is unclear.
Disclosure of Invention
The invention provides an image prediction model training method and device, an electronic device, and a storage medium, so as to improve the accuracy of predicted semantic information.
According to an aspect of the present invention, there is provided an image prediction model training method, including:
acquiring a plurality of groups of training sample data, wherein the training sample data comprise an image sequence sample and an image sequence label corresponding to the image sequence sample;
inputting the image sequence sample into the generator of a generative adversarial network to obtain a predicted image sequence;
inputting the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and training the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model;
wherein the generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, the three-dimensional convolution and the three-dimensional deconvolution are used for extracting image temporal information from input information, and the image temporal information comprises frame information of the images.
According to another aspect of the present invention, there is provided an image prediction model training apparatus, including:
a training sample data acquisition module, used for acquiring a plurality of groups of training sample data, wherein the training sample data comprise an image sequence sample and an image sequence label corresponding to the image sequence sample;
a generator processing module, used for inputting the image sequence samples into the generator of a generative adversarial network to obtain a predicted image sequence;
a model training module, used for inputting the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and training the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model;
wherein the generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, the three-dimensional convolution and the three-dimensional deconvolution are used for extracting image temporal information from input information, and the image temporal information comprises frame information of the images.
According to another aspect of the present invention, there is provided an image prediction method including:
acquiring an original image sequence;
and inputting the original image sequence into an image prediction model trained according to any embodiment of the present invention to obtain a target image sequence.
According to another aspect of the present invention, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor, and the computer program, when executed by the at least one processor, enables the at least one processor to perform the image prediction model training method, or the image prediction method, according to any embodiment of the present invention.
According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to implement the image prediction model training method or the image prediction method according to any one of the embodiments of the present invention when executed.
According to the technical scheme of the embodiments of the present invention, an image sequence sample and the image sequence label corresponding to the image sequence sample are obtained; the image sequence sample is input into the generator of a generative adversarial network to obtain a predicted image sequence; the predicted image sequence and the image sequence label corresponding to the image sequence sample are input into the discriminator of the generative adversarial network to obtain a discrimination result, and the generative adversarial network is trained based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model; the generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, which are used to extract image temporal information, including frame information of the images, from the input. By using three-dimensional convolution and three-dimensional deconvolution, this technical scheme improves the image prediction model's ability to capture image temporal information, so that the semantic information of the predicted images is clearer.
It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present invention, nor are they intended to limit the scope of the invention. Other features of the present invention will become apparent from the following description.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below illustrate only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of an image prediction model training method according to an embodiment of the present invention;
FIG. 2 is a schematic structural diagram of a U-NET network unit according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a structure of a discriminator according to an embodiment of the invention;
FIG. 4 is a flowchart of an image prediction model training method according to a second embodiment of the present invention;
FIG. 5 is a comparison of video frame prediction results of different models according to the second embodiment of the present invention;
FIG. 6 is a graph of PSNR variation with frame number for different models according to the second embodiment of the present invention;
FIG. 7 is a graph of SSIM variation with frame number for different models according to the second embodiment of the present invention;
FIG. 8 is a flowchart of an image prediction model training method according to a third embodiment of the present invention;
FIG. 9 is a schematic structural diagram of an image prediction model training apparatus according to a fourth embodiment of the present invention;
FIG. 10 is a schematic diagram of an image prediction apparatus according to a fifth embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an electronic device implementing the image prediction model training method according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example one
Fig. 1 is a flowchart of an image prediction model training method according to an embodiment of the present invention, where the embodiment is applicable to a case of training a prediction model of a video or an image sequence, and the method may be executed by an image prediction model training apparatus, where the image prediction model training apparatus may be implemented in a form of hardware and/or software, and the image prediction model training apparatus may be configured in a computer terminal. As shown in fig. 1, the method includes:
s110, obtaining multiple groups of training sample data, wherein the training sample data comprise image sequence samples and image sequence labels corresponding to the image sequence samples.
In this embodiment, the image sequence sample refers to a plurality of images arranged in time for model training. For example, the image sequence samples may be images or video frames containing a plurality of moving objects. The image sequence label refers to label information corresponding to the image sequence sample.
Specifically, training sample data may be obtained from a preset path of the electronic device, or training sample data may be obtained from other devices or a cloud, which is not limited herein.
S120, inputting the image sequence samples into the generator of a generative adversarial network to obtain a predicted image sequence.
S130, inputting the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and training the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model.
In this embodiment, the generator and the discriminator include three-dimensional convolution and three-dimensional deconvolution, which are used to extract image temporal information, including frame information of the images, from the input. The discrimination result characterizes the authenticity of the predicted image sequence.
It should be noted that, compared with two-dimensional convolution and two-dimensional deconvolution, three-dimensional convolution and three-dimensional deconvolution allow the generative adversarial network to accept continuous image sequence samples as input and improve its ability to capture image temporal information, so that the image semantic information predicted by the image prediction model is better defined.
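As an illustrative sketch only (the channel counts, kernel sizes, and strides are assumptions for the example, not the patent's actual configuration), the following minimal PyTorch snippet shows how three-dimensional convolution and deconvolution operate over a frame axis in addition to height and width:

```python
import torch
import torch.nn as nn

# A 5-D input (batch, channels, frames, height, width) lets Conv3d slide its
# kernel along the frame axis too, mixing information across adjacent frames.
x = torch.randn(1, 3, 8, 64, 64)  # an 8-frame RGB clip at 64x64 resolution

conv3d = nn.Conv3d(3, 16, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1))
deconv3d = nn.ConvTranspose3d(16, 3, kernel_size=(3, 4, 4), stride=(1, 2, 2), padding=(1, 1, 1))

h = conv3d(x)    # (1, 16, 8, 32, 32): spatially downsampled, all 8 frames kept
y = deconv3d(h)  # (1, 3, 8, 64, 64): deconvolution restores the spatial size
print(h.shape, y.shape)
```

A plain two-dimensional convolution would have to process each frame independently or stack frames as channels, which is precisely what limits its ability to capture temporal structure.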
In some optional embodiments, the generator includes a U-NET network unit for acquiring higher order image information in the image sequence samples.
An exemplary network structure of a U-NET network unit is shown in FIG. 2. It should be noted that, in this embodiment, a U-NET network unit is added to the generator, so that the decoding part of the generator can effectively receive high-order image information, which greatly improves the quality of the images predicted by the image prediction model. The network structure of the discriminator is shown in FIG. 3.
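The patent does not disclose the exact layer configuration of FIG. 2, so the following is only a hypothetical sketch of the idea: a 3-D encoder-decoder generator whose decoding path receives high-order encoder features through a U-NET-style skip connection. All sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class UNet3DGenerator(nn.Module):
    """Two-level 3-D encoder-decoder with one skip connection (illustrative)."""
    def __init__(self, channels=3):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv3d(channels, 32, 3, padding=1), nn.ReLU())
        self.down = nn.Conv3d(32, 64, (1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1))
        self.up = nn.ConvTranspose3d(64, 32, (1, 4, 4), stride=(1, 2, 2), padding=(0, 1, 1))
        # The decoder sees both its own upsampled features and the encoder's
        # high-order features via channel concatenation (the U-NET skip).
        self.dec1 = nn.Sequential(nn.Conv3d(64, channels, 3, padding=1), nn.Tanh())

    def forward(self, x):
        e1 = self.enc1(x)                        # encoder features
        b = self.down(e1)                        # bottleneck
        d1 = self.up(b)                          # upsampled back to e1's size
        return self.dec1(torch.cat([d1, e1], dim=1))  # skip connection

g = UNet3DGenerator()
frames = torch.randn(1, 3, 8, 64, 64)
print(g(frames).shape)  # torch.Size([1, 3, 8, 64, 64])
```

The concatenation along the channel axis is what lets the decoding part receive high-order image information directly from the encoding part, instead of relying solely on the bottleneck.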
According to the technical scheme of this embodiment, an image sequence sample and the image sequence label corresponding to the image sequence sample are obtained; the image sequence sample is input into the generator of a generative adversarial network to obtain a predicted image sequence; the predicted image sequence and the image sequence label corresponding to the image sequence sample are input into the discriminator of the generative adversarial network to obtain a discrimination result, and the generative adversarial network is trained based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model; the generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, which are used to extract image temporal information, including frame information of the images, from the input. By using three-dimensional convolution and three-dimensional deconvolution, this technical scheme improves the image prediction model's ability to capture image temporal information, so that the semantic information of the predicted images is clearer.
Example two
Fig. 4 is a flowchart of an image prediction model training method provided in the second embodiment of the present invention, and the method in this embodiment may be combined with each alternative of the image prediction model training method provided in the foregoing embodiment. The image prediction model training method provided by this embodiment is further refined. Optionally, the generative adversarial network is an improved network based on the WGAN-GP network model.
As shown in fig. 4, the method includes:
s210, obtaining a plurality of groups of training sample data, wherein the training sample data comprises image sequence samples and image sequence labels corresponding to the image sequence samples.
S220, inputting the image sequence samples into the generator of the generative adversarial network to obtain a predicted image sequence, wherein the generative adversarial network is an improved network based on a WGAN-GP network model.
S230, inputting the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and training the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model.
In this embodiment, the generative adversarial network is an improved network based on the WGAN-GP network model, where WGAN-GP refers to a GAN based on the Earth Mover's (Wasserstein) distance with a gradient penalty, which is easier to train.
In some alternative embodiments, the image sequence samples are consecutive video frames of a predetermined number of frames, the image sequence labels are the labels corresponding to those consecutive video frames, the predicted image sequence may be predicted future video frames, and the image temporal information may be video temporal information.
Specifically, consecutive video frames and the labels corresponding to them are obtained; the consecutive video frames are input into the generator of the generative adversarial network to obtain predicted future video frames; the predicted future video frames and the labels corresponding to the consecutive video frames are input into the discriminator of the generative adversarial network to obtain a discrimination result, and the generative adversarial network is trained based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model. The generator and the discriminator include three-dimensional convolution and three-dimensional deconvolution, which extract video temporal information, including frame information of the video, from the input. This improves the image prediction model's ability to capture video temporal information, so that the semantic information of the predicted video is clearer. In some alternative embodiments, the consecutive video frames may be the first m frames of a video, the predicted future video frames may be the following n frames, and m and n may be equal or different, which is not limited herein; a hypothetical slicing of such (sample, label) pairs is sketched below.
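As an illustration of this sample construction (the patent does not prescribe concrete values of m and n, and the helper name is hypothetical), training pairs might be sliced from a raw clip as follows:

```python
import torch

def make_training_pairs(video, m=4, n=4):
    """Split a clip of shape (frames, C, H, W) into m-frame input samples
    and n-frame labels of the frames that immediately follow (illustrative)."""
    pairs = []
    for start in range(0, video.shape[0] - m - n + 1, m + n):
        sample = video[start:start + m]          # first m frames -> generator input
        label = video[start + m:start + m + n]   # next n frames  -> ground-truth label
        # Rearrange to (C, frames, H, W) as expected by Conv3d (batch dim added later).
        pairs.append((sample.permute(1, 0, 2, 3), label.permute(1, 0, 2, 3)))
    return pairs

clip = torch.randn(20, 3, 64, 64)  # a 20-frame clip
pairs = make_training_pairs(clip)
print(len(pairs), pairs[0][0].shape, pairs[0][1].shape)
# 2 torch.Size([3, 4, 64, 64]) torch.Size([3, 4, 64, 64])
```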
In some optional embodiments, the loss function of the generative adversarial network is:
L = L_adv + λ·L_1
where L denotes the overall loss function of the generative adversarial network, L_adv denotes the adversarial loss with gradient penalty, λ denotes a hyper-parameter, and L_1 denotes the mean absolute error loss function.
In some optional embodiments, the adversarial loss with gradient penalty is:
L_adv = E_z[D(G(z))] - E_x[D(x)] + λ_GP·L_GP
where E denotes expectation; x denotes an image sequence randomly drawn from the image sequence labels V_label, and D(x) denotes the discriminator's result for an image sequence randomly drawn from the image sequence labels; z denotes an image sequence randomly drawn from the image sequence samples V_input, where V_input ∈ R^(T×C×H×W) and T denotes the number of frames; G(z) denotes the generator's result for an image sequence randomly drawn from the image sequence samples; λ_GP denotes a hyper-parameter for balancing the adversarial loss and the gradient penalty; and L_GP denotes the penalty term that makes the discriminator satisfy 1-Lipschitz continuity:
L_GP = E[(‖∇_x̂ D(x̂)‖_2 - 1)^2], with x̂ = α·x + (1 - α)·G(z)
where ∇ denotes the gradient, x̂ denotes a random interpolation between V_label and the generated sequence, and α is a random number in (0, 1).
It can be understood that the adversarial loss with gradient penalty poses a min-max optimization problem: the discriminator measures the distance between the source distribution and the target distribution and feeds it back to the generator, and the generator gradually reduces this distance through parameter optimization, so that the source distribution finally approaches the target distribution. Here, the source distribution is V_input and the target distribution is V_label.
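A minimal sketch of the penalty term L_GP, assuming the standard WGAN-GP interpolation between label sequences and generated sequences; the helper name and tensor layout are illustrative assumptions:

```python
import torch

def gradient_penalty(discriminator, real, fake):
    """Penalty term pushing the discriminator toward 1-Lipschitz continuity."""
    # One interpolation coefficient per sample, broadcast over (C, T, H, W).
    alpha = torch.rand(real.size(0), 1, 1, 1, 1, device=real.device)  # uniform in [0, 1)
    x_hat = alpha * real + (1 - alpha) * fake   # random point between the two
    x_hat.requires_grad_(True)
    d_out = discriminator(x_hat)
    grads = torch.autograd.grad(outputs=d_out.sum(), inputs=x_hat,
                                create_graph=True)[0]
    # Penalize deviation of each sample's gradient norm from 1.
    return ((grads.flatten(1).norm(2, dim=1) - 1) ** 2).mean()
```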
In some alternative embodiments, the mean absolute error loss function is:
L_1 = (1/N)·Σ_{m=1..N} |X_m - Y_m|
where X_m denotes the m-th image frame in the image sequence label, Y_m denotes the m-th image frame in the predicted image sequence, and N denotes the number of frames.
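Putting the two terms together, a hedged sketch of one training step under the combined loss L = L_adv + λ·L_1, reusing the gradient_penalty helper above; the split into separate discriminator and generator objectives and the hyper-parameter values follow common WGAN-GP practice rather than values stated in the patent:

```python
import torch
import torch.nn.functional as F

lam, lam_gp = 100.0, 10.0  # illustrative hyper-parameters λ and λ_GP

def train_step(G, D, opt_g, opt_d, v_input, v_label):
    # --- discriminator: minimize D(fake) - D(real) + gradient penalty ---
    fake = G(v_input).detach()
    loss_d = (D(fake).mean() - D(v_label).mean()
              + lam_gp * gradient_penalty(D, v_label, fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # --- generator: adversarial term plus mean absolute error (L_1) ---
    fake = G(v_input)
    loss_g = -D(fake).mean() + lam * F.l1_loss(fake, v_label)
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()
    return loss_d.item(), loss_g.item()
```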
Illustratively, this embodiment performed experiments on the Moving-MNIST data set, and the experimental results are shown in FIGS. 5, 6, and 7. FIG. 5 is a comparison of video frame prediction results of different models according to the second embodiment of the present invention; in FIG. 5, (a) shows the prediction results of the WGAN model for the 1st, 3rd, and 6th frames, (b) shows the prediction results of the TGAN model for the 1st, 3rd, and 6th frames, (c) shows the prediction results of the FutureGAN model for the 1st, 3rd, and 6th frames, (d) shows the prediction results of the FRNN model for the 1st, 3rd, and 6th frames, and (e) shows the prediction results of the image prediction model of this embodiment for the 1st, 3rd, and 6th frames. FIG. 6 is a graph of PSNR variation with frame number for different models according to the second embodiment of the present invention, where PSNR denotes peak signal-to-noise ratio. FIG. 7 is a graph of SSIM variation with frame number for different models according to the second embodiment of the present invention, where SSIM denotes structural similarity. The experimental results show that the image prediction model of this embodiment has clear advantages over other known models in video prediction tasks, and that it enriches the semantics of the predicted frames while alleviating the degradation of video quality as the number of predicted frames increases.
EXAMPLE III
Fig. 8 is a flowchart of an image prediction method according to a third embodiment of the present invention, and the method of the present embodiment may be combined with various alternatives of the image prediction model training method provided in the foregoing embodiments.
As shown in fig. 8, the method includes:
s310, obtaining a plurality of groups of training sample data, wherein the training sample data comprises image sequence samples and image sequence labels corresponding to the image sequence samples.
S320, inputting the image sequence samples into the generator of the generative adversarial network to obtain a predicted image sequence.
S330, inputting the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and training the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model.
S340, acquiring an original image sequence.
S350, inputting the original image sequence into the image prediction model to obtain a target image sequence.
In this embodiment, three-dimensional convolution and three-dimensional deconvolution improve the image prediction model's ability to capture image temporal information, so that the semantic information of the predicted images is better defined.
In some alternative embodiments, the original image sequence may be consecutive video frames and, correspondingly, the target image sequence may be predicted future video frames. Three-dimensional convolution and three-dimensional deconvolution improve the image prediction model's ability to capture video temporal information, so that the semantic information of the predicted video is clearer. A minimal inference sketch follows.
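A brief sketch of the inference steps S340-S350, assuming a trained generator whose input and output are (batch, channels, frames, height, width) tensors, as in the generator sketched earlier; the function name is hypothetical:

```python
import torch

@torch.no_grad()
def predict(generator, original_sequence):
    """original_sequence: (C, frames, H, W) tensor of consecutive observed frames."""
    generator.eval()
    batch = original_sequence.unsqueeze(0)   # add a batch dimension
    target_sequence = generator(batch)       # predicted future frames
    return target_sequence.squeeze(0)        # back to (C, frames, H, W)

# e.g. feed observed frames to a generator (here the earlier UNet3DGenerator
# sketch with untrained weights, so this demonstrates shapes only)
observed = torch.randn(3, 8, 64, 64)
future = predict(UNet3DGenerator(), observed)
print(future.shape)  # torch.Size([3, 8, 64, 64])
```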
Example four
Fig. 9 is a schematic structural diagram of an image prediction model training device according to a fourth embodiment of the present invention. As shown in fig. 9, the apparatus includes:
a training sample data obtaining module 410, configured to obtain multiple sets of training sample data, where the training sample data include an image sequence sample and an image sequence label corresponding to the image sequence sample;
a generator processing module 420, configured to input the image sequence samples into the generator of a generative adversarial network to obtain a predicted image sequence;
a model training module 430, configured to input the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and to train the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model;
wherein the generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, the three-dimensional convolution and the three-dimensional deconvolution are used for extracting image temporal information from input information, and the image temporal information comprises frame information of the images.
According to the technical scheme of this embodiment, an image sequence sample and the image sequence label corresponding to the image sequence sample are obtained; the image sequence sample is input into the generator of a generative adversarial network to obtain a predicted image sequence; the predicted image sequence and the image sequence label corresponding to the image sequence sample are input into the discriminator of the generative adversarial network to obtain a discrimination result, and the generative adversarial network is trained based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model; the generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, which are used to extract image temporal information, including frame information of the images, from the input. By using three-dimensional convolution and three-dimensional deconvolution, this technical scheme improves the image prediction model's ability to capture image temporal information, so that the semantic information of the predicted images is clearer.
In some optional embodiments, the generator comprises a U-NET network unit for acquiring higher order image information in the image sequence samples.
In some optional embodiments, the generative adversarial network is an improved network based on the WGAN-GP network model.
In some alternative embodiments, the image sequence samples are consecutive video frames of a predetermined number of frames.
In some optional embodiments, the loss function of the generative adversarial network is:
L = L_adv + λ·L_1
where L denotes the overall loss function of the generative adversarial network, L_adv denotes the adversarial loss with gradient penalty, λ denotes a hyper-parameter, and L_1 denotes the mean absolute error loss function.
In some optional embodiments, the adversarial loss with gradient penalty is:
L_adv = E_z[D(G(z))] - E_x[D(x)] + λ_GP·L_GP
where E denotes expectation; x denotes an image sequence randomly drawn from the image sequence labels V_label, and D(x) denotes the discriminator's result for such a sequence; z denotes an image sequence randomly drawn from the image sequence samples V_input, and G(z) denotes the generator's result for such a sequence; λ_GP denotes a hyper-parameter for balancing the adversarial loss and the gradient penalty; and L_GP denotes the penalty term that makes the discriminator satisfy 1-Lipschitz continuity.
In some optional embodiments, the mean absolute error loss function is:
L_1 = (1/N)·Σ_{m=1..N} |X_m - Y_m|
where X_m denotes the m-th image frame in the image sequence label, Y_m denotes the m-th image frame in the predicted image sequence, and N denotes the number of frames.
The image prediction model training device provided by the embodiment of the invention can execute the image prediction model training method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method.
EXAMPLE five
Fig. 10 is a schematic structural diagram of an image prediction apparatus according to a fifth embodiment of the present invention. As shown in fig. 10, the apparatus includes:
a training sample data obtaining module 510, configured to obtain multiple sets of training sample data, where the training sample data includes an image sequence sample and an image sequence tag corresponding to the image sequence sample;
a generator processing module 520, configured to input the image sequence samples into the generator of a generative adversarial network to obtain a predicted image sequence;
a model training module 530, configured to input the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and to train the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model;
an original image sequence obtaining module 540, configured to obtain an original image sequence;
and a target image sequence prediction module 550, configured to input the original image sequence to an image prediction model, so as to obtain a target image sequence.
The image prediction device provided by the embodiment of the invention can execute the image prediction method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Example six
FIG. 11 illustrates a block diagram of an electronic device 10 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.
As shown in fig. 11, the electronic device 10 includes at least one processor 11, and a memory communicatively connected to the at least one processor 11, such as a Read Only Memory (ROM) 12, a Random Access Memory (RAM) 13, and the like, wherein the memory stores a computer program executable by the at least one processor, and the processor 11 can perform various suitable actions and processes according to the computer program stored in the Read Only Memory (ROM) 12 or the computer program loaded from a storage unit 18 into the Random Access Memory (RAM) 13. In the RAM 13, various programs and data necessary for the operation of the electronic apparatus 10 can also be stored. The processor 11, the ROM 12, and the RAM 13 are connected to each other via a bus 14. An input/output (I/O) interface 15 is also connected to bus 14.
A number of components in the electronic device 10 are connected to the I/O interface 15, including: an input unit 16 such as a keyboard, a mouse, or the like; an output unit 17 such as various types of displays, speakers, and the like; a storage unit 18 such as a magnetic disk, an optical disk, or the like; and a communication unit 19 such as a network card, modem, wireless communication transceiver, etc. The communication unit 19 allows the electronic device 10 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The processor 11 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of processor 11 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. Processor 11 performs the various methods and processes described above, such as an image prediction model training method, which includes:
acquiring a plurality of groups of training sample data, wherein the training sample data comprises an image sequence sample and an image sequence label corresponding to the image sequence sample;
inputting the image sequence samples into the generator of a generative adversarial network to obtain a predicted image sequence;
inputting the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and training the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model;
wherein the generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, the three-dimensional convolution and the three-dimensional deconvolution are used for extracting image temporal information from input information, and the image temporal information comprises frame information of the images;
alternatively, an image prediction method includes:
acquiring an original image sequence;
and inputting the original image sequence into an image prediction model to obtain a target image sequence.
In some embodiments, the image prediction model training method, or the image prediction method, may be implemented as a computer program tangibly embodied in a computer-readable storage medium, such as storage unit 18. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 10 via the ROM 12 and/or the communication unit 19. When the computer program is loaded into RAM 13 and executed by processor 11, one or more steps of the image prediction model training method, or the image prediction method, described above may be performed. Alternatively, in other embodiments, the processor 11 may be configured to perform the image prediction model training method, or the image prediction method, in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for implementing the methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. A computer program can execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine or entirely on a remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. A computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the Internet.
The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system, thereby overcoming the defects of difficult management and weak service scalability in traditional physical host and VPS (Virtual Private Server) services.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present invention may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solution of the present invention can be achieved.
The above-described embodiments should not be construed as limiting the scope of the invention. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. An image prediction model training method is characterized by comprising the following steps:
acquiring a plurality of groups of training sample data, wherein the training sample data comprises an image sequence sample and an image sequence label corresponding to the image sequence sample;
inputting the image sequence samples into the generator of a generative adversarial network to obtain a predicted image sequence;
inputting the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and training the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model;
wherein the generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, the three-dimensional convolution and the three-dimensional deconvolution are used for extracting image temporal information from input information, and the image temporal information comprises frame information of the images.
2. The method of claim 1, wherein the generator comprises a U-NET network unit for obtaining higher order image information in the image sequence samples.
3. The method according to claim 1, wherein the generative adversarial network is an improved network based on the WGAN-GP network model.
4. The method of claim 1, wherein the samples of the image sequence are consecutive video frames of a predetermined number of frames.
5. The method of claim 1, wherein the loss function of the generative adversarial network is:
L = L_adv + λ·L_1
where L denotes the overall loss function of the generative adversarial network, L_adv denotes the adversarial loss with gradient penalty, λ denotes a hyper-parameter, and L_1 denotes the mean absolute error loss function.
6. The method of claim 5, wherein the adversarial loss with gradient penalty is:
L_adv = E_z[D(G(z))] - E_x[D(x)] + λ_GP·L_GP
where E denotes expectation; x denotes an image sequence randomly drawn from the image sequence labels V_label, i.e., the image sequence labels corresponding to the image sequence samples, and D(x) denotes the discriminator's result for such a sequence; z denotes an image sequence randomly drawn from the image sequence samples V_input, and G(z) denotes the generator's result for such a sequence; λ_GP denotes a hyper-parameter for balancing the adversarial loss and the gradient penalty; and L_GP denotes the penalty term that makes the discriminator satisfy 1-Lipschitz continuity.
7. An image prediction method, comprising:
acquiring an original image sequence;
inputting the original image sequence into an image prediction model trained by the method of any one of claims 1-6 to obtain a target image sequence.
8. An image prediction model training apparatus, comprising:
the training sample data acquisition module is used for acquiring a plurality of groups of training sample data, wherein the training sample data comprises an image sequence sample and an image sequence label corresponding to the image sequence sample;
the generator processing module is used for inputting the image sequence samples into the generator of a generative adversarial network to obtain a predicted image sequence;
the model training module is used for inputting the predicted image sequence and the image sequence label corresponding to the image sequence sample into the discriminator of the generative adversarial network to obtain a discrimination result, and training the generative adversarial network based on the discrimination result until a training stop condition is met, thereby obtaining an image prediction model;
wherein the generator and the discriminator comprise three-dimensional convolution and three-dimensional deconvolution, the three-dimensional convolution and the three-dimensional deconvolution are used for extracting image temporal information from input information, and the image temporal information comprises frame information of the images.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the image prediction model training method of any one of claims 1-6, or the image prediction method of claim 7.
10. A computer-readable storage medium storing computer instructions for causing a processor to implement the image prediction model training method of any one of claims 1 to 6, or the image prediction method of claim 7 when executed.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211318456.3A | 2022-10-26 | 2022-10-26 | Image prediction model training method and device, electronic equipment and storage medium |
Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN115601612A | 2023-01-13 |
Legal Events

| Code | Title |
|---|---|
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |