CN117292024B - Voice-based image generation method and device, medium and electronic equipment - Google Patents
- Publication number
- CN117292024B (application CN202311580355.8A)
- Authority
- CN
- China
- Prior art keywords
- vector
- connection
- voice
- embedded
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/205—3D [Three Dimensional] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
- G06F3/165—Management of the audio stream, e.g. setting of volume, audio stream path
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Abstract
The application provides a voice-based image generation method, together with a corresponding device, medium, and electronic equipment. The image generation method comprises the following steps: acquiring a voice for generating a target image; processing the voice through an audio encoder to obtain an embedded vector of the voice; obtaining an embedded vector of a sample; connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector, and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and processing the denoising vector through a second decoder to obtain the target image. The image generation method reduces the complexity of the overall image generation process and improves the accuracy of the generated target image.
Description
Technical Field
The present application relates to voice-based image generation, and in particular to a voice-based image generation method, apparatus, medium, and electronic device.
Background
Speech recognition has developed rapidly in recent years and is now widely used in people's daily life and production, for example to produce text records from speech and to recognize spoken operation instructions.
At present, voice-based image generation is usually implemented by cascading tasks, for example by chaining ASR (Automatic Speech Recognition) with a Text2Image (text-to-image) model. Because such a method requires at least two models, the overall image generation process is complex and the accuracy of the generated images is low.
Disclosure of Invention
The invention aims to provide a voice-based image generation method, device, medium, and electronic equipment that solve the problems of the high complexity of the overall image generation process and the low accuracy of the generated images in existing image generation methods.
In a first aspect, the present application provides a voice-based image generation method, the image generation method including: acquiring a voice for generating a target image; processing the voice through an audio encoder to obtain an embedded vector of the voice; obtaining an embedded vector of a sample, wherein the sample is randomly sampled from a Gaussian distribution; connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector, and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and processing the denoising vector of the connection vector through a second decoder to obtain the target image.
In this image generation method, an image can be generated directly from the voice by introducing only the deep learning model. Since no cascade of tasks is involved, the complexity of the overall image generation process is low, and the information loss and distortion that easily arise between different tasks are avoided, so the accuracy of the finally generated target image is high.
In an embodiment of the present application, the implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector includes:
S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector, and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector;
S2: if the current iteration number is smaller than a preset iteration number, adding 1 to the current iteration number and returning to S1, wherein the connection vector in S1 is an updated connection vector, the updated connection vector being the denoising vector of the connection vector; otherwise, acquiring the denoising vector of the connection vector.
In an embodiment of the present application, the implementation method for processing the connection vector and the embedded vector of the speech through the deep learning model to obtain the noise vector includes: performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector; and performing second cross fusion processing on the fusion vector and the connection vector through the stacked first decoder so as to acquire the noise vector.
In an embodiment of the present application, the implementation method for obtaining the fusion vector by performing a first cross fusion process on the connection vector and the embedded vector of the speech through the stacked encoders includes: and performing first cross fusion processing on a first query vector, a first value vector and a first key vector through the stacked encoders to obtain the fusion vector, wherein the first query vector and the first key vector are both connection vectors, and the first value vector is an embedded vector of the voice.
In an embodiment of the present application, the implementation method for obtaining the noise vector by performing a second cross fusion process on the fusion vector and the connection vector by using the stacked first decoder includes: and performing second cross fusion processing on a second query vector, a second value vector and a second key vector through the stacked first decoder to acquire the noise vector, wherein the second query vector and the second key vector are both the connection vector, and the second value vector is the fusion vector.
In an embodiment of the present application, the denoising vector of the connection vector is the vector obtained by subtracting the noise vector from the connection vector.
In an embodiment of the present application, the second decoder is a decoder in a variational autoencoder.
In a second aspect, the present application provides a voice-based image generation apparatus, the image generation apparatus comprising: a voice acquisition module for acquiring a voice for generating a target image; a voice processing module for processing the voice through an audio encoder to obtain an embedded vector of the voice; a sample acquisition module for acquiring an embedded vector of a sample, wherein the sample is randomly sampled from a Gaussian distribution; a connection processing module for connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; a noise acquisition module for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and an image acquisition module for processing the denoising vector of the connection vector through a second decoder to obtain the target image.
In a third aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the image generation method according to any of the first aspects of the present application.
In a fourth aspect, the present application provides an electronic device, including: a memory storing a computer program; and a processor, communicatively connected to the memory, which executes the image generation method according to any one of the first aspect of the application when the computer program is invoked.
As described above, the voice-based image generation method, device, medium, and electronic equipment of the present application have the following beneficial effects:
In this image generation method, an image can be generated directly from the voice by introducing only the deep learning model. Since no cascade of tasks is involved, the complexity of the overall image generation process is low, and the information loss and distortion that easily arise between different tasks are avoided, so the accuracy of the finally generated target image is high.
Drawings
Fig. 1 is a schematic diagram of a hardware structure for running the image generating method according to the embodiment of the present application.
Fig. 2 is a flowchart of an image generating method according to an embodiment of the present application.
Fig. 3 is a flowchart of an implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector according to an embodiment of the present application.
Fig. 4 is a flowchart of an implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector according to an embodiment of the present application.
Fig. 5 is a schematic process diagram of an image generating method according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present application.
Description of element reference numerals
10. Electronic equipment
110. Memory device
120. Processor
1210. Central processing unit
1220. Neural network processor
12210. Neural network implementation engine
12220. Special hardware circuit
122210. Matrix calculation unit
122220. Vector calculation unit
600. Image generating device based on voice
610. Voice acquisition module
620. Speech processing module
630. Sample acquisition module
640. Connection processing module
650. Noise acquisition module
660. Image acquisition module
S11-S16. Steps
S1-S2. Steps
S21-S22. Steps
Detailed Description
Other advantages and effects of the present application will become readily apparent to those skilled in the art from the disclosure herein, when the following description of the embodiments is read in conjunction with the accompanying drawings. The present application may also be embodied or carried out in other specific embodiments, and the details herein may be modified or changed from various points of view and for various applications without departing from the spirit of the present application. It should be noted that the following embodiments and the features in the embodiments may be combined with each other provided they do not conflict.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the application by way of example. The drawings show only the components related to the application rather than the number, shape, and size of the components in an actual implementation; in practice, the form, quantity, and proportion of the components may be changed arbitrarily, and the component layout may be more complex.
The following describes the technical solutions in the embodiments of the present application in detail with reference to the drawings in the embodiments of the present application.
The image generation method provided by the embodiments of the present application can run on an electronic device. Fig. 1 is a block diagram of the hardware structure of such an electronic device. The electronic device 10 comprises a memory 110 and a processor 120. The processor 120 may be a central processing unit 1210 or a dedicated neural network processor 1220; the neural network processor 1220 comprises a neural network implementation engine 12210 and dedicated hardware circuits 12220, which include a matrix calculation unit 122210 and a vector calculation unit 122220.
Alternatively, the neural network processor 1220 is a processor that performs neural network computation using a dedicated hardware circuit 12220, the dedicated hardware circuit 12220 being an integrated circuit for performing neural network computation and including a matrix computation unit 122210 and a vector computation unit 122220 that perform vector-matrix multiplication in hardware.
Optionally, the neural network implementation engine 12210 is configured to generate instructions for execution by the dedicated hardware circuit 12220, which when executed by the dedicated hardware circuit 12220, cause the dedicated hardware circuit 12220 to perform operations specified by a neural network to generate a neural network output from the received neural network input.
As shown in fig. 2, the present embodiment provides a voice-based image generation method, which may be implemented by a processor of a computer device, the image generation method including:
s11, acquiring voice for generating a target image.
S12, processing the voice through an audio encoder to obtain an embedded vector of the voice.
Alternatively, the audio encoder may be a model based on a deep convolutional neural network or a recurrent neural network. The embedded vector of the voice may be a vector that captures the semantic and speech features of the audio, and may be a low-dimensional vector. The audio encoder may be, for example, wav2vec 2.0, a speech representation learning model based on self-supervised learning that converts speech signals into corresponding high-quality speech representations, i.e., encodes waveforms for speech recognition, speech understanding, and other speech-related tasks.
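As a rough sketch of step S12, the following shows how a wav2vec 2.0 encoder could map a waveform to an embedded vector using the HuggingFace Transformers library; the checkpoint name, the mean pooling, and the 768-dimensional output are illustrative assumptions, since the text only requires that the embedding capture the semantic and speech characteristics of the audio.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative checkpoint; the text names wav2vec 2.0 but not a specific model.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base-960h")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h").eval()

def embed_speech(waveform: torch.Tensor, sample_rate: int = 16000) -> torch.Tensor:
    """Map a raw 1-D waveform to a single speech embedding vector."""
    inputs = extractor(waveform.numpy(), sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state  # (1, T, 768) frame features
    return hidden.mean(dim=1)                         # (1, 768) pooled embedding
```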
S13, obtaining an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution.
Alternatively, the embedded vector of the sample may refer to a representation of the sample mapped into a vector space. The sample may be randomly sampled from a Gaussian distribution.
S14, connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector.
Alternatively, the connection processing may refer to joining the embedded vector of the voice and the embedded vector of the sample; in particular, the connection vector may be formed by concatenating the two embedded vectors along a specific dimension, as sketched below.
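A minimal sketch of the connection step in S14, assuming both embeddings have already been brought to the same batch shape; concatenating along the last dimension is one illustrative choice, since the text only specifies concatenation along a specific dimension.

```python
import torch

speech_embedding = torch.randn(1, 768)   # embedded vector of the voice (illustrative size)
sample_embedding = torch.randn(1, 768)   # embedded vector of a Gaussian sample
# Concatenate along the feature dimension to form the connection vector.
connection_vector = torch.cat([speech_embedding, sample_embedding], dim=-1)
print(connection_vector.shape)           # torch.Size([1, 1536])
```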
S15, processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector, and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism.
Alternatively, the noise vector may refer to the noise associated with the connection vector, and the denoising vector may refer to the connection vector with that noise removed. The cross-attention mechanism is a variant of the attention mechanism for handling associations and interactions between multiple input sequences; it is commonly used in natural language processing and computer vision tasks, particularly in sequence-to-sequence models and attention models.
Alternatively, the deep learning model may be a U2Net model.
S16, processing the denoising vector of the connection vector through a second decoder to acquire the target image.
Alternatively, the second decoder and the first decoder may be different types of decoders. The second decoder may be the decoder of a variational autoencoder, i.e., a VAE decoder (VAE-Decoder).
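As a hedged sketch of step S16, the following decodes a denoised latent with a Stable-Diffusion-style VAE decoder from the diffusers library; the checkpoint, the latent shape, and the scaling factor are assumptions, since the text only requires that the second decoder be the decoder of a variational autoencoder.

```python
import torch
from diffusers import AutoencoderKL

# Illustrative pretrained VAE; any variational autoencoder decoder would fit the description.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

def decode_to_image(denoised_latent: torch.Tensor) -> torch.Tensor:
    """Decode a (1, 4, 64, 64) denoised latent into a (1, 3, 512, 512) image tensor."""
    with torch.no_grad():
        image = vae.decode(denoised_latent / vae.config.scaling_factor).sample
    return (image.clamp(-1, 1) + 1) / 2   # rescale from [-1, 1] to [0, 1]
```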
As can be seen from the above description, the image generation method according to the present embodiment includes: acquiring a voice for generating a target image; processing the voice through an audio encoder to obtain an embedded vector of the voice; obtaining an embedded vector of a sample, wherein the sample is randomly sampled from a Gaussian distribution; connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector; processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector, and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism; and processing the denoising vector of the connection vector through a second decoder to obtain the target image.
In this image generation method, an image can be generated directly from the voice by introducing only the deep learning model. Since no cascade of tasks is involved, the complexity of the overall image generation process is low, and the information loss and distortion that easily arise between different tasks are avoided, so the accuracy of the finally generated target image is high.
As shown in fig. 3, the present embodiment provides an implementation method for processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector, including:
S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector, and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector.
Optionally, the denoising vector of the connection vector is the vector obtained by subtracting the noise vector from the connection vector.
S2: if the current iteration number is smaller than a preset iteration number, adding 1 to the current iteration number and returning to S1, wherein the connection vector in S1 is an updated connection vector, the updated connection vector being the denoising vector of the connection vector; otherwise, acquiring the denoising vector of the connection vector.
Alternatively, the preset iteration number may be flexibly set according to practical situations, for example, the preset iteration number may be set to 500. The initial value of the current iteration number may be 1.
Optionally, the noise vector is a noise vector related to the connection vector. When returning from S2 to S1, the connection vector in S1 is the updated connection vector, the noise vector in S1 is the noise vector related to the updated connection vector, and the denoising vector of the connection vector in S1 refers to the vector obtained by subtracting the noise vector from the updated connection vector. For example, if the connection vector is u and the noise vector is v, the denoising vector of the connection vector is u-v.
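A minimal sketch of the S1/S2 iteration, under the convention above that the denoising vector is the connection vector minus the predicted noise (u-v); the model signature is an illustrative assumption, while the default step count of 500 mirrors the example in the text.

```python
import torch

def iterative_denoise(model, connection: torch.Tensor, speech: torch.Tensor,
                      preset_iterations: int = 500) -> torch.Tensor:
    """Repeat S1/S2: predict noise for the current connection vector and subtract it."""
    u = connection
    for _ in range(preset_iterations):
        v = model(u, speech)   # S1: noise vector for the current (updated) connection vector
        u = u - v              # denoising vector becomes the updated connection vector
    return u                   # S2 exit: denoising vector after the preset iterations
```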
As can be seen from the above description, the implementation method according to the present embodiment of processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector, and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector, includes: S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector, and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector; S2: if the current iteration number is smaller than the preset iteration number, adding 1 to the current iteration number and returning to S1, wherein the connection vector in S1 is the updated connection vector, namely the denoising vector of the connection vector; otherwise, outputting the denoising vector of the connection vector.
In this method, S1 and S2 are executed repeatedly according to the preset number of iterations, so that a connection vector with the noise fully removed is obtained, which improves the accuracy of the finally generated target image.
As shown in fig. 4, the implementation method for processing the connection vector and the embedded vector of the voice through the deep learning model to obtain the noise vector includes:
S21, performing first cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector.
Optionally, the first cross fusion processing may refer to a process in which the encoder combines the features of at least two vectors; the new feature vector generated by this combination is the fusion vector.
Optionally, the implementation method for performing cross fusion processing on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector includes: and performing first cross fusion processing on a first query vector, a first value vector and a first key vector through the stacked encoders to obtain the fusion vector, wherein the first query vector and the first key vector are both connection vectors, and the first value vector is an embedded vector of the voice.
Optionally, the stacked encoders may include a forward encoder and a backward encoder, the backward encoder being the first encoder after the forward encoder. The output of one encoder forms part of the input of the next: for example, the input of the forward encoder is the connection vector and the embedded vector of the voice, the output of the forward encoder is a fusion vector, and this fusion vector together with the connection vector forms the input of the backward encoder.
Optionally, the encoder is implemented based on a cross-attention mechanism, so performing the first cross fusion processing on the first query vector, the first value vector, and the first key vector through the stacked encoders refers to performing that processing under the cross-attention mechanism. Under the cross-attention mechanism, the query vector is a feature vector representing the target or query of interest to the current attention step, the value vector contains the feature information, and the key vector is used to measure the similarity between the query vector and the value vector.
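One way to realize such an encoder layer is sketched below, assuming the connection vector and the speech embedding are sequences of the same length and hidden size (a requirement of torch's MultiheadAttention, not of the text); the residual connection, layer norm, head count, and hidden size are common design choices added for illustration.

```python
import torch
import torch.nn as nn

class CrossFusionEncoderLayer(nn.Module):
    """Cross-attention with query = key = connection vector, value = speech embedding."""
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, connection: torch.Tensor, speech: torch.Tensor) -> torch.Tensor:
        # connection, speech: (batch, seq_len, dim) with matching seq_len
        fused, _ = self.attn(query=connection, key=connection, value=speech)
        return self.norm(connection + fused)   # fusion vector fed to the next encoder
```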
S22, performing second cross fusion processing on the fusion vector and the connection vector through the stacked first decoder to acquire the noise vector.
Optionally, the deep learning model includes the stacked encoder and the stacked first decoder, the stacked encoder may specifically refer to a stacked encoder in a U2Net model, and the stacked first decoder may specifically refer to a stacked decoder in a U2Net model.
Optionally, the second cross fusion processing may refer to a process in which the first decoder combines the features of at least two vectors; the new feature vector generated by this combination is the noise vector. Since the encoder and the decoder may combine vector features differently, their cross fusion processes are distinguished as the first and the second cross fusion processing.
Optionally, the implementation method for obtaining the noise vector by performing a second cross fusion process on the fusion vector and the connection vector through the stacked first decoder includes: and performing second cross fusion processing on a second query vector, a second value vector and a second key vector through the first decoder to acquire the noise vector, wherein the second query vector and the second key vector are both the connection vector, and the second value vector is the fusion vector.
Optionally, the deep learning model includes the stacked encoders and the stacked first decoders. The principle of operation of the forward decoder and the backward decoder in the first decoder of the stack may be similar to that of the forward encoder and the backward encoder described above and will not be described again here.
Optionally, the first decoder is implemented based on a cross-attention mechanism, and performing, by the first decoder, the second cross-fusion process on the second query vector, the second value vector, and the second key vector may refer to performing, by the first decoder, the second cross-fusion process on the second query vector, the second value vector, and the second key vector under the cross-attention mechanism.
The terms "first query vector" and "second query vector" are used only to distinguish different query vectors; the same holds for the first and second value vectors and the first and second key vectors, which are therefore not described separately here.
For a clear illustration of the process of the image generation method, refer to fig. 5. Q1, K1, V1 in fig. 5 represent a first query vector, a first key vector, and a first value vector, respectively, and Q2, K2, V2 represent a second query vector, a second key vector, and a second value vector, respectively.
The protection scope of the image generation method according to the embodiments of the present application is not limited to the execution order of the steps listed in the embodiments; solutions in which steps are added, removed, or replaced according to the prior art based on the principles of the present application are all included in the protection scope of the present application.
As shown in fig. 6, the present embodiment provides a voice-based image generation apparatus 600, the image generation apparatus 600 including:
the voice acquisition module 610 is configured to acquire voice for generating a target image.
The speech processing module 620 is configured to process the speech through an audio encoder to obtain an embedded vector of the speech.
The sample acquiring module 630 is configured to acquire an embedded vector of samples, where the samples are randomly sampled from a gaussian distribution.
And the connection processing module 640 is configured to perform connection processing on the embedded vector of the voice and the embedded vector of the sample, so as to obtain a connection vector.
A noise acquisition module 650, configured to process the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and obtain a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism.
An image acquisition module 660, configured to process, by a second decoder, the denoising vector of the connection vector to acquire the target image.
In the image generation apparatus 600 provided in this embodiment, the voice acquisition module 610 corresponds to step S11, the voice processing module 620 to step S12, the sample acquisition module 630 to step S13, the connection processing module 640 to step S14, the noise acquisition module 650 to step S15, and the image acquisition module 660 to step S16 of the image generation method shown in fig. 2.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus or method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules/units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple modules or units may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules or units, which may be in electrical, mechanical or other forms.
The modules/units illustrated as separate components may or may not be physically separate, and components shown as modules/units may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules/units may be selected according to actual needs to achieve the purposes of the embodiments of the present application. For example, functional modules/units in various embodiments of the present application may be integrated into one processing module, or each module/unit may exist alone physically, or two or more modules/units may be integrated into one module/unit.
Those of ordinary skill will further appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, in computer software, or in a combination of the two. The elements and steps of the examples have been described above generally in terms of their function, in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
This embodiment provides an electronic device, which comprises a memory storing a computer program, and a processor communicatively connected to the memory; when the computer program is invoked, the processor executes the image generation method shown in fig. 2.
Embodiments of the present application also provide a computer-readable storage medium. Those of ordinary skill in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing a processor, where the program may be stored in a computer-readable storage medium, the storage medium being a non-transitory medium, such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof. The storage media may be any available media that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Embodiments of the present application may also provide a computer program product comprising one or more computer instructions. When the computer instructions are loaded and executed on a computing device, the processes or functions described in accordance with the embodiments of the present application are produced in whole or in part. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, or data center to another website, computer, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means.
The computer program product is executed by a computer, which performs the method according to the preceding method embodiment. The computer program product may be a software installation package, which may be downloaded and executed on a computer in case the aforementioned method is required.
The description of each process or structure corresponding to the drawings has its own emphasis; for parts of a certain process or structure that are not described in detail, reference may be made to the related descriptions of other processes or structures.
The foregoing embodiments merely illustrate the principles of the present application and their effects, and are not intended to limit the application. Modifications and variations may be made to the above-described embodiments by those of ordinary skill in the art without departing from the spirit and scope of the present application. Accordingly, all equivalent modifications and variations accomplished by persons skilled in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of this application.
Claims (10)
1. A voice-based image generation method, the image generation method comprising:
acquiring voice for generating a target image;
processing the voice through an audio encoder to obtain an embedded vector of the voice, wherein the embedded vector of the voice is a vector capturing the semantic and voice characteristics of the audio;
obtaining an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution;
connecting the embedded vector of the voice and the embedded vector of the sample to obtain a connection vector;
processing the connection vector and the embedded vector of the voice through a deep learning model to obtain a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, wherein the deep learning model comprises an encoder and a first decoder based on a cross-attention mechanism;
and processing the denoising vector of the connection vector through a second decoder to acquire the target image.
2. The image generation method according to claim 1, wherein the implementation method of processing the connection vector and the embedded vector of the speech by a deep learning model to acquire a noise vector and acquiring a denoised vector of the connection vector based on the noise vector and the connection vector comprises:
S1: processing the connection vector and the embedded vector of the voice through the deep learning model to obtain a noise vector, and obtaining a denoising vector of the connection vector based on the noise vector and the connection vector;
S2: if the current iteration number is smaller than a preset iteration number, adding 1 to the current iteration number and returning to S1, wherein the connection vector in S1 is an updated connection vector, the updated connection vector being the denoising vector of the connection vector; otherwise, acquiring the denoising vector of the connection vector.
3. The image generation method according to claim 1, wherein the implementation method of processing the connection vector and the embedded vector of the speech by a deep learning model to obtain a noise vector comprises:
performing a first cross fusion process on the connection vector and the embedded vector of the voice through the stacked encoders to obtain a fusion vector, wherein the first cross fusion process refers to a process that the encoders combine according to the characteristics of at least two vectors, and a new characteristic vector generated after combination is the fusion vector;
and performing second cross fusion processing on the fusion vector and the connection vector through the stacked first decoder to acquire the noise vector, wherein the second cross fusion processing refers to a processing process of combining the first decoder according to the characteristics of at least two vectors, and a new characteristic vector generated after combination is the noise vector.
4. The image generation method according to claim 3, wherein the implementation method of performing a first cross fusion process on the connection vector and the embedded vector of the speech by the stacked encoders to obtain a fusion vector comprises: and performing first cross fusion processing on a first query vector, a first value vector and a first key vector through the stacked encoders to obtain the fusion vector, wherein the first query vector and the first key vector are both connection vectors, and the first value vector is an embedded vector of the voice.
5. The image generation method according to claim 3, wherein the implementation method of performing a second cross-fusion process on the fusion vector and the connection vector by the stacked first decoder to acquire the noise vector comprises: and performing second cross fusion processing on a second query vector, a second value vector and a second key vector through the stacked first decoder to acquire the noise vector, wherein the second query vector and the second key vector are both the connection vector, and the second value vector is the fusion vector.
6. The image generation method according to claim 1, wherein the denoising vector of the connection vector is a vector obtained by subtracting the noise vector from the connection vector.
7. The image generation method according to claim 1, wherein the second decoder is a decoder in a variational autoencoder.
8. A speech-based image generation apparatus, characterized in that the image generation apparatus comprises:
a voice acquisition module for acquiring voice for generating a target image;
the voice processing module is used for processing the voice through the audio encoder to obtain an embedded vector of the voice, wherein the embedded vector of the voice is a vector capturing the semantic and voice characteristics of the audio;
the sample acquisition module is used for acquiring an embedded vector of a sample, wherein the sample is randomly sampled from Gaussian distribution;
the connection processing module is used for carrying out connection processing on the embedded vector of the voice and the embedded vector of the sample so as to obtain a connection vector;
a noise acquisition module for processing the connection vector and the embedded vector of the speech through a deep learning model to acquire a noise vector and a denoising vector of the connection vector based on the noise vector and the connection vector, the deep learning model comprising an encoder and a first decoder based on a cross-attention mechanism;
and the image acquisition module is used for processing the denoising vector of the connection vector through a second decoder so as to acquire the target image.
9. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the image generation method of any of claims 1-7.
10. An electronic device, the electronic device comprising:
a memory storing a computer program;
a processor in communication with the memory, which when invoked performs the image generation method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311580355.8A CN117292024B (en) | 2023-11-24 | 2023-11-24 | Voice-based image generation method and device, medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311580355.8A CN117292024B (en) | 2023-11-24 | 2023-11-24 | Voice-based image generation method and device, medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117292024A (en) | 2023-12-26
CN117292024B (en) | 2024-04-12
Family
ID=89244727
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311580355.8A Active CN117292024B (en) | 2023-11-24 | 2023-11-24 | Voice-based image generation method and device, medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117292024B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7168953B1 (en) * | 2003-01-27 | 2007-01-30 | Massachusetts Institute Of Technology | Trainable videorealistic speech animation |
KR20190135853A (en) * | 2018-05-29 | 2019-12-09 | 한국과학기술원 | Method and system of text to multiple speech |
CN115643341A (en) * | 2022-10-14 | 2023-01-24 | 杭州半云科技有限公司 | Artificial intelligence customer service response system |
CN115758282A (en) * | 2022-11-07 | 2023-03-07 | 上海蜜度信息技术有限公司 | Cross-modal sensitive information identification method, system and terminal |
CN115937369A (en) * | 2022-11-21 | 2023-04-07 | 之江实验室 | Expression animation generation method and system, electronic equipment and storage medium |
CN116741197A (en) * | 2023-08-11 | 2023-09-12 | 上海蜜度信息技术有限公司 | Multi-mode image generation method and device, storage medium and electronic equipment |
CN116884427A (en) * | 2023-05-10 | 2023-10-13 | 华中科技大学 | Embedded vector processing method based on end-to-end deep learning voice re-etching model |
CN116932712A (en) * | 2023-06-30 | 2023-10-24 | 上海蜜度信息技术有限公司 | Multi-mode input interactive information generation method, device, equipment and medium |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020256471A1 (en) * | 2019-06-21 | 2020-12-24 | 주식회사 머니브레인 | Method and device for generating speech video on basis of machine learning |
WO2020256472A1 (en) * | 2019-06-21 | 2020-12-24 | 주식회사 머니브레인 | Method and device for generating utterance video by using voice signal |
KR102483416B1 (en) * | 2020-08-25 | 2022-12-30 | 주식회사 딥브레인에이아이 | Method and apparatus for generating speech video |
US11682153B2 (en) * | 2020-09-12 | 2023-06-20 | Jingdong Digits Technology Holding Co., Ltd. | System and method for synthesizing photo-realistic video of a speech |
KR20230060266A (en) * | 2021-10-27 | 2023-05-04 | 삼성에스디에스 주식회사 | Method and apparatus for generating synthetic data |
- 2023-11-24: Application CN202311580355.8A filed in CN; granted as patent CN117292024B (active)
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7168953B1 (en) * | 2003-01-27 | 2007-01-30 | Massachusetts Institute Of Technology | Trainable videorealistic speech animation |
KR20190135853A (en) * | 2018-05-29 | 2019-12-09 | 한국과학기술원 | Method and system of text to multiple speech |
CN115643341A (en) * | 2022-10-14 | 2023-01-24 | 杭州半云科技有限公司 | Artificial intelligence customer service response system |
CN115758282A (en) * | 2022-11-07 | 2023-03-07 | 上海蜜度信息技术有限公司 | Cross-modal sensitive information identification method, system and terminal |
CN115937369A (en) * | 2022-11-21 | 2023-04-07 | 之江实验室 | Expression animation generation method and system, electronic equipment and storage medium |
CN116884427A (en) * | 2023-05-10 | 2023-10-13 | 华中科技大学 | Embedded vector processing method based on end-to-end deep learning voice re-etching model |
CN116932712A (en) * | 2023-06-30 | 2023-10-24 | 上海蜜度信息技术有限公司 | Multi-mode input interactive information generation method, device, equipment and medium |
CN116741197A (en) * | 2023-08-11 | 2023-09-12 | 上海蜜度信息技术有限公司 | Multi-mode image generation method and device, storage medium and electronic equipment |
Non-Patent Citations (1)
Title |
---|
A new S/U/V decision method based on template matching of speech waveform images; Deng Lixin, Yang Zhen, Zheng Baoyu; Journal of Nanjing University of Posts and Telecommunications (Natural Science Edition), No. 01; pp. 38-42 *
Also Published As
Publication number | Publication date |
---|---|
CN117292024A (en) | 2023-12-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110321958B (en) | Training method of neural network model and video similarity determination method | |
CN112633419B (en) | Small sample learning method and device, electronic equipment and storage medium | |
CN111460812B (en) | Sentence emotion classification method and related equipment | |
TWI776462B (en) | Image processing method, electronic device and computer readable storage medium | |
CN110070867B (en) | Speech instruction recognition method, computer device and computer-readable storage medium | |
CN113468344A (en) | Entity relationship extraction method and device, electronic equipment and computer readable medium | |
CN111275166B (en) | Convolutional neural network-based image processing device, equipment and readable storage medium | |
CN114627868A (en) | Intention recognition method and device, model and electronic equipment | |
CN114155388B (en) | Image recognition method and device, computer equipment and storage medium | |
CN111126056A (en) | Method and device for identifying trigger words | |
CN114065915A (en) | Network model construction method, data processing method, device, medium and equipment | |
CN117292024B (en) | Voice-based image generation method and device, medium and electronic equipment | |
CN111859933A (en) | Training method, recognition method, device and equipment of Malay recognition model | |
CN114898734B (en) | Pre-training method and device based on voice synthesis model and electronic equipment | |
CN116468038A (en) | Information extraction method, method and device for training information extraction model | |
CN110675865A (en) | Method and apparatus for training hybrid language recognition models | |
CN115565529A (en) | 3D model control method, device, equipment and storage medium based on voice recognition | |
CN115049546A (en) | Sample data processing method and device, electronic equipment and storage medium | |
CN114998668A (en) | Feature extraction method and device, storage medium and electronic equipment | |
CN114155868A (en) | Voice enhancement method, device, equipment and storage medium | |
CN113222113B (en) | Signal generation method and device based on deconvolution layer | |
CN110808035B (en) | Method and apparatus for training hybrid language recognition models | |
CN115440198B (en) | Method, apparatus, computer device and storage medium for converting mixed audio signal | |
CN117474037B (en) | Knowledge distillation method and device based on space distance alignment | |
CN114330512B (en) | Data processing method, device, electronic equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |