WO2022179264A1 - An audio generation method and device - Google Patents

An audio generation method and device

Info

Publication number
WO2022179264A1
Authority
WO
WIPO (PCT)
Prior art keywords
target
audio
spectrogram
grayscale
image
Prior art date
Application number
PCT/CN2021/138568
Other languages
English (en)
French (fr)
Inventor
闫震海 (Yan Zhenhai)
Original Assignee
Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. (腾讯音乐娱乐科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.
Publication of WO2022179264A1
Priority to US18/238,184 (published as US20230402054A1)

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/018 - Audio watermarking, i.e. embedding inaudible data in the audio signal (under G10L19/00, speech or audio analysis-synthesis techniques for redundancy reduction)
    • G10L25/18 - Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L13/02 - Methods for producing synthetic speech; speech synthesisers
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/203 - Drawing of straight lines or curves (under G06T11/00, 2D [two-dimensional] image generation)
    • G06T11/206 - Drawing of charts or graphs
    • G06T5/92 - Dynamic range modification of images or parts thereof based on global image properties

Definitions

  • the present application relates to the technical field of audio processing, and in particular, to an audio generation method and device.
  • the picture is directly used as the cover of an audio file, and then the picture and audio are stored in a new file format, so that the user can directly display the picture when playing the audio.
  • because the picture is only used as the cover image of the audio, the correlation between the picture and the audio is low, and its practicality is poor.
  • the embodiments of the present application provide a method and device for generating audio based on image processing, which can achieve the purpose of embedding image information in audio, so that the image has a sound-emitting function while the audio contains image information, greatly improving the correlation between the audio and the image.
  • an embodiment of the present application discloses an audio generation method, the method comprising:
  • the audio generation instruction is used to indicate the two-dimensional image that the user wants to embed in the generated target audio
  • the target audio corresponding to the target spectrogram is generated by using the target spectrogram.
  • an embodiment of the present application provides an audio generation device, the device comprising:
  • a processor and a memory that are interconnected, wherein the memory is used to store a computer program, the computer program including program instructions, and the processor is configured to invoke the program instructions and execute the following steps:
  • the audio generation instruction is used to indicate the two-dimensional image that the user wants to embed in the generated target audio
  • the target audio corresponding to the target spectrogram is generated by using the target spectrogram.
  • an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and the computer program includes program instructions that, when executed by a processor, cause the processor to perform the following steps:
  • the audio generation instruction is used to indicate the two-dimensional image that the user wants to embed in the generated target audio
  • the target audio corresponding to the target spectrogram is generated by using the target spectrogram.
  • the target grayscale image of the two-dimensional image that the user wants to embed in the generated target audio can be obtained in response to the audio generation instruction, and the target grayscale image can be stored in the target audio.
  • the grayscale data of each pixel is converted into frequency domain data of each pixel in the spectrogram to obtain the target spectrogram; that is, the two-dimensional image is associated with the target spectrogram of the target audio, and the target spectrogram is then used to generate the corresponding target audio, thereby realizing generation of the target audio from the two-dimensional image.
  • the embodiment of the present application can achieve the purpose of embedding image information in the audio, so that the image has a sound-emitting function, and at the same time, the audio can contain image information, which greatly improves the correlation between the audio and the image.
  • FIG. 1 is a schematic flowchart of an audio generation method provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of obtaining a target grayscale image provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of the effect of an image processing process provided by an embodiment of the present application.
  • FIG. 4 is a schematic flowchart of synthesizing audio from a target spectrogram provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of another audio generation method provided by an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a method for obtaining an original spectrogram provided by an embodiment of the present application.
  • FIG. 7a is a schematic diagram of the effect of a target spectrogram provided by an embodiment of the present application.
  • FIG. 7b is a schematic diagram of the effect of another target spectrogram provided by an embodiment of the present application.
  • FIG. 8a is an example diagram of a target spectrogram provided by an embodiment of the present application.
  • FIG. 8b is another example diagram of a target spectrogram provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of still another audio generation method provided by an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of an audio generation apparatus provided by an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of an audio generation device provided by an embodiment of the present application.
  • the embodiments of the present application can embed image information in audio, for example, by transforming or constructing a spectrogram according to the image information, so as to obtain audio carrying image information, improve the correlation between the image and the audio, and allow the user to intuitively perceive the image information contained in the audio.
  • a spectrogram may refer to a speech spectrogram.
  • the abscissa of the spectrogram can be time, the ordinate can be frequency, and the value of each coordinate point can represent the energy value of the speech data; the column of data corresponding to each time point in the spectrogram represents the frequency domain data of one frame of the audio signal.
  • the size of the energy value of the speech data is usually represented by the depth of the color (the darker the color, the greater the energy value), although other representations may be used, which is not limited in this application.
  • the audio generation solution involved in this application can be used in an audio generation device, for example, can be specifically applied to various types of audio software installed in the audio generation device, including but not limited to music playback software, audio editing software, audio conversion software, and the like.
  • the audio generation device may be a terminal, or may be a server, or may be other devices, which are not limited in this application.
  • the terminal here may include, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, and so on.
  • the audio generation method, device, equipment, and medium proposed in the embodiments of the present application can obtain audio carrying image information by transforming or constructing a spectrogram from the image information. This achieves the purpose of embedding image information in the audio, so that the image has a sound-emitting function while the audio contains image information, greatly improving the correlation between the audio and the image. Detailed descriptions are given below.
  • FIG. 1 is a schematic flowchart of an audio generation method provided by an embodiment of the present application.
  • the flow shown in FIG. 1 may include the following steps S101-S104.
  • the audio generation instruction can be used to indicate the two-dimensional image that the user wants to embed in the generated target audio.
  • the two-dimensional image may be an existing image stored in a picture format, may also be content created in a temporary creation area, or may be multiple two-dimensional images used to capture changes in user actions.
  • if the content that the user wants to embed in the generated target audio is a file in a non-picture format, such as text or tables, the non-picture-format file can first be converted into a picture format, and the converted image can then be embedded in the target audio.
  • the image format may be a static image file format, such as jpg, png, bmp, jpeg, etc., which is not limited here. For example, the file that needs to be embedded in the target audio is obtained, its file extension is determined, and the file is converted into a picture format if its extension (e.g., vsd, xls, doc) is not a picture format.
  • the target grayscale image may be obtained by acquiring a two-dimensional image and processing the two-dimensional image, or may be obtained by directly acquiring and processing a grayscale image from a memory as the target grayscale image, which is not limited in this application.
  • the target grayscale image may also be called a target grayscale map, target grayscale information, a target grayscale matrix, etc.
  • the target grayscale image may be a grayscale data matrix, a block diagram with pixel values, and the like.
  • the value of each position in the target grayscale image may be referred to as a grayscale value, a pixel value, etc., which is not limited here.
  • acquiring a two-dimensional image and processing the two-dimensional image may include: acquiring an original grayscale image of the two-dimensional image, performing proportional scaling on the original grayscale image, performing histogram equalization on the proportionally scaled grayscale image, performing normalization, and so on.
  • acquiring the target grayscale image of the two-dimensional image may include the following steps S201-S202.
  • the original grayscale image of the two-dimensional image may be a grayscale image concept in the field of image processing.
  • each pixel in the image has 256 grayscale levels, where 255 represents pure white and 0 represents pure black.
  • the original grayscale image of a two-dimensional image after grayscale processing is (0, 100, 123; 215, 124, 165; 255, 65, 98).
  • the original grayscale image of the two-dimensional image is denoted as GrayP1 here, and the height of the picture is H1.
  • the proportional scaling processing may be performing proportional scaling processing according to a scaling factor.
  • the proportional scaling process aims to adjust the height H1 of the original grayscale image GrayP1 of the two-dimensional image, and obtain the proportionally scaled grayscale image.
  • the proportionally scaled grayscale image is denoted GrayP2, and its height is denoted H2.
  • the height H2 of the proportionally scaled grayscale image is a preset value, and the scaling factor can be calculated from the height H2 of the proportionally scaled grayscale image and the height H1 of the original grayscale image of the two-dimensional image.
  • the proportional scaling factor can be used to proportionally adjust the original grayscale image to an appropriate size, so that the target audio can be obtained by transforming or constructing the original audio through the final generated target grayscale image.
  • the height H2 of the grayscale image GrayP2 after the proportional scaling process may be 2^N + 1, where N is a preset positive integer.
  • the height H2 of the proportionally scaled grayscale image can be determined according to the height of the target spectrogram corresponding to the target audio to be generated, according to the frequency domain data of the original spectrogram, according to the device screen size and/or resolution, or in other ways, which are not limited in this application.
  • GrayP3 is the grayscale data matrix, namely:
  • GrayP3 = GrayP2 / max(GrayP2).
  • the target grayscale image GrayP3 is (0.2, 0.3, 0.4; 0.5, 0.6, 0.7; 0.8, 0.9, 1).
  • the grayscale data matrix GrayP3 of the target grayscale image of the two-dimensional image is obtained, wherein all the data of GrayP3 lie between 0 and 1.
  • Figure 3 shows an effect display diagram of an image processing process. A color image is converted to grayscale to obtain the original grayscale image of a two-dimensional image, and then scaled to obtain a grayscale image after proportional scaling. The target grayscale image of the two-dimensional image is obtained after normalization.
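The grayscale-conversion, scaling, and normalization steps above can be sketched in Python with NumPy. This is an illustrative sketch, not the patent's implementation: the BT.601 luma weights and nearest-neighbour scaling are assumptions, and all function names are hypothetical.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert an H x W x 3 RGB image to grayscale (ITU-R BT.601 weights; assumed)."""
    return rgb[..., 0] * 0.299 + rgb[..., 1] * 0.587 + rgb[..., 2] * 0.114

def scale_to_height(gray, target_h):
    """Nearest-neighbour proportional scaling of GrayP1 to height target_h (GrayP2).
    Width is scaled by the same factor to preserve the aspect ratio."""
    h1, w1 = gray.shape
    factor = target_h / h1
    target_w = max(1, round(w1 * factor))
    rows = np.minimum((np.arange(target_h) / factor).astype(int), h1 - 1)
    cols = np.minimum((np.arange(target_w) / factor).astype(int), w1 - 1)
    return gray[rows][:, cols]

def normalize(gray):
    """GrayP3 = GrayP2 / max(GrayP2), so all values fall in [0, 1]."""
    return gray / gray.max()

# Toy 4x4 RGB image; target height 2**N + 1 with N = 3 gives 9 rows.
rgb = np.random.default_rng(0).integers(0, 256, (4, 4, 3)).astype(float)
gray_p3 = normalize(scale_to_height(to_grayscale(rgb), 2**3 + 1))
```

In practice a library resampler (e.g. with interpolation) would replace the nearest-neighbour step; only the height constraint 2^N + 1 and the division by the maximum come from the description.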
  • a histogram equalization process may also be performed on the grayscale image GrayP2 after the proportional scaling process, so as to enhance the contrast of the data in different positions in the GrayP2 and improve the picture quality.
  • functions can be directly called for processing, such as the histeq function in MATLAB, the equalizeHist function in opencv, and so on.
  • the grayscale image processed by histogram equalization can be normalized to obtain the target grayscale image of the two-dimensional image.
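For environments without MATLAB's histeq or OpenCV's equalizeHist, the histogram equalization step can be sketched directly in NumPy. This is a minimal stand-in, not the patent's code; the CDF-remapping formula is the textbook variant.

```python
import numpy as np

def equalize_hist(gray_u8):
    """Histogram equalization for a uint8 grayscale image:
    map each grey level through the normalised cumulative histogram."""
    hist = np.bincount(gray_u8.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value of the lowest occupied level
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[gray_u8]

# A low-contrast image (values clustered around 110) spreads over 0..255.
img = np.clip(np.random.default_rng(1).normal(110, 5, (32, 32)), 0, 255).astype(np.uint8)
eq = equalize_hist(img)
```

As the text notes, calling histeq (MATLAB) or cv2.equalizeHist (OpenCV) achieves the same contrast enhancement directly.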
  • if the two-dimensional image already meets the processing-result criteria of the above steps S201-S202, the operations of steps S201-S202 do not need to be performed, and the two-dimensional image is directly used as the target grayscale image.
  • the two-dimensional image may include a plurality of two-dimensional images used to capture changes in user actions, and the changes in user actions may refer to changes in user gestures, changes in facial expressions, etc., which are not limited here.
  • the acquisition of the target grayscale image of the two-dimensional image may include the following steps: respectively calculating the grayscale difference values between the two-dimensional images whose acquisition time is adjacent in the plurality of two-dimensional images, to obtain a plurality of grayscale difference values; The plurality of grayscale difference values are arranged according to the acquisition time corresponding to the grayscale difference values to obtain a target grayscale image.
  • the collection sources of the multiple 2D images may be videos captured in real time, videos stored in an audio generation device such as a terminal or other storage devices, or multiple images captured continuously, which are not limited here.
  • the grayscale difference value may be the difference between the target grayscale images corresponding to two-dimensional images adjacent in acquisition time. For example, for a video containing multiple two-dimensional images, one two-dimensional image is collected at each of time points t1, t2, and t3, yielding three two-dimensional images P1, P2, and P3, and the target grayscale image of each is obtained according to steps S201-S202. The grayscale difference between P1 and P2 and the grayscale difference between P2 and P3 are then calculated, and the two grayscale differences are arranged according to acquisition time, with the P1-P2 difference ranked before (i.e., to the left of) the P2-P3 difference, thereby obtaining the target grayscale image corresponding to the plurality of two-dimensional images used to capture user action changes.
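The frame-difference procedure just described can be sketched as follows. The absolute difference and the horizontal arrangement via hstack are assumptions about how "difference" and "ranked to the left" are realized; the function name is hypothetical.

```python
import numpy as np

def motion_grayscale(frames):
    """Build a target grayscale image from frames captured at successive
    times: adjacent-frame grayscale differences, arranged left-to-right
    by capture time (earlier differences further left)."""
    diffs = [np.abs(frames[i + 1] - frames[i]) for i in range(len(frames) - 1)]
    return np.hstack(diffs)

# Three toy grayscale frames P1, P2, P3 captured at t1 < t2 < t3.
p1 = np.zeros((4, 4))
p2 = np.ones((4, 4))
p3 = np.full((4, 4), 3.0)
target = motion_grayscale([p1, p2, p3])  # left half: |P2-P1|, right half: |P3-P2|
```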
  • S103 Convert the grayscale data of each pixel in the target grayscale image into frequency domain data of each pixel in the spectrogram to obtain a target spectrogram.
  • the embodiment of the present application obtains the target spectrogram mainly in two ways.
  • the original spectrogram of the original audio is transformed based on the target grayscale image of the two-dimensional image to obtain the target spectrogram,
  • a target grayscale image such as a grayscale data matrix GrayP3 can be used as a weighting factor to weight the original spectrogram of the original audio to obtain a target spectrogram;
  • the grayscale data matrix GrayP3 can be directly used as the frequency domain data to obtain the target spectrogram.
  • the image and the audio can be closely related, and the correlation between the audio and the image is greatly improved.
  • the target audio refers to the generated audio in which image information such as the above-mentioned two-dimensional image information is embedded.
  • using the target spectrogram to generate the target audio corresponding to the target spectrogram may include the following steps: acquiring the time domain signal corresponding to each frame of frequency domain data of the target spectrogram, and splicing the time domain signals of the frames to obtain the target audio.
  • the frequency domain data of each frame of the target spectrogram can be flipped up and down, and the complex conjugate of the flipped frequency domain data can be taken; an inverse Fourier transform can then be performed on each frame of conjugated frequency domain data to obtain the time domain signal corresponding to that frame, and the time domain signals of all frames can be synthesized into the target audio.
  • since the frequency domain data of the target spectrogram is conjugate-symmetric, when synthesizing the frequency domain data of the target spectrogram into a time domain signal, if each frame of frequency domain data of the spectrogram has 2^N + 1 values, only the 2nd through (2^N)/2-th values of the frequency domain data need to be flipped up and down, with the complex conjugate of the flipped frequency domain data taken, where N is a positive integer. For example, if each frame of frequency domain data of the target spectrogram has 1025 values, only the 2nd through 512th values need to be flipped up and down and conjugated. An inverse Fourier transform can then be performed on each frame of conjugated frequency domain data to obtain the corresponding time domain signal, so that each frame of frequency domain data of the target spectrogram is converted into a time domain signal.
  • the time domain signals of the frames can then be overlap-added and spliced according to a certain overlap rate to obtain a complete audio signal.
  • the audio represented by the audio signal may be referred to as target audio.
  • the target audio is embedded with image information, and the user can intuitively feel the changes brought by the image information to the original audio or the unique sound directly formed by the image information.
  • the process of step S104 is shown in FIG. 4: the target spectrogram is composed of multiple frames of frequency domain data, each frame of frequency domain data is converted into a corresponding time domain signal, and the multiple frames of time domain signals are overlap-added and spliced into an audio signal.
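The synthesis step of FIG. 4 can be sketched in NumPy: rebuild the full conjugate-symmetric spectrum from each one-sided frame (the "flip and conjugate" described above), inverse-FFT each frame, and overlap-add the results. The rectangular, non-overlapping framing in the demo is a simplification for illustration; a real implementation would apply a synthesis window and a fractional overlap.

```python
import numpy as np

def frames_to_audio(spec, hop):
    """Synthesize audio from a one-sided spectrogram (n_bins per frame):
    mirror-and-conjugate the interior bins to restore a full conjugate-
    symmetric spectrum, inverse-FFT each frame, then overlap-add at the
    given hop (frame shift)."""
    n_bins, n_frames = spec.shape
    n_fft = 2 * (n_bins - 1)
    out = np.zeros(hop * (n_frames - 1) + n_fft)
    for k in range(n_frames):
        # Flip bins 2 .. n_bins-1 and conjugate to form the mirrored half.
        full = np.concatenate([spec[:, k], np.conj(spec[-2:0:-1, k])])
        out[k * hop:k * hop + n_fft] += np.fft.ifft(full).real
    return out

# Round trip: analyse a cosine with non-overlapping rectangular frames,
# then resynthesize; with hop == n_fft this inverts the analysis exactly.
n_fft, hop = 16, 16
t = np.arange(64)
x = np.cos(2 * np.pi * t / 8)
spec = np.stack([np.fft.rfft(x[i:i + n_fft]) for i in range(0, 64, hop)], axis=1)
y = frames_to_audio(spec, hop)
```

Note that np.fft.irfft performs the mirror-and-conjugate reconstruction internally; the explicit concatenation above just makes the flipping step visible.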
  • after the target audio is obtained, an audio playback instruction input by the user can be received; in response to the audio playback instruction, the target audio is played, and the portion of the target spectrogram corresponding to the playback progress is displayed according to the playback progress of the target audio.
  • the embedded image can be displayed progressively in association with the playback progress of the audio.
  • when a playback instruction for the target audio is received, the target audio can be played; when playback reaches time point t1, the target spectrogram of the region between 0 and t1 is displayed, when playback reaches time point t2, the target spectrogram of the region between 0 and t2 is displayed, and when playback is complete, the complete target spectrogram is displayed.
  • when a sharing instruction for the target audio is received, the target audio can be shared with a target object, and the target object can be a contact or a function module in the application software, which is not limited here.
  • the target audio with image information can be obtained.
  • the target spectrogram of the target audio can be gradually displayed as the music plays, so that the user can intuitively see the image information embedded in the audio, and the obtained target audio can also be shared with other users.
  • the user imports a picture a and a piece of audio b from the terminal.
  • the audio c embedded with the image a can be obtained.
  • when the audio c is played, the spectrogram of the audio is gradually revealed following the playback progress, so that the user can intuitively see the embedded image information.
  • the user uses the camera of the terminal to shoot a video with dynamic changes.
  • a plurality of two-dimensional images representing the changes of the user's actions are intercepted from the dynamic video.
  • after the plurality of two-dimensional images are processed, audio d is obtained, and audio d presents the sound effects brought about by the dynamic changes.
  • the above embodiments introduce the technical solutions of the present application as a whole.
  • the methods of acquiring audio according to image information in the present application can be divided into two types.
  • the main difference lies in the acquisition methods of the target spectrogram.
  • one is to obtain the audio by transforming the original spectrogram with the target grayscale image; the other is to obtain the audio by constructing a spectrogram from the target grayscale image. In either case, the target audio is obtained by transforming or constructing a spectrogram, so that image information is embedded in the audio and closely combined with it: the image gains a sound-emitting function, and the audio spectrogram contains the image information.
  • the purpose of embedding image information in the audio can thus be achieved, so that the image has a sound-emitting function while the audio contains image information, greatly improving the correlation between the audio and the image; the operation process is also highly flexible and entertaining.
  • FIG. 5 is a schematic flowchart of another audio generation method provided by an embodiment of the present application.
  • the audio generation method transforms the spectrogram of the audio based on the target grayscale image of the two-dimensional image to obtain the target spectrogram, and then obtains the target audio, including the following steps S501-S504.
  • S501 Receive an audio generation instruction input by a user, and in response to the audio generation instruction, acquire a target grayscale image of the two-dimensional image.
  • if the spectrogram of the original audio is to be transformed based on the target grayscale image of the two-dimensional image to obtain the target spectrogram, then when the original grayscale image of the two-dimensional image is proportionally scaled, the height of the original grayscale image is scaled to the same height as the original spectrogram.
  • the audio selection instruction is used to indicate the original audio required for generating the target audio.
  • the original audio may be a locally stored audio file or an audio file temporarily downloaded on other storage devices.
  • the content of the audio file may be music, conversation content, noise, etc., which is not limited in this application.
  • the process of obtaining the original spectrogram according to the original audio may be as shown in FIG. 6 .
  • the time-domain signal of the original audio can be divided into frames to obtain multiple frames of time-domain signals, where the frame length is the time length of each frame and the frame shift is the time offset between the starts of two adjacent frames. For example, if the k-th frame of the time-domain signal starts at time t and ends at time t+E, and the (k+1)-th frame starts at time t+L and ends at time t+E+L, then the frame length is E and the frame shift is L.
  • windowing is applied to each frame of the time-domain signal; the length of the window function should match the frame length, and the window function may be a Hanning window, rectangular window, triangular window, Hamming window, Gaussian window, etc.
  • Perform fast Fourier transform (FFT) on each frame of the multi-frame windowed time-domain signal to obtain multi-frame frequency-domain data;
  • the frames of frequency-domain data are then arranged into the original spectrogram: for example, within each frame the frequency-domain data is arranged from bottom to top in order of increasing frequency, and the frames are arranged horizontally in chronological order. The horizontal axis of the original spectrogram is time and the vertical axis is frequency; the value of each coordinate point is an energy value, whose magnitude is represented by color depth.
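The framing, windowing, FFT, and arrangement steps above amount to a short-time Fourier transform; a minimal NumPy sketch follows. The Hanning window and the specific test tone are illustrative choices, and the function name is hypothetical.

```python
import numpy as np

def spectrogram(x, frame_len, hop):
    """Original spectrogram: split x into overlapping frames (frame
    length E = frame_len, frame shift L = hop), apply a Hanning window,
    FFT each frame, and stack the frames as columns (time on the
    horizontal axis, one frame of frequency-domain data per column)."""
    win = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    cols = [np.fft.rfft(x[k * hop:k * hop + frame_len] * win)
            for k in range(n_frames)]
    return np.stack(cols, axis=1)  # shape: (frame_len // 2 + 1, n_frames)

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 1000 * t)            # a 1 kHz tone
S = spectrogram(x, frame_len=256, hop=128)  # 2**K = 256 samples -> 129 bins
```

With a frame length of 2^K = 256 samples, each column has (2^K)/2 + 1 = 129 frequency bins, matching the bin count discussed in the text, and the tone concentrates its energy at bin 1000/8000 * 256 = 32.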
  • when performing a fast Fourier transform (FFT) on each frame of the windowed time-domain signal to obtain multiple frames of frequency-domain data, if the length of each windowed frame is 2^K samples, the time complexity of the Fourier transform can be reduced, improving its computational efficiency.
  • the frequency-domain data corresponding to each frame of the time-domain signal then has (2^K)/2 + 1 = 2^N + 1 values, where K is a positive integer and N = K - 1 is an integer greater than or equal to 0.
  • the grayscale data of each pixel in the target grayscale image can be represented by a grayscale data matrix, and in the grayscale data matrix, each value represents the value of a pixel at a corresponding position in the target grayscale image.
  • using the grayscale data of each pixel in the target grayscale image to process the frequency domain data of each pixel in the original spectrogram to obtain the target spectrogram may include the following operations: the grayscale data matrix is flipped upside down, and the flipped grayscale data matrix is used as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram, obtaining the target spectrogram.
  • the upside-down flip may mean reversing the grayscale data matrix along the vertical (Y-axis) direction.
  • the grayscale data matrix is (0.1, 0.2, 0.3; 0.4, 0.5, 0.6; 0.7, 0.8, 0.9)
  • the grayscale data matrix after upside-down processing is (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3).
  • when weighting the frequency domain data of each pixel in the original spectrogram, all of the frequency domain data may be weighted; note, however, that the frequency domain data of the original spectrogram has conjugate symmetry.
  • the effect of the obtained target spectrogram is shown in FIG. 7a: the part enclosed by the dotted line is the embedded two-dimensional image, and outside the dotted line is the frequency domain data of the original spectrogram; the horizontal axis of the original spectrogram is time, the vertical axis is frequency, and the depth of the color represents the energy value of the corresponding coordinate point. It can be seen that the height of the embedded two-dimensional image is equal to the height of the original spectrogram, because step S501 scales the height of the original grayscale image to equal the height of the original spectrogram.
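The flip-and-weight transformation can be sketched in a few lines of NumPy, reusing the example matrix given above; the function name is hypothetical and the toy spectrogram values are illustrative.

```python
import numpy as np

def embed_image(spec, gray):
    """Transform the original spectrogram: flip the grayscale data
    matrix upside down, then use it as a per-bin weighting factor
    on the frequency-domain data (element-wise multiplication)."""
    return spec * np.flipud(gray)

# The example matrix from the text: flipud turns
# (0.1, 0.2, 0.3; 0.4, 0.5, 0.6; 0.7, 0.8, 0.9)
# into (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3).
gray = np.array([[0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6],
                 [0.7, 0.8, 0.9]])
spec = np.full((3, 3), 2.0 + 0j)   # toy 3x3 original spectrogram
target = embed_image(spec, gray)
```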
  • the flipped grayscale data matrix can also be downsampled to reduce its size, and the downsampled grayscale data matrix can be used as a weighting factor to weight only part of the frequency domain data of the original spectrogram, obtaining a target spectrogram in which the two-dimensional image is embedded at a local position of the original spectrogram.
  • the frequency domain data has 2 ⁇ N+1 data
  • the height of the grayscale data matrix has 2 ⁇ N+1 pixels
  • the grayscale data matrix is downsampled, and the downsampling factor is 1/2
  • the grayscale data The height of the matrix becomes 2 ⁇ N/2+1
  • the Mth to (M+2^N/2+1)th data of the frequency domain data can be weighted, so that the obtained target spectrogram contains image information only in the Mth to (M+2^N/2+1)th data of the frequency domain data, where M and N are positive integers.
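The local-band embedding above can be sketched as follows. The toy sizes, the start row `M`, and the random matrix are assumptions for illustration; only rows M through M+h-1 of the spectrogram get weighted:

```python
import numpy as np

# Downsample the flipped grayscale matrix by 1/2 along its height,
# then weight only a local band of spectrogram rows with it.
N = 4
gray = np.random.default_rng(0).random((2**N + 1, 8))  # height 2^N + 1
flipped = np.flipud(gray)
down = flipped[::2, :]          # 1/2 downsampling -> height 2^(N-1) + 1

spec = np.ones((2**N + 1, 8), dtype=complex)  # toy half-spectrum
M = 3                                         # start row (assumed)
h = down.shape[0]
target = spec.copy()
target[M:M + h, :] *= down      # weight only the local frequency band
```

Rows outside the band stay untouched, which is what makes the image appear only at a local position of the spectrogram.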
  • the effect of the target spectrogram obtained through this step is shown in Figure 7b: the part enclosed by the dotted line is the embedded image, and the area outside the dotted line is the frequency domain data of the original spectrogram. The horizontal axis of the original spectrogram is time, the vertical axis is frequency, and the depth of the color represents the energy value of the corresponding coordinate point. It can be seen that the height of the embedded two-dimensional image is not equal to the height of the original spectrogram, and the embedded image exists only at a local position of the original spectrogram.
  • if the weighting factor is scaled to a smaller value, the impact of the embedded information on the original audio after weighting will be very small; the synthesized target audio is roughly the same as the original audio, while the image information is still embedded in the target audio.
  • in step S104, when synthesizing each frame of the time domain signal into audio, the overlap rate can be determined from the frame shift and frame length used in the framing of step S502.
  • the time domain signals of each frame are overlap-added and spliced together to obtain a complete audio signal, which is the target audio.
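The overlap-add splicing above can be sketched as follows; the toy sample rate and all-ones frames are assumptions, with a 30ms frame length and 15ms frame shift giving 50% overlap:

```python
import numpy as np

# Overlap-add synthesis: each frame is added into the output buffer
# at an offset of (frame index) * (frame shift).
sr = 1000                       # toy sample rate (assumed)
frame_len = 30 * sr // 1000     # 30 ms -> 30 samples
frame_shift = 15 * sr // 1000   # 15 ms -> 15 samples (50% overlap)

frames = [np.ones(frame_len) for _ in range(4)]
out = np.zeros(frame_shift * (len(frames) - 1) + frame_len)
for i, f in enumerate(frames):
    start = i * frame_shift
    out[start:start + frame_len] += f   # overlap-add
```

In the overlapping region two frames contribute, so the all-ones frames sum to 2 there; in practice a window function makes the overlapping contributions sum smoothly.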
  • the grayscale data matrix is used as a weighting factor to weight the frequency domain data of the original spectrogram to obtain the target spectrogram.
  • the time domain signal is obtained by inverse Fourier transform, and the time domain signals are then overlap-added and spliced to finally obtain the target audio; that is, the audio is obtained by transforming the original spectrogram. It can be seen that through the transformation of the original spectrogram, the image can be embedded in the target audio, so that the image has a sound-emitting function and the audio also contains image information, which greatly improves the correlation between the audio and the image.
  • the following describes the method by way of example: the method proposed in the embodiment of the present application is applied in a certain music playing software, where the user creates an image and the original spectrogram is transformed to obtain new audio.
  • the music playing software here may run on, but is not limited to, a mobile phone terminal, a computer terminal, and the like.
  • a temporary creation area is provided; the user creates content in the temporary creation area and saves the created content in a picture format, and at the same time selects the audio file to be transformed.
  • the created image is processed in step S501 to obtain a target grayscale image, wherein the height of the target grayscale image is scaled to 2^10+1 = 1025 pixels; this setting is made so that the height is consistent with the height of the original spectrogram;
  • the original spectrogram of the audio file is obtained according to step S502.
  • the frame length is 30ms
  • the length of the window function is the same as the frame length of 30ms.
  • each frame of frequency domain data of the original spectrogram has 1025 data, matching the 1025-pixel height of the target grayscale image, so the entire frequency domain data of each frame can be weighted to obtain the target spectrogram;
  • the 2nd to 512th data of each frame of frequency domain data of the target spectrogram are flipped upside down, and the complex values of the flipped frequency domain data are conjugated; an inverse Fourier transform is performed on each frame of frequency domain data after conjugation to obtain the time domain signal corresponding to each frame of frequency domain data;
  • the time domain signal of each frame is synthesized into the target audio at an overlap rate of 15ms/30ms, i.e. 50%, the ratio of the frame shift to the frame length.
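The flip-conjugate-then-inverse-transform step relies on the conjugate symmetry of a real signal's spectrum: from the stored half spectrum, the mirrored half is rebuilt by flipping and conjugating before the inverse FFT. A sketch with an assumed toy FFT size (the text uses much larger frames):

```python
import numpy as np

# Rebuild a full conjugate-symmetric spectrum from a half spectrum,
# then inverse-transform it to recover a (near-)real time-domain frame.
n = 8                                          # toy FFT size (assumed)
half = np.fft.rfft(np.arange(n, dtype=float))  # half spectrum, n//2 + 1 bins
mirror = np.conj(half[1:-1][::-1])             # flip upside down, conjugate
full = np.concatenate([half, mirror])          # full symmetric spectrum
frame = np.fft.ifft(full)                      # inverse Fourier transform
```

Because the rebuilt spectrum is exactly conjugate-symmetric, the inverse transform's imaginary part vanishes (up to rounding), and the real part recovers the original frame.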
  • the final generated target audio file contains the content created in the authoring area.
  • the height of the target spectrogram of the target audio is consistent with the height of the target grayscale image of the embedded two-dimensional image.
  • the target spectrogram of the target audio is shown in Figures 8a and 8b. It can be seen that the two-dimensional image forms part of the target spectrogram, and along the frequency axis the height of the embedded image equals the height of the target spectrogram; the magnitudes of the energy values of the target spectrogram correspond to the grayscale data of each pixel of the target grayscale image of the two-dimensional image.
  • the generated target audio can also be shared with other users, and the audio effect after embedding the image can be shared with friends.
  • the user selects an image to be embedded in the audio and, at the same time, selects the original audio file to be transformed.
  • the image is processed in step S501 to obtain a target grayscale image, wherein the height of the target grayscale image is scaled to 2^10+1 pixels; at the same time, the original spectrogram of the original audio file is obtained according to step S503; for the original audio:
  • the frame length is 40ms
  • the frame shift is 20ms.
  • the length of the window function is the same as the 40ms frame length, and a Hanning window is used;
  • after downsampling, the grayscale data matrix becomes 513*513, while each frame of frequency domain data of the original spectrogram has 1025 data.
  • part of the frequency domain data is weighted. For example, if the size of the downsampled grayscale data matrix is 513*513, the 100th to 612th data in the frequency domain data can be weighted to obtain the target spectrogram.
  • the obtained target spectrogram contains image information only in the 100th to 612th data of the frequency domain data
  • the 100th to 612th data can also be other continuous frequency domain data, such as the 200th to 712th data, the 313th to 825th data, etc.
  • the target spectrogram is processed according to steps S505-S506, wherein, since the real signal has conjugate symmetry, the 2nd to 512th data of each frame of frequency domain data of the target spectrogram are flipped upside down.
  • the time domain signal of each frame is synthesized into the target audio at an overlap rate of 20ms/40ms, i.e. 50%, the ratio of the frame shift to the frame length.
  • the final generated target audio file contains the information of the imported image.
  • the height of the target spectrogram of the target audio is inconsistent with the height of the embedded image.
  • the image is a part of the target spectrogram, and along the frequency axis the height of the image accounts for only a part of the height of the target spectrogram; the energy values at the embedded region correspond to the grayscale data of each pixel of the target grayscale image.
  • the generated target audio can also be shared with other users, and the audio effect after embedding the image can be shared with friends.
  • multiple two-dimensional images can be acquired as the two-dimensional images that need to be embedded in the original audio.
  • the grayscale difference values between two-dimensional images with adjacent acquisition times among the plurality of two-dimensional images can be calculated respectively to obtain multiple grayscale difference values; these grayscale difference values are then arranged according to their corresponding acquisition times to obtain the target grayscale image, and the grayscale data of each pixel in the target grayscale image is used to process the frequency domain data of each pixel in the original spectrogram corresponding to the original audio to obtain the target spectrogram.
  • for example, target grayscale images M1, M2, M3 corresponding to three two-dimensional images are obtained; the grayscale images of every two images collected at adjacent times are differenced to obtain two grayscale difference values: M2-M1 and M3-M2.
  • the two grayscale difference values are arranged in time order, thereby obtaining the target grayscale image corresponding to the multiple two-dimensional images.
  • the original spectrogram of the original audio is obtained again according to step S502; this target grayscale image is used as a weighting factor to weight the frequency domain data of the original spectrogram according to the operation of step S503, thereby obtaining the target spectrogram, from which the target audio is then obtained.
  • the original audio can be transformed through a plurality of two-dimensional images, so that the original audio contains the change information of the images in the video.
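The adjacent-frame differencing above can be sketched as follows. The constant matrices and the choice to arrange the differences side by side along the time axis are assumptions for illustration:

```python
import numpy as np

# Build the target grayscale image from differences of grayscale
# images collected at adjacent times (M2 - M1, M3 - M2).
M1 = np.full((3, 3), 0.1)
M2 = np.full((3, 3), 0.3)
M3 = np.full((3, 3), 0.6)
frames = np.stack([M1, M2, M3])        # ordered by acquisition time
diffs = np.diff(frames, axis=0)        # [M2 - M1, M3 - M2]
target_gray = np.concatenate(list(diffs), axis=1)  # arranged in time order
```

Each difference matrix captures only what changed between two acquisitions, so the resulting target grayscale image encodes the motion rather than the static content.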
  • FIG. 9 is a schematic flowchart of another audio generation method provided by an embodiment of the present application.
  • the audio generation method constructs a target spectrogram of the audio based on the target grayscale image of the two-dimensional image, thereby obtaining the target audio, including the following steps S901-S903.
  • the upside-down flip of the grayscale data matrix refers to reversing the grayscale data matrix along the Y-axis (row) direction.
  • the grayscale data matrix is (0.1, 0.2, 0.3; 0.4, 0.5, 0.6; 0.7, 0.8, 0.9)
  • the grayscale data matrix after upside-down processing is (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3).
  • the flipped grayscale data matrix is used as the frequency domain data of each pixel in the target spectrogram; in other words, each value of the grayscale data matrix is taken as the pixel data at the corresponding position of the target spectrogram, i.e. the energy value of that pixel. The energy value can be represented by color depth in the target spectrogram, or different energy magnitudes can be represented by different hues, which is not limited here.
  • when the grayscale data matrix is used as the frequency domain data, the greater the value in the grayscale data matrix, the greater the energy value at the corresponding position of the target spectrogram. For example, it is preset that in the obtained target spectrogram, the larger the energy value, the darker the color.
  • for example, if the flipped grayscale data matrix GrayP3 is (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3), then after the grayscale values are converted into the energy values of the frequency domain data, the position corresponding to 0.9, the largest value, has a darker color than the positions of the other data; through this color-depth relationship, the embedded two-dimensional image can be rendered in the target spectrogram.
  • alternatively, when the grayscale data matrix is used as the frequency domain data, the smaller the value in the grayscale data matrix, the greater the energy at the corresponding position of the target spectrogram.
  • a scale factor can be used to adjust the numerical values of the grayscale data matrix, so as to adjust the energy level of the obtained target spectrogram.
  • if the flipped grayscale data matrix GrayP3 is (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3) and the scale factor is 1.1, the grayscale data matrix becomes (0.77, 0.88, 0.99; 0.44, 0.55, 0.66; 0.11, 0.22, 0.33).
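The direct-construction variant with a scale factor can be sketched as follows, using the matrix and factor from the text (the variable names are assumptions):

```python
import numpy as np

# The flipped grayscale matrix itself becomes the spectrogram energy;
# a scale factor adjusts the overall energy level.
gray_p3 = np.array([[0.7, 0.8, 0.9],
                    [0.4, 0.5, 0.6],
                    [0.1, 0.2, 0.3]])  # already flipped, per the text
scale = 1.1
target_spec = scale * gray_p3          # energy of each spectrogram pixel
```

Unlike the weighting variant, no original spectrogram is involved here: the image data is the spectrogram.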
  • in step S104, because the target spectrogram in this embodiment is obtained by directly using the grayscale data matrix as the frequency domain data of the target spectrogram, rather than by weighting the original spectrogram with the grayscale data matrix, when each frame of the time domain signal is overlap-added and spliced, any overlap rate from 0-100% (excluding 100%) can be selected to obtain a complete audio signal.
  • the audio signal is the target audio.
  • the grayscale data matrix of the target grayscale image is used as the frequency domain data to obtain the target spectrogram; an inverse Fourier transform is performed on each frame of frequency domain data of the target spectrogram to obtain the time domain signal, and the time domain signals are then overlap-added and spliced to finally obtain the target audio, that is, the target audio is obtained by constructing the target spectrogram.
  • when the embedded two-dimensional images are multiple images capturing changes in user actions, sound effects reflecting the changing characteristics of those images can be obtained.
  • the target audio is obtained through the construction of the spectrogram, so as to achieve the purpose of embedding image information in the audio, so that the image has a sound-emitting function while the audio contains image information, greatly improving the correlation between the audio and the image.
  • the following takes as an example the case where the embedded image is a constantly changing gesture image in a video stream.
  • in the music playback software, if the user records with a fixed camera while freely waving a finger in front of it, the video stream contains multiple gesture images. A first gesture image and a second gesture image are collected at an interval of 100ms; the processing of step S201 is performed on both to obtain the target grayscale images corresponding to the first and second gesture images, the difference between these two target grayscale images is calculated, and the target grayscale image corresponding to the multiple gesture images is determined from the grayscale difference values.
  • for example, if the grayscale data matrix of the first gesture image is (0.1, 0.2, 0.3; 0.4, 0.5, 0.6; 0.7, 0.8, 0.9) and the grayscale data matrix of the second gesture image is (0.11, 0.23, 0.34; 0.48, 0.56, 0.64; 0.78, 0.89, 0.92), the grayscale difference is (0.01, 0.03, 0.04; 0.08, 0.06, 0.04; 0.08, 0.09, 0.02).
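The gesture difference above can be checked numerically with the two matrices from the text (note that 0.23 − 0.2 = 0.03):

```python
import numpy as np

# Element-wise difference between the two gesture grayscale matrices.
g1 = np.array([[0.1, 0.2, 0.3],
               [0.4, 0.5, 0.6],
               [0.7, 0.8, 0.9]])
g2 = np.array([[0.11, 0.23, 0.34],
               [0.48, 0.56, 0.64],
               [0.78, 0.89, 0.92]])
diff = g2 - g1   # grayscale difference matrix
```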
  • the grayscale data matrix is flipped upside down; the flipped grayscale data matrix is used as the frequency domain data of the target spectrogram to obtain the target spectrogram.
  • step S904 is performed on the target spectrogram, and the time domain signals of each frame are overlap-added and spliced at an overlap rate of 60% to obtain the target audio.
  • the above operations can be performed multiple times in the video stream, so that multiple gesture changes can be heard in the resulting target audio.
  • there are multiple gesture images in the video stream and each image is collected at an interval of 100ms.
  • the obtained audio reflects the sound effect brought by the change of the dynamic image in the video, and the generated audio can also be shared with other users to share the peculiar sound effect brought by the dynamic change with friends.
  • the embodiment of the present invention further discloses an audio generation apparatus.
  • the audio generating apparatus may be a computer program (including program codes/program instructions) running in an audio generating device such as a terminal.
  • the audio generation apparatus may perform the methods of FIGS. 1 , 5 and 9 .
  • the audio generation device can run the following modules:
  • an acquisition module 1001 configured to receive an audio generation instruction input by a user, where the audio generation instruction is used to indicate a two-dimensional image that the user wants to embed in the generated target audio;
  • the acquiring module 1001 is further configured to acquire the target grayscale image of the two-dimensional image in response to the audio generation instruction;
  • the processing module 1002 is used to convert the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram;
  • the processing module 1002 is further configured to use the target spectrogram to generate target audio corresponding to the target spectrogram.
  • the processing module 1002 is further configured to receive an audio selection instruction input by the user, where the audio selection instruction is used to indicate the original audio required for generating the target audio, and, in response to the audio selection instruction, obtain the original spectrogram corresponding to the original audio; when converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram, it can be specifically configured to: use the grayscale data of each pixel in the grayscale image to process the frequency domain data of each pixel in the original spectrogram to obtain the target spectrogram.
  • the grayscale data of each pixel is a grayscale data matrix
  • when the processing module 1002 uses the grayscale data of each pixel in the target grayscale image to process the frequency domain data of each pixel in the original spectrogram to obtain the target spectrogram, it is specifically configured to: flip the grayscale data matrix upside down; use the flipped grayscale data matrix as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram to obtain the target spectrogram.
  • the grayscale data of each pixel is a grayscale data matrix
  • when the processing module 1002 uses the grayscale data of each pixel in the target grayscale image to process the frequency domain data of each pixel in the original spectrogram, it is specifically configured to: flip the grayscale data matrix upside down and downsample the flipped grayscale data matrix; use the downsampled grayscale data matrix as a weighting factor to weight part of the frequency domain data of the original spectrogram to obtain the target spectrogram.
  • the grayscale data of each pixel is a grayscale data matrix
  • when the processing module 1002 converts the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram, it is specifically configured to: flip the grayscale data matrix upside down, and use the flipped grayscale data matrix as the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram.
  • when the processing module 1002 uses the target spectrogram to generate the target audio corresponding to the target spectrogram, it is specifically configured to: flip each frame of frequency domain data of the target spectrogram upside down and conjugate the complex values of the flipped frequency domain data; perform an inverse Fourier transform on each frame of frequency domain data after conjugation to obtain the time domain signal corresponding to each frame of frequency domain data, and synthesize the time domain signals of each frame into the target audio.
  • when the processing module 1002 is used to acquire the target grayscale image of the two-dimensional image, it is specifically configured to: acquire the original grayscale image of the two-dimensional image and perform proportional scaling on the original grayscale image to obtain a scaled grayscale image; normalize the scaled grayscale image to obtain the target grayscale image of the two-dimensional image.
  • the two-dimensional image includes a plurality of two-dimensional images used to capture changes in user actions; when acquiring the grayscale image of the two-dimensional image, the processing module 1002 is specifically configured to: respectively calculate the grayscale difference values between two-dimensional images with adjacent acquisition times among the plurality of two-dimensional images to obtain a plurality of grayscale difference values; arrange the plurality of grayscale difference values according to their corresponding acquisition times to obtain the target grayscale image.
  • the processing module 1002 is further configured to receive an audio playback instruction input by a user; in response to the audio playback instruction, play the target audio and display the target spectrogram of the area corresponding to the playback progress of the target audio.
  • each step involved in the methods shown in FIGS. 1 , 5 and 9 may be performed by various modules in the audio generating apparatus shown in FIG. 10 .
  • steps S101 and S102 shown in FIG. 1 may be performed by the acquisition module 1001 shown in FIG. 10
  • steps S103 and S104 may be performed by the processing module 1002 shown in FIG. 10 .
  • each module in the audio generation apparatus shown in FIG. 10 may be separately or jointly combined into one or several other modules, or some module(s) may be further split into multiple modules with smaller functions, which can realize the same operations without affecting the technical effects of the embodiments of the present invention.
  • the above modules are divided based on logical functions.
  • the function of one module may also be implemented by multiple modules, or the functions of multiple modules may be implemented by one module.
  • the audio generating apparatus may also include other modules.
  • these functions may also be implemented with the assistance of other modules, and may be implemented by cooperation of multiple modules.
  • by responding to the audio generation instruction, the target grayscale image of the two-dimensional image that the user wants to embed in the generated target audio can be obtained; the grayscale data of each pixel in the target grayscale image is converted into the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram, that is, the two-dimensional image is associated with the target spectrogram of the target audio; the target spectrogram is then used to generate the corresponding target audio, thereby realizing the generation of the target audio from the two-dimensional image.
  • the embodiment of the present application can achieve the purpose of embedding image information in the audio, so that the image has a sound-emitting function, and at the same time, the audio can contain image information, which greatly improves the correlation between the audio and the image.
  • an embodiment of the present invention further provides an audio generation device.
  • the device at least includes a processor 1101 and a memory 1102, and the processor 1101 and the memory 1102 are connected to each other.
  • the audio generation device may further include an input device 1103 and/or an output device 1104 .
  • the processor 1101, the input device 1103, the output device 1104, and the memory 1102 may be connected by a bus or other means.
  • the memory 1102 may be used to store a computer program (or may be used to store a computer (readable) storage medium including a computer program) comprising program instructions, the processor 1101 being configured to invoke the program instruction.
  • the processor 1101 (also called a CPU (Central Processing Unit)) is the computing core and control core of the device; it is configured to call the program instructions, and is specifically adapted to load and execute the program instructions to realize the above method flows or corresponding functions.
  • Input device 1103 may include one or more of a keyboard, touch screen, radio frequency receiver, or other input device;
  • output device 1104 may include one or more of a display screen, a speaker, a radio frequency transmitter, or other output devices.
  • the device may further include a memory module, a power module, an application client, and the like.
  • the processor 1101 described in this embodiment of the present invention may be configured to perform a series of audio generation processing, including: receiving an audio generation instruction input by a user, where the audio generation instruction is used to indicate the two-dimensional image that the user wants to embed in the generated target audio; in response to the audio generation instruction, acquiring a grayscale image of the two-dimensional image; converting the grayscale data of each pixel in the grayscale image into the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram; and using the target spectrogram to generate the target audio corresponding to the target spectrogram.
  • Embodiments of the present invention also provide a computer (readable) storage medium, where the computer storage medium may be a memory device in the device, used to store programs and data. It can be understood that, the computer storage medium here may include both the built-in storage medium in the device, and certainly also the extended storage medium supported by the device.
  • the computer storage medium provides storage space in which an operating system of an audio generating device such as a terminal is stored.
  • program instructions suitable for being loaded and executed by the processor 1101 are also stored in the storage space, and these instructions may be one or more computer programs (including program codes).
  • the computer storage medium here can be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it can also be at least one computer storage medium located far away from the aforementioned processor 1101.
  • the program instructions in the computer storage medium can be loaded and executed by the processor 1101 to implement the corresponding steps of the method in the foregoing embodiments; for example, in a specific implementation, the program instructions in the computer storage medium are loaded by the processor 1101 to perform the following steps:
  • the audio generation instruction is used to indicate the two-dimensional image that the user wants to embed in the generated target audio
  • the target audio corresponding to the target spectrogram is generated by using the target spectrogram.
  • the program instructions can also be loaded and executed by the processor 1101: receive an audio selection instruction input by the user, the audio selection instruction being used to indicate the original audio required for generating the target audio, and, in response to the audio selection instruction, obtain the original spectrogram corresponding to the original audio; when converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram,
  • the program instructions can also be loaded by the processor 1101 and specifically executed: use the grayscale data of each pixel in the target grayscale image to process the frequency domain data of each pixel in the original spectrogram, Get the target spectrogram.
  • the grayscale data of each pixel is a grayscale data matrix
  • when the grayscale data of each pixel in the target grayscale image is used to process the frequency domain data of each pixel in the original spectrogram,
  • the program instructions can also be loaded by the processor 1101 and specifically executed: flip the grayscale data matrix upside down; use the flipped grayscale data matrix as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram to obtain the target spectrogram.
  • the grayscale data of each pixel is a grayscale data matrix
  • when the grayscale data of each pixel in the target grayscale image is used to process the frequency domain data of each pixel in the original spectrogram,
  • the program instructions can also be loaded by the processor 1101 and specifically executed: flip the grayscale data matrix upside down and downsample the flipped grayscale data matrix; use the downsampled grayscale data matrix as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram to obtain the target spectrogram.
  • the grayscale data of each pixel is a grayscale data matrix
  • when the grayscale data of each pixel in the target grayscale image is converted into the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram,
  • the program instructions can also be loaded by the processor 1101 and specifically executed: flip the grayscale data matrix upside down, and use the flipped grayscale data matrix as the frequency domain data of each pixel in the spectrogram to obtain the target spectrogram.
  • the program instructions can also be loaded by the processor 1101 and specifically executed: flip each frame of frequency domain data of the target spectrogram upside down, and conjugate the complex values of the flipped frequency domain data; perform an inverse Fourier transform on each frame of frequency domain data after conjugation to obtain the time domain signal corresponding to each frame of frequency domain data, and synthesize the time domain signals of each frame into the target audio.
  • when acquiring the target grayscale image of the two-dimensional image, the program instructions can also be loaded by the processor 1101 and specifically executed: acquire the original grayscale image of the two-dimensional image, perform proportional scaling on the original grayscale image to obtain a scaled grayscale image, and normalize the scaled grayscale image to obtain the target grayscale image of the two-dimensional image.
  • the two-dimensional image includes a plurality of two-dimensional images used to capture changes in user actions
  • when acquiring the target grayscale image of the two-dimensional image, the program instructions can also be loaded by the processor 1101 and specifically executed: respectively calculate the grayscale difference values between two-dimensional images with adjacent acquisition times among the plurality of two-dimensional images to obtain a plurality of grayscale difference values; arrange the plurality of grayscale difference values according to their corresponding acquisition times to obtain the target grayscale image.
  • the program instructions can also be loaded by the processor 1101 and specifically executed: receive an audio playback instruction input by the user; in response to the audio playback instruction, play the target audio and display the target spectrogram of the area corresponding to the playback progress of the target audio.
  • upon receiving an audio generation instruction, the target grayscale image of the two-dimensional image that the user wants to embed in the generated target audio can be acquired in response to that instruction, and the grayscale data of each of its pixels is converted into the frequency domain data of each pixel in a spectrogram to obtain the target spectrogram; that is, the two-dimensional image is associated with the target spectrogram of the target audio, and the target spectrogram is then used to generate the corresponding target audio, so that the target audio is generated from the two-dimensional image.
  • the embodiment of the present application can therefore achieve the purpose of embedding image information in the audio, so that the image gains a sound-emitting function while the audio carries image information, which greatly improves the correlation between the audio and the image.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Processing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

An audio generation method and device. The method includes: receiving an audio generation instruction input by a user (S101), the audio generation instruction indicating a two-dimensional image that the user wants to embed in the generated target audio; in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image (S102) (S501) (S901); converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram (S103); and using the target spectrogram to generate the target audio corresponding to the target spectrogram (S104) (S504) (S903). This achieves the purpose of embedding image information in audio, so that the image gains a sound-emitting function while the audio carries image information, greatly strengthening the correlation between audio and image.

Description

Audio generation method and device. Technical Field
This application relates to the technical field of audio processing, and in particular to an audio generation method and device.
Background
There are currently scenarios in which a picture is associated with audio, for example by directly using the picture as the cover of an audio file and then storing the picture and the audio in a new file format, so that the picture can be displayed directly when the user plays the audio. In this approach, the picture serves merely as the audio's cover image; the correlation between picture and audio is low and of limited practical use.
Summary
Embodiments of this application provide an image-processing-based audio generation method and device that can embed image information in audio, so that the image gains a sound-emitting function while the audio carries image information, greatly strengthening the correlation between audio and image.
In one aspect, an embodiment of this application discloses an audio generation method, including:
receiving an audio generation instruction input by a user, the audio generation instruction indicating a two-dimensional image that the user wants to embed in the generated target audio;
in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image;
converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; and
using the target spectrogram to generate the target audio corresponding to the target spectrogram.
In another aspect, an embodiment of this application provides an audio generation device, including:
a processor and a memory connected to each other, where the memory stores a computer program comprising program instructions, and the processor is configured to invoke the program instructions and perform the following steps:
receiving an audio generation instruction input by a user, the audio generation instruction indicating a two-dimensional image that the user wants to embed in the generated target audio;
in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image;
converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; and
using the target spectrogram to generate the target audio corresponding to the target spectrogram.
In yet another aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the following steps:
receiving an audio generation instruction input by a user, the audio generation instruction indicating a two-dimensional image that the user wants to embed in the generated target audio;
in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image;
converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; and
using the target spectrogram to generate the target audio corresponding to the target spectrogram.
Upon receiving an audio generation instruction, the embodiments of this application can, in response to that instruction, acquire the target grayscale image of the two-dimensional image that the user wants to embed in the generated target audio, and convert the grayscale data of each pixel in that target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; that is, the two-dimensional image is associated with the target spectrogram of the target audio. The target spectrogram is then used to generate the corresponding target audio, so that the target audio is generated from the two-dimensional image. The embodiments of this application thus achieve the purpose of embedding image information in audio, so that the image gains a sound-emitting function while the audio carries image information, greatly strengthening the correlation between audio and image.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are clearly only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of an audio generation method provided by an embodiment of this application;
Fig. 2 is a schematic flowchart of acquiring a target grayscale image provided by an embodiment of this application;
Fig. 3 is a schematic diagram of the effect of an image processing procedure provided by an embodiment of this application;
Fig. 4 is a schematic flowchart of synthesizing audio from a target spectrogram provided by an embodiment of this application;
Fig. 5 is a schematic flowchart of another audio generation method provided by an embodiment of this application;
Fig. 6 is a schematic flowchart of a method for acquiring an original spectrogram provided by an embodiment of this application;
Fig. 7a is a schematic diagram of the effect of a target spectrogram provided by an embodiment of this application;
Fig. 7b is a schematic diagram of the effect of another target spectrogram provided by an embodiment of this application;
Fig. 8a is an example rendering of a target spectrogram provided by an embodiment of this application;
Fig. 8b is another example rendering of a target spectrogram provided by an embodiment of this application;
Fig. 9 is a schematic flowchart of yet another audio generation method provided by an embodiment of this application;
Fig. 10 is a schematic structural diagram of an audio generation apparatus provided by an embodiment of this application;
Fig. 11 is a schematic structural diagram of an audio generation device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings.
Embodiments of this application can embed image information in audio, for example by modifying or constructing a spectrogram from the image information, yielding audio that carries image information. This strengthens the correlation between image and audio and lets the user perceive the image-bearing audio directly.
In this application, a spectrogram refers to a speech spectrum diagram. Its horizontal axis may be time and its vertical axis frequency; the value at each coordinate point represents the energy of the speech data, and the column of data at each time point represents the frequency domain data of one frame of the audio signal. The energy value is usually indicated by color depth, a darker color indicating greater energy, though other representations may be used; this application does not limit this.
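The spectrogram just described (time on the horizontal axis, one column of frequency domain data per frame) can be sketched in a few lines of numpy. The frame length, hop, and Hann window below are illustrative choices, not values taken from the patent:

```python
import numpy as np

def spectrogram(signal, frame_len=1024, hop=512):
    """Compute a simple magnitude spectrogram: rows are frequency bins
    (row 0 is the lowest frequency), columns are frames (time)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)], axis=1)
    # rfft keeps only the non-negative frequencies of a real signal,
    # i.e. frame_len // 2 + 1 bins per frame
    return np.abs(np.fft.rfft(frames, axis=0))

# A 1 kHz sine sampled at 8 kHz: energy concentrates in one frequency row.
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 1000 * t))
```

For the 1 kHz tone, every column peaks at bin 1000 * 1024 / 8000 = 128, illustrating the one-frame-per-column layout described above.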
The audio generation scheme of this application can be used in audio generation devices, for example in the various kinds of audio software installed on such a device, including but not limited to music players, audio editors, and audio converters. The audio generation device may be a terminal, a server, or other equipment; this application does not limit this. Optionally, a terminal here may include, but is not limited to, a smartphone, tablet, laptop, or desktop computer.
Based on the above description, the audio generation method, apparatus, device, and medium proposed by the embodiments of this application can obtain image-bearing audio by modifying or constructing a spectrogram from image information, thereby embedding image information in audio so that the image gains a sound-emitting function while the audio carries image information, greatly strengthening the correlation between audio and image. Each aspect is described in detail below.
Referring to Fig. 1, Fig. 1 is a schematic flowchart of an audio generation method provided by an embodiment of this application. The flow shown in Fig. 1 may include the following steps S101-S104.
S101. Receive an audio generation instruction input by a user.
The audio generation instruction indicates the two-dimensional image the user wants to embed in the generated target audio. The two-dimensional image may be an existing image stored in a picture format, content created in a temporary creation area, or a plurality of two-dimensional images capturing changes in user actions. If the content the user wants to embed in the generated target audio is a non-picture file such as text or a table, that file can be converted to a picture format and the resulting image embedded in the target audio. The picture format may be a static image file format such as jpg, png, bmp, or jpeg, without limitation here. For example, obtain the file to be embedded in the target audio and check its suffix; if it is not a picture format (for example vsd, xls, or doc), convert it to a picture format.
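The suffix check described above can be sketched as follows; the helper name and the set of accepted picture formats are illustrative assumptions, not part of the patent:

```python
from pathlib import Path

# Hypothetical helper: decide whether a file can be embedded directly
# or must first be rendered to a picture format.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}

def needs_conversion(filename: str) -> bool:
    """Return True when the file's suffix is not a picture format."""
    return Path(filename).suffix.lower() not in IMAGE_SUFFIXES
```

A doc or xls file would then be routed through a rendering step before the grayscale processing of step S102.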
S102. In response to the audio generation instruction, acquire a target grayscale image of the two-dimensional image.
The target grayscale image may be obtained by acquiring and processing the two-dimensional image, or a processed grayscale image may be fetched directly from storage as the target grayscale image; this application does not limit this. Optionally, the target grayscale image may also be called a target grayscale map, target grayscale information, or target grayscale matrix; it may be a matrix of grayscale data, a grid of pixel values, and so on, and the value at each position in it may be called a gray value or pixel value, without limitation here.
In a possible implementation, acquiring and processing the two-dimensional image may include: obtaining the original grayscale image of the two-dimensional image, scaling it proportionally, applying histogram equalization, normalizing it, and similar operations. For example, as shown in Fig. 2, acquiring the target grayscale image of the two-dimensional image may include the following steps S201-S202.
S201. Obtain the original grayscale image of the two-dimensional image and scale it proportionally to obtain a proportionally scaled grayscale image.
The original grayscale image follows the usual grayscale-image concept in image processing: each pixel has 256 gray levels, where 255 represents pure white and 0 pure black. For example, the original grayscale image of some two-dimensional image after grayscale conversion is (0, 100, 123; 215, 124, 165; 255, 65, 98). For ease of understanding, the original grayscale image of the two-dimensional image is denoted GrayP1 and the height of the picture H1.
In one possible implementation, the proportional scaling is performed with a scaling factor. Its purpose is to adjust the height H1 of the original grayscale image GrayP1, yielding the proportionally scaled grayscale image, denoted GrayP2 with height H2. Note that H2 is a preset value, and the scaling factor scale can be computed from H2 and H1 as scale = H2/H1. Once the scaling factor is determined, it can be used to resize the original grayscale image proportionally to a suitable size, so that the final target grayscale image can modify or construct the original audio and thereby yield the target audio. Optionally, the height H2 of the scaled grayscale image GrayP2 may be 2^N+1, with N a preset positive integer. H2 may be determined from the height of the target spectrogram of the target audio the user wants to generate, from the frequency domain data of the original spectrogram, from the device's screen size and/or resolution, or in other ways; this application does not limit this.
S202. Normalize the proportionally scaled grayscale image to obtain the target grayscale image of the two-dimensional image.
To normalize the proportionally scaled grayscale image, traverse all values of GrayP2, find its maximum max(GrayP2), and normalize all the data, obtaining the target grayscale image of the two-dimensional image. For ease of understanding it is denoted GrayP3, a matrix of grayscale data:
GrayP3 = GrayP2 / max(GrayP2).
For example, if GrayP2 is (20, 30, 40; 50, 60, 70; 80, 90, 100), then after normalization the target grayscale image GrayP3 is (0.2, 0.3, 0.4; 0.5, 0.6, 0.7; 0.8, 0.9, 1). After steps S201-S202 the grayscale data matrix GrayP3 of the two-dimensional image is obtained, with all values of GrayP3 between 0 and 1. Fig. 3 illustrates the effect of this image processing procedure: a color picture is converted to the original grayscale image of the two-dimensional image, scaled to obtain the proportionally scaled grayscale image, and then normalized to obtain the target grayscale image of the two-dimensional image.
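Steps S201-S202 can be sketched as below. Nearest-neighbor resampling is used as one possible proportional-scaling method (the patent does not fix the interpolation scheme), and the text's own 3x3 example is reproduced:

```python
import numpy as np

def scale_and_normalize(gray, target_h):
    """S201-S202 sketch: proportionally scale an original grayscale image
    GrayP1 to height target_h (nearest-neighbor resampling), then divide
    by the maximum so that all values of GrayP3 lie in [0, 1]."""
    h1, w1 = gray.shape
    scale = target_h / h1                      # scale = H2 / H1
    target_w = max(1, round(w1 * scale))       # keep the aspect ratio
    rows = np.minimum((np.arange(target_h) / scale).astype(int), h1 - 1)
    cols = np.minimum((np.arange(target_w) / scale).astype(int), w1 - 1)
    gray_p2 = gray[np.ix_(rows, cols)].astype(float)
    return gray_p2 / gray_p2.max()             # GrayP3 = GrayP2 / max(GrayP2)

# The text's example: (20, 30, 40; 50, 60, 70; 80, 90, 100) -> values in [0.2, 1]
gray_p3 = scale_and_normalize(np.array([[20, 30, 40],
                                        [50, 60, 70],
                                        [80, 90, 100]]), target_h=3)
```

With target_h equal to the original height the scaling is the identity and only the normalization acts, reproducing the GrayP3 matrix given in the text.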
In a possible implementation, histogram equalization may additionally be applied to the proportionally scaled grayscale image GrayP2 to enhance the contrast between data at different positions and improve picture quality. In concrete embodiments a library function can be called directly, such as the histeq function in MATLAB or the equalizeHist function in OpenCV. The histogram-equalized grayscale image can then be normalized to obtain the target grayscale image of the two-dimensional image.
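As a stand-in for MATLAB's histeq or OpenCV's equalizeHist, a minimal numpy histogram equalization might look like this; it is a sketch of the standard CDF-remapping idea, not the exact behavior of either library:

```python
import numpy as np

def equalize_hist(gray):
    """Map each 8-bit gray level through the normalized cumulative
    histogram (CDF), stretching a low-contrast image to the full range."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_masked = np.ma.masked_equal(cdf, 0)          # ignore empty low levels
    lut = (cdf_masked - cdf_masked.min()) * 255 / (cdf_masked.max() - cdf_masked.min())
    return np.ma.filled(lut, 0).astype(np.uint8)[gray]

# A low-contrast patch (levels 100..130) gets stretched to 0..255.
flat = np.arange(4, dtype=np.uint8).reshape(2, 2) * 10 + 100
stretched = equalize_hist(flat)
```

The stretched patch uses the whole 0-255 range, which is the contrast enhancement the text asks of this step.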
Note that if the two-dimensional image already satisfies the result criteria of steps S201-S202, those steps need not be applied and the two-dimensional image is used directly as the target grayscale image.
In one implementation, the two-dimensional image may include a plurality of two-dimensional images capturing changes in user actions, where the changes may be changes of the user's hand gestures, facial expressions, and so on, without limitation here. Acquiring the target grayscale image of the two-dimensional image may then include the following steps: computing the grayscale difference values between two-dimensional images with adjacent acquisition times among the plurality of images, obtaining a plurality of grayscale difference values; and arranging the plurality of grayscale difference values by their corresponding acquisition times to obtain the target grayscale image. The plurality of two-dimensional images may come from video captured in real time, from video stored on the audio generation device (such as a terminal) or other storage, from a burst of consecutively captured photos, and so on, without limitation here. A grayscale difference value may be the difference between the target grayscale images of two images captured at adjacent times. For example, a video contains multiple two-dimensional images; one image is captured at each of times t1, t2, and t3, giving three images P1, P2, and P3. Their target grayscale images are obtained per steps S201-S202, the grayscale difference between P1 and P2 and between P2 and P3 are computed, and the two differences are arranged in acquisition order, e.g. the P1-P2 difference before (to the left of) the P2-P3 difference, yielding the target grayscale image corresponding to the plurality of action-capturing two-dimensional images.
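The adjacent-frame differencing just described can be sketched as follows; laying the difference matrices out side by side in acquisition order is one plausible reading of "arranging by acquisition time":

```python
import numpy as np

def motion_grayscale(frames):
    """Sketch: given the target grayscale images of frames captured at
    successive times, take the difference of each adjacent pair and lay
    the differences out left to right in acquisition order."""
    diffs = [frames[i + 1] - frames[i] for i in range(len(frames) - 1)]
    return np.hstack(diffs)          # earlier differences sit further left

# Three frames captured at t1, t2, t3 yield two difference blocks.
p1 = np.full((2, 2), 0.1)
p2 = np.full((2, 2), 0.3)
p3 = np.full((2, 2), 0.6)
target = motion_grayscale([p1, p2, p3])
```

The result has one block per adjacent pair, so the spectrogram built from it later encodes how the scene changed between captures.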
S103. Convert the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram.
Embodiments of this application obtain the target spectrogram mainly in two ways. In one implementation, the original spectrogram of the original audio is modified based on the target grayscale image of the two-dimensional image, for example by using the target grayscale image, such as the grayscale data matrix GrayP3, as a weighting factor on the original spectrogram of the original audio to obtain the target spectrogram. In another implementation, the target spectrogram of the audio is constructed directly from the target grayscale image of the two-dimensional image, so that the target audio is obtained directly from the target grayscale image, for example by using the grayscale data matrix GrayP3 directly as frequency domain data to obtain the target spectrogram. Either way, modifying or constructing the spectrogram of the audio yields image-bearing audio, binding image and audio closely together and greatly strengthening their correlation.
S104. Use the target spectrogram to generate the target audio corresponding to the target spectrogram.
The target audio is the generated audio with embedded image information, such as the information of the above two-dimensional image. Optionally, using the target spectrogram to generate the corresponding target audio may include the following steps: obtain the time domain signal corresponding to each frame of frequency domain data of the target spectrogram; and obtain the target audio from the time domain signal corresponding to each frame. For example, each frame of frequency domain data of the target spectrogram may be flipped upside down and the complex conjugate of the flipped frequency domain data taken; an inverse Fourier transform is applied to each conjugated frame of frequency domain data to obtain the corresponding time domain signal, and the frames of time domain signal are synthesized into the target audio.
In a possible implementation, because the frequency domain data of the target spectrogram is conjugate-symmetric, when synthesizing the time domain signal from it, if each frame of the spectrogram has 2^N+1 data points it suffices to flip points 2 through 2^N/2 of the frequency domain data upside down and take the complex conjugate of the flipped data, N being a positive integer. For example, if each frame of frequency domain data of the target spectrogram has 1025 data points, only points 2 through 512 need to be flipped and conjugated. An inverse Fourier transform is then applied to each conjugated frame of frequency domain data to obtain its time domain signal, so that every frame of frequency domain data of the target spectrogram is converted into a time domain signal.
After the time domain signal corresponding to each frame of frequency domain data of the target spectrogram is obtained, the frames of time domain signal can be overlap-added at a certain overlap ratio to obtain the complete audio signal. To distinguish it from other audio, the audio represented by this signal is called the target audio. The target audio has image information embedded; the user can directly perceive the change the image information brings to the original audio, or the distinctive sound constituted by the image information alone. The process of step S104 is shown in Fig. 4: the target spectrogram consists of multiple frames of frequency domain data, each frame is converted into the corresponding time domain signal, and the frames of time domain signal are overlap-added into an audio signal.
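The synthesis of step S104 can be sketched with numpy. Note that np.fft.irfft internally performs the conjugate-symmetric completion that the text describes as flipping bins and taking the complex conjugate; the 50% overlap ratio is the one used in the later examples:

```python
import numpy as np

def spectrogram_to_audio(spec, hop_ratio=0.5):
    """S104 sketch: each column of `spec` holds the 2^N + 1 non-negative
    frequency bins of one frame. np.fft.irfft reconstructs the negative
    frequencies by conjugate symmetry (the flip-and-conjugate step in the
    text) before inverting; frames are then overlap-added."""
    n_bins, n_frames = spec.shape
    frame_len = 2 * (n_bins - 1)            # 2^N + 1 bins -> 2^(N+1) samples
    hop = int(frame_len * hop_ratio)
    audio = np.zeros(hop * (n_frames - 1) + frame_len)
    for k in range(n_frames):
        audio[k * hop:k * hop + frame_len] += np.fft.irfft(spec[:, k])
    return audio

# Three all-ones spectra (5 bins -> 8-sample frames): each frame inverts
# to a unit impulse, so impulses land at every hop position.
audio = spectrogram_to_audio(np.ones((5, 3)))
```

The toy input makes the overlap-add structure visible: one impulse per frame, spaced by the hop.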
In a possible implementation, after the target audio is obtained, an audio playback instruction input by the user is received; in response to the audio playback instruction, the target audio is played and, following its playback progress, the portion of the target spectrogram whose area corresponds to that progress is displayed. The embedded image is thus revealed bit by bit in step with the playback of the audio. For example, upon receiving a playback instruction for the target audio, the audio is played; when playback reaches time t1, the area of the target spectrogram between 0 and t1 is shown; when it reaches t2, the area between 0 and t2; and when playback finishes, the complete target spectrogram. Optionally, upon receiving a share instruction for the target audio, the target audio may be shared with a target object, which may be a contact or a functional module of an application, without limitation here.
Through the method shown in Fig. 1, target audio carrying image information is obtained. As the target audio plays, its target spectrogram is gradually revealed in step with the music, so the user can see the embedded image information directly, and the resulting target audio can also be shared with other users.
For example, in a music player, the user imports a picture a and an audio clip b from the terminal; after the processing of this embodiment, audio c with image a embedded is obtained. As c plays, its spectrogram is gradually revealed in step with the music, letting the user see the embedded image information directly.
As another example, in a music player the user shoots a video with dynamic changes using the terminal's camera; this embodiment extracts from the changing video multiple two-dimensional images representing the user's action changes and processes them to obtain audio d, which presents the sound effect produced by the dynamic changes.
This embodiment introduces the technical solution as a whole. This application derives audio from image information in two ways, differing mainly in how the target spectrogram is obtained: one obtains the audio by modifying a spectrogram with the target grayscale image, the other by constructing a spectrogram from the target grayscale image. Either way, the target audio obtained by modifying or constructing the spectrogram has image information embedded in it, tightly binding image information and audio, so that the image gains a sound-emitting function while the sound carries image information; that is, the audio's spectrogram contains the image information. The embodiments of this application thus achieve the purpose of embedding image information in audio, so that the image gains a sound-emitting function while the audio carries image information, greatly strengthening the correlation between audio and image, with a flexible and entertaining operating procedure.
Referring to Fig. 5, a schematic flowchart of another audio generation method provided by an embodiment of this application. As shown in Fig. 5, this audio generation method modifies the spectrogram of audio based on the target grayscale image of the two-dimensional image to obtain the target spectrogram and hence the target audio, and includes the following steps S501-S504.
S501. Receive an audio generation instruction input by a user and, in response to the audio generation instruction, acquire the target grayscale image of the two-dimensional image.
For this step see the related description of steps S101-S102; it is not repeated here.
In this embodiment, the spectrogram of the original audio is modified based on the target grayscale image of the two-dimensional image to obtain the target spectrogram, so when the original grayscale image of the two-dimensional image is proportionally scaled, its height may be scaled to equal the height of the original spectrogram.
S502. Receive an audio selection instruction input by the user and, in response to the audio selection instruction, acquire the original spectrogram corresponding to the original audio.
The audio selection instruction indicates the original audio needed to generate the target audio. Optionally, the original audio may be a locally stored audio file or an audio file temporarily downloaded from other storage equipment; its content may be music, conversation, noise, and so on; this application does not limit this.
In a concrete implementation, the process of obtaining the original spectrogram from the original audio may be as shown in Fig. 6. For example, the time domain signal of the original audio is split into frames, giving multiple frames of time domain signal, where the frame length is the duration of one frame and the frame shift is the time offset between the starts of adjacent, overlapping frames: if frame k starts at t and ends at t+E, and frame k+1 starts at t+L and ends at t+E+L, then the frame length is E and the frame shift is L. Each frame of time domain signal is windowed, the window length matching the frame length; the window function may be a Hann, rectangular, triangular, Hamming, or Gaussian window, among others. A fast Fourier transform (FFT) is applied to each windowed frame, giving multiple frames of frequency domain data, which are arranged as column vectors to form the original spectrogram. For example, the frequency domain data is laid out with frequency increasing from bottom to top and the columns placed horizontally in time order, giving the original spectrogram: its horizontal axis is time, its vertical axis frequency, the value at each coordinate point an energy value whose magnitude is shown by color depth.
Optionally, when applying the FFT to each windowed frame of time domain signal to obtain the frames of frequency domain data, making each windowed frame 2^K values long reduces the time complexity of the Fourier transform and thus improves its computational efficiency. Correspondingly, the frequency domain data of each frame of time domain signal has (2^K/2)+1 values, K being a positive integer. Equivalently, if each frame of time domain signal has 2^(N+1) values, the resulting frequency domain data of each frame has 2^N+1 values, N being an integer greater than or equal to 0.
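The sample-count relation above is easy to confirm in numpy: the FFT of a real frame of 2^(N+1) samples keeps 2^N+1 independent frequency values, matching the 1025-point frames in the later examples:

```python
import numpy as np

# A real frame of 2^(N+1) samples (here N = 10, so 2048 samples) yields
# 2^N + 1 = 1025 independent frequency bins; the rest are conjugates.
frame = np.random.default_rng(0).standard_normal(2 ** 11)
bins = np.fft.rfft(frame)
```

This is why a 1025-row grayscale matrix pairs naturally with 2048-sample frames in the worked examples below.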
S503. Process the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram.
The grayscale data of the pixels of the target grayscale image can be represented by a grayscale data matrix, in which each value represents the value of the pixel at the corresponding position of the target grayscale image.
In a possible implementation, processing the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram may include the following operations: flip the grayscale data matrix upside down; use the flipped grayscale data matrix as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram, obtaining the target spectrogram.
In a concrete implementation, flipping upside down means flipping the grayscale data matrix along the Y-axis direction. For example, the grayscale data matrix (0.1, 0.2, 0.3; 0.4, 0.5, 0.6; 0.7, 0.8, 0.9) becomes (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3) after flipping.
Optionally, weighting the frequency domain data of each pixel in the original spectrogram with the weighting factor can weight all the frequency domain data; but since the frequency domain data of the original spectrogram is conjugate-symmetric, if the frequency domain data has 2^N+1 points it suffices to weight points 2 through 2^N/2+1 to achieve the effect of weighting all the data. The resulting target spectrogram is shown in Fig. 7a: the part inside the dashed box is the embedded two-dimensional image, and outside it is the frequency domain data of the original spectrogram, whose horizontal axis is time, vertical axis frequency, and color depth the magnitude of the energy value at each coordinate point. The embedded two-dimensional image is as tall as the original spectrogram, because step S501 scaled the height of the original grayscale image to equal that of the original spectrogram.
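The whole-spectrogram weighting of step S503 can be sketched as follows; the flip accounts for image row 0 being at the top while spectrogram row 0 is the lowest frequency:

```python
import numpy as np

def weight_spectrogram(orig_spec, gray_p3):
    """S503 sketch: flip the grayscale matrix upside down and use it as a
    per-pixel weighting factor on the original spectrogram. Shapes must
    match, which is what the height scaling of S501 guarantees."""
    assert orig_spec.shape == gray_p3.shape
    return orig_spec * np.flipud(gray_p3)

# With an all-ones "spectrogram", the result is just the flipped image,
# making the row flip easy to see.
gray = np.array([[0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6],
                 [0.7, 0.8, 0.9]])
target = weight_spectrogram(np.ones((3, 3)), gray)
```

In a real pipeline orig_spec would hold complex per-frame FFT data, and the same elementwise multiply would scale each bin's magnitude.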
Optionally, the flipped grayscale data matrix can be downsampled to reduce its size; the downsampled grayscale data matrix is then used as the weighting factor to weight part of the frequency domain data of the original spectrogram, obtaining the target spectrogram and thereby embedding the two-dimensional image in a local region of the original spectrogram. For example, if the frequency domain data has 2^N+1 points and the grayscale data matrix is 2^N+1 pixels high, downsampling the matrix by a factor of 1/2 makes its height 2^N/2+1; points M through M+2^N/2+1 of the frequency domain data can then be weighted, so that in the resulting target spectrogram only points M through M+2^N/2+1 of the frequency domain data carry image information, M and N being positive integers. The target spectrogram obtained in this step is shown in Fig. 7b: the dashed box encloses the embedded image, and outside it is the frequency domain data of the original spectrogram, whose horizontal axis is time, vertical axis frequency, and color depth the magnitude of the energy value at each coordinate point. The embedded two-dimensional image is not as tall as the original spectrogram and occupies only a local region of it. If the weighting factor is scaled even smaller, the embedded information affects the original audio very little after weighting; the synthesized target audio is then largely indistinguishable from the original audio, so the image information can be embedded covertly in the target audio.
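Local embedding with a downsampled weight matrix might be sketched like this; keeping every other row and column is one simple choice of 1/2 downsampling, and `start_row` plays the role of M in the text:

```python
import numpy as np

def embed_locally(orig_spec, weights, start_row):
    """Sketch of local embedding: downsample the weight matrix by 1/2
    (keep every other row and column), then weight only the band of
    frequency rows starting at start_row; the rest stays untouched."""
    small = weights[::2, ::2]                 # downsampling factor 1/2
    spec = orig_spec.astype(float).copy()
    h, w = small.shape
    spec[start_row:start_row + h, :w] *= small
    return spec

# Zero weights make the touched band obvious against the untouched ones.
spec = embed_locally(np.ones((8, 6)), np.zeros((4, 4)), start_row=2)
```

Only the band of rows starting at start_row is modified, which is the "local region" of Fig. 7b.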
S504. Use the target spectrogram to generate the target audio corresponding to the target spectrogram.
For this step see the description of step S104. When synthesizing audio from the per-frame time domain signals, the overlap ratio can be determined from the frame shift and frame length used in the framing of step S502, for example as the ratio of frame shift to frame length in the framing, so that the unweighted parts of the original spectrogram are also synthesized into audio. For instance, with a frame length of 2W and a frame shift of W in the framing, the overlap ratio should be W/2W, i.e. 50%. Overlap-adding every frame of time domain signal yields the complete audio signal, which is the target audio.
When obtaining the target spectrogram, this embodiment uses the grayscale data matrix as a weighting factor to weight the frequency domain data of the original spectrogram, obtaining the target spectrogram; each frame of frequency domain data of the target spectrogram is Fourier-transformed into a time domain signal, and the time domain signals are overlap-added to finally obtain the target audio; that is, the audio is obtained by modifying the original spectrogram. Obtaining the target audio by modifying the original spectrogram can thus embed the image in the audio, so that the image gains a sound-emitting function while the audio carries image information, greatly strengthening the correlation between audio and image.
The method described in this embodiment is illustrated below, applied in a music player to create an image and modify an original spectrogram into new audio. The music player here includes, without limitation, mobile and desktop versions. The player provides a temporary creation area in which the user creates content, which is saved in a picture format; at the same time the user selects the audio file to modify. The created image is processed per step S501 to obtain the target grayscale image, whose height is scaled to 2^10+1 pixels so as to match the height of the original spectrogram. Meanwhile the original spectrogram of the audio file is obtained per step S502: when framing the original audio, the frame length is 30 ms and the frame shift 15 ms; when windowing, the window is a Hann window matching the 30 ms frame length. The grayscale data matrix and the original spectrogram are processed per step S503: each frame of frequency domain data of the original spectrogram has 1025 data points, so weighting only points 2 through 513 of each frame achieves weighting of the entire frequency domain data, giving the target spectrogram. Points 2 through 512 of each frame of frequency domain data of the target spectrogram are flipped upside down and the complex conjugate of the flipped data taken; an inverse Fourier transform of each conjugated frame gives the corresponding time domain signal; the frames of time domain signal are then synthesized into the target audio at the overlap ratio, which is the ratio of frame shift to frame length, 15 ms/30 ms, i.e. 50%. The final target audio file contains the content created in the creation area; the height of the target spectrogram of the target audio equals the height of the target grayscale image of the embedded two-dimensional image. Viewing the obtained target spectrogram in audio software (example renderings in Figs. 8a and 8b), one can see that the two-dimensional image is part of the target spectrogram and that, along the frequency axis, the image is as tall as the target spectrogram; the magnitudes of the target spectrogram's energy values correspond to the grayscale data of the pixels of the two-dimensional image's target grayscale image. The generated target audio can also be shared with other users, so that the audio effect of the embedded image can be enjoyed with friends.
As another example, in a music player the user selects the image to embed in the audio and the original audio file to modify. The image is processed per step S501 to obtain the target grayscale image, whose height is scaled to 2^10+1 pixels; the original spectrogram of the original audio file is obtained per step S502, with a frame length of 40 ms, a frame shift of 20 ms, and, when windowing, a Hann window matching the 40 ms frame length. The grayscale data matrix and the original spectrogram are processed per step S503: if the original grayscale data matrix is 1025*1025, after downsampling it becomes 513*513; each frame of frequency domain data of the original spectrogram has 1025 data points, and only part of the frequency domain data is weighted. For example, with the downsampled 513*513 matrix, points 100 through 612 of the frequency domain data can be weighted, giving a target spectrogram in which only points 100 through 612 of the frequency domain data carry image information; any other contiguous run of frequency domain data could be used instead, such as points 200 through 712, or 313 through 825. The target spectrogram is then processed per step S504: because real signals are conjugate-symmetric, points 2 through 512 of each frame of frequency domain data of the target spectrogram are flipped upside down and the complex conjugate of the flipped data taken; an inverse Fourier transform of each conjugated frame gives the corresponding time domain signal; the frames of time domain signal are then synthesized into the target audio at the overlap ratio, the ratio of frame shift to frame length, 20 ms/40 ms, i.e. 50%. The final target audio file contains the information of the imported image; the height of the target spectrogram of the target audio differs from the height of the embedded image. Viewing the obtained target spectrogram in audio software, one can see that the image is part of the target spectrogram and that, along the frequency axis, its height occupies only part of the spectrogram's height; the magnitudes of the target spectrogram's energy values correspond to the grayscale data of the image's pixels. The generated target audio can also be shared with other users, so that the audio effect of the embedded image can be enjoyed with friends.
As another example, the method of this embodiment can take multiple two-dimensional images (such as several two-dimensional images from one video, or multiple hand-gesture images captured in real time) as the two-dimensional images to embed in the original audio. Specifically, the grayscale difference values between two-dimensional images with adjacent acquisition times among the plurality of images are computed, giving multiple grayscale difference values; these are arranged by their corresponding acquisition times to form the target grayscale image; the frequency domain data of each pixel in the original spectrogram corresponding to the original audio is then processed with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram. For example, with three captured two-dimensional images, their target grayscale images M1, M2, M3 are obtained per step S102; differencing the target grayscale images of images captured at adjacent times gives two grayscale difference values, M2-M1 and M3-M2, which are arranged in time order to form the target grayscale image corresponding to the multiple two-dimensional images. The original spectrogram of an original audio is then obtained per step S502, the target grayscale image is used per step S503 as a weighting factor to weight the frequency domain data of the original spectrogram, giving the target spectrogram, and the target audio is derived from the target spectrogram. In this way the original audio can be modified by multiple two-dimensional images, so that it carries the change information of the images in the video.
Referring to Fig. 9, a schematic flowchart of yet another audio generation method provided by an embodiment of this application. As shown in Fig. 9, this audio generation method constructs the target spectrogram of the audio from the target grayscale image of the two-dimensional image and thereby obtains the target audio, and includes the following steps S901-S903.
S901. Receive an audio generation instruction input by a user and, in response to the audio generation instruction, acquire the target grayscale image of the two-dimensional image.
For this step see the related description of steps S101-S102 above; it is not repeated here.
S902. Flip the grayscale data matrix upside down and use the flipped grayscale data matrix as the frequency domain data of each pixel in a spectrogram to obtain the target spectrogram.
Flipping the grayscale data matrix upside down means flipping it along the Y-axis direction. For example, the grayscale data matrix (0.1, 0.2, 0.3; 0.4, 0.5, 0.6; 0.7, 0.8, 0.9) becomes (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3) after flipping.
In one implementation, the flipped grayscale data matrix is used as the frequency domain data of each pixel of the target spectrogram; in other words, the values of the grayscale data matrix become the pixel data at the corresponding positions of the target spectrogram, i.e. the energy value of each pixel of the target spectrogram. The energy value can be represented by color in the target spectrogram, for example with color depth representing different energy magnitudes, or with different hues, without limitation here. Optionally, when using the grayscale data matrix as frequency domain data, larger matrix values may map to larger energy values of the target spectrogram. For example, suppose that in the resulting target spectrogram a larger energy value means a darker color, and the grayscale data matrix GrayP3 is (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3); after 0.9 becomes the frequency domain data at its position, the corresponding energy value exceeds the energy of any gray value smaller than 0.9 converted to frequency domain data, so in the resulting target spectrogram the position of 0.9 is rendered darker than the positions of the other data, and the embedded two-dimensional image appears in the target spectrogram through this dark-light relationship. Alternatively, when using the grayscale data matrix as frequency domain data, smaller matrix values may map to larger energy. For example, again supposing darker means more energetic, with GrayP3 equal to (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3), applying the formula 1-GrayP3 gives (0.3, 0.2, 0.1; 0.6, 0.5, 0.4; 0.9, 0.8, 0.7); after 0.9 becomes the frequency domain data at its position, the corresponding energy value is smaller than that of data smaller than 0.9, so in the resulting target spectrogram the position of 0.9 is rendered lighter than the other positions, and the embedded two-dimensional image again appears through the dark-light relationship.
Optionally, a scale factor can be used to adjust the magnitude of the grayscale data matrix and thereby the energy of the resulting target spectrogram: for example, with the flipped grayscale data matrix GrayP3 equal to (0.7, 0.8, 0.9; 0.4, 0.5, 0.6; 0.1, 0.2, 0.3) and a scale factor of 1.1, the grayscale data matrix becomes (0.77, 0.88, 0.99; 0.44, 0.55, 0.66; 0.11, 0.22, 0.33).
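Step S902, including the optional 1 - GrayP3 mapping and the scale factor, can be sketched as:

```python
import numpy as np

def build_spectrogram(gray_p3, gain=1.1, invert=False):
    """S902 sketch: use the flipped grayscale matrix directly as the
    frequency domain data of the target spectrogram. `gain` is the scale
    factor from the text; invert=True maps values through 1 - GrayP3 so
    that smaller gray values get the larger energy."""
    data = 1.0 - gray_p3 if invert else gray_p3
    return np.flipud(data) * gain

# The text's example matrix and scale factor 1.1.
gray = np.array([[0.7, 0.8, 0.9],
                 [0.4, 0.5, 0.6],
                 [0.1, 0.2, 0.3]])
spec = build_spectrogram(gray)
```

Feeding the result to a spectrogram-to-audio synthesis (as in step S903) then sonifies the image directly, with no original audio involved.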
S903. Use the target spectrogram to generate the target audio corresponding to the target spectrogram.
For this step see the description of step S104. Since in this embodiment the target spectrogram is obtained by using the grayscale data matrix directly as the frequency domain data of the target spectrogram, rather than by weighting an original spectrogram with the grayscale data matrix, when overlap-adding the per-frame time domain signals it suffices to pick an overlap ratio in the range 0-100% (exclusive of 100%) to obtain the complete audio signal, which is the target audio.
When obtaining the target spectrogram, this embodiment uses the grayscale data matrix of the target grayscale image as the frequency domain data, obtaining the target spectrogram; each frame of frequency domain data of the target spectrogram is Fourier-transformed into a time domain signal, and the time domain signals are overlap-added to finally obtain the target audio file; that is, the target audio is obtained by constructing the target spectrogram. If the embedded two-dimensional image is a plurality of two-dimensional images capturing user-action changes, the sound effect produced by the changing features of those images is obtained. Obtaining the target audio by constructing the spectrogram thus achieves the purpose of embedding image information in audio, so that the image gains a sound-emitting function while the audio carries image information, greatly strengthening the correlation between audio and image.
The implementation is illustrated below with the method of this embodiment applied in a music player, the embedded images being the continuously changing hand-gesture images of a video stream. In the player, the user films a fixed-camera scene and waves a hand freely in front of the camera, so the video stream contains multiple gesture images. A first and a second gesture image are captured 100 ms apart and processed per step S201 to obtain their target grayscale images; the difference between the target grayscale images of the first and second gesture images is computed, and the target grayscale image corresponding to the gesture images determined from that grayscale difference. For example, if the first gesture image's grayscale data matrix is (0.1, 0.2, 0.3; 0.4, 0.5, 0.6; 0.7, 0.8, 0.9) and the second's is (0.11, 0.23, 0.34; 0.48, 0.56, 0.64; 0.78, 0.89, 0.92), the grayscale difference is (0.01, 0.03, 0.04; 0.08, 0.06, 0.04; 0.08, 0.09, 0.02). The grayscale data matrix is flipped upside down and the flipped matrix used as the frequency domain data of the target spectrogram, obtaining the target spectrogram; here the mapping in which larger matrix values mean larger spectrogram energy is chosen, and a scale factor of 1.1 adjusts the matrix magnitude, giving (0.011, 0.033, 0.044; 0.088, 0.066, 0.044; 0.088, 0.099, 0.022). Adjusting the grayscale data matrix thus adjusts the energy values of the target spectrogram. The target spectrogram is processed per step S903 and each frame of time domain signal concatenated at an overlap ratio of 60%, obtaining the target audio.
Optionally, the above operations can be performed repeatedly on the video stream, so that multiple gesture changes can be heard in the constructed target audio. For example, if the video stream contains multiple gesture images, each captured 100 ms apart, and step S201 yields grayscale data matrices T1, T2, T3, T4, then the differences T2-T1=T12, T3-T2=T23, T4-T3=T34 are produced; arranging T12, T23, T34 in time order and mapping them to the target spectrogram synthesizes a continuous piece of audio produced by the gesture changes. Audio obtained this way reflects the sound effect brought by the changes of the dynamic images in the video, and the generated audio can be shared with other users, so that the curious sound effects of the dynamic changes can be enjoyed with friends.
It will be understood that the above method embodiments all exemplify the audio generation method of this application; the description of each embodiment has its own emphasis, and for parts not detailed in one embodiment, reference may be made to the related descriptions of the other embodiments.
Based on the description of the above audio generation method embodiments, an embodiment of the present invention further discloses an audio generation apparatus. Optionally, the audio generation apparatus may be a computer program (comprising program code/program instructions) running on an audio generation device such as a terminal. For example, the audio generation apparatus may perform the methods of Figs. 1, 5, and 9. Referring to Fig. 10, the audio generation apparatus may run the following modules:
an acquisition module 1001, configured to receive an audio generation instruction input by a user, the audio generation instruction indicating the two-dimensional image the user wants to embed in the generated target audio;
the acquisition module 1001 being further configured to acquire, in response to the audio generation instruction, the target grayscale image of the two-dimensional image;
a processing module 1002, configured to convert the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram;
the processing module 1002 being further configured to use the target spectrogram to generate the target audio corresponding to the target spectrogram.
In one implementation, the processing module 1002 is further configured to receive an audio selection instruction input by the user, the audio selection instruction indicating the original audio needed to generate the target audio, and to acquire, in response to the audio selection instruction, the original spectrogram corresponding to the original audio; when converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain the target spectrogram, it may specifically be configured to: process the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the grayscale image to obtain the target spectrogram.
In another implementation, the grayscale data of the pixels is a grayscale data matrix, and when processing the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram, the processing module 1002 is specifically configured to: flip the grayscale data matrix upside down; and use the flipped grayscale data matrix as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram, obtaining the target spectrogram.
In another implementation, the grayscale data of the pixels is a grayscale data matrix, and when processing the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram, the processing module 1002 is specifically configured to: flip the grayscale data matrix upside down and downsample the flipped matrix; and use the downsampled grayscale data matrix as a weighting factor to weight part of the frequency domain data of the original spectrogram, obtaining the target spectrogram.
In another implementation, the grayscale data of the pixels is a grayscale data matrix, and when converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain the target spectrogram, the processing module 1002 is specifically configured to: flip the grayscale data matrix upside down and use the flipped grayscale data matrix as the frequency domain data of each pixel in a spectrogram, obtaining the target spectrogram.
In another implementation, when using the target spectrogram to generate the corresponding target audio, the processing module 1002 is specifically configured to: flip each frame of frequency domain data of the target spectrogram upside down and take the complex conjugate of the flipped frequency domain data; apply an inverse Fourier transform to each conjugated frame of frequency domain data to obtain the corresponding time domain signal; and synthesize the frames of time domain signal into the target audio.
In another implementation, when acquiring the target grayscale image of the two-dimensional image, the processing module 1002 is specifically configured to: acquire the original grayscale image of the two-dimensional image and scale it proportionally, obtaining a proportionally scaled grayscale image; and normalize the proportionally scaled grayscale image to obtain the target grayscale image of the two-dimensional image.
In another implementation, the two-dimensional image includes a plurality of two-dimensional images capturing changes in user actions; when acquiring the grayscale image of the two-dimensional image, the processing module 1002 is specifically configured to: compute the grayscale difference values between two-dimensional images with adjacent acquisition times among the plurality of images, obtaining multiple grayscale difference values; and arrange the grayscale difference values by their corresponding acquisition times to obtain the target grayscale image.
In another implementation, the processing module 1002 is further configured to: receive an audio playback instruction input by the user; and in response to the audio playback instruction, play the target audio and, following its playback progress, display the portion of the target spectrogram whose area corresponds to that progress.
According to an embodiment of the present invention, each step of the methods shown in Figs. 1, 5, and 9 may be performed by a module of the audio generation apparatus shown in Fig. 10. For example, steps S101 and S102 of Fig. 1 may be performed by the acquisition module 1001 of Fig. 10, and steps S103 and S104 by the processing module 1002 of Fig. 10.
According to another embodiment of the present invention, the modules of the audio generation apparatus of Fig. 10 may be merged, separately or wholly, into one or more other modules, or one or more of them may be further split into functionally smaller modules, achieving the same operations without affecting the realization of the technical effects of the embodiments of the invention. The above modules are divided by logical function; in practical applications, the function of one module may also be realized by multiple modules, or the functions of multiple modules by one module. In other embodiments of the invention, the audio generation apparatus may also include other modules, and in practical applications these functions may be realized with the assistance of other modules and through the cooperation of multiple modules.
Upon receiving an audio generation instruction, the embodiments of this application can, in response to that instruction, acquire the target grayscale image of the two-dimensional image that the user wants to embed in the generated target audio, and convert the grayscale data of each pixel in that target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; that is, the two-dimensional image is associated with the target spectrogram of the target audio. The target spectrogram is then used to generate the corresponding target audio, so that the target audio is generated from the two-dimensional image. The embodiments of this application thus achieve the purpose of embedding image information in audio, so that the image gains a sound-emitting function while the audio carries image information, greatly strengthening the correlation between audio and image.
Based on the descriptions of the above method embodiments and apparatus embodiments, an embodiment of the present invention further provides an audio generation device. Referring to Fig. 11, the device includes at least a processor 1101 and a memory 1102 connected to each other. Optionally, the audio generation device may further include an input device 1103 and/or an output device 1104; the processor 1101, input device 1103, output device 1104, and memory 1102 may be connected by a bus or in other ways.
The memory 1102 may store a computer program (or store a computer (readable) storage medium containing a computer program) comprising program instructions that the processor 1101 is configured to invoke. The processor 1101 (or CPU, Central Processing Unit) is the computing and control core of the device, configured to invoke the program instructions and specifically suited to load and execute them so as to implement the above method flows or the corresponding functions. The input device 1103 may include one or more of a keyboard, touchscreen, radio-frequency receiver, or other input devices; the output device 1104 may include a display screen and may further include one or more of a loudspeaker, radio-frequency transmitter, or other output devices. Optionally, the device may further include a memory module, a power module, an application client, and so on.
For example, in one embodiment the processor 1101 of an embodiment of the present invention may be used to perform a series of audio generation processing, including: receiving an audio generation instruction input by a user, the audio generation instruction indicating the two-dimensional image the user wants to embed in the generated target audio; in response to the audio generation instruction, acquiring the grayscale image of the two-dimensional image; converting the grayscale data of each pixel in the grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; using the target spectrogram to generate the target audio corresponding to the target spectrogram; and so on. For details see the descriptions of the above embodiments; they are not repeated here.
An embodiment of the present invention further provides a computer (readable) storage medium, which may be a memory device in a device for storing programs and data. It will be understood that the computer storage medium here may include both a built-in storage medium of the device and an extended storage medium the device supports. The computer storage medium provides storage space storing the operating system of the audio generation device such as a terminal; the storage space also holds program instructions suited to be loaded and executed by the processor 1101, which may be one or more computer programs (comprising program code). The computer storage medium here may be a high-speed RAM memory or a non-volatile memory, for example at least one magnetic-disk memory; optionally it may also be at least one computer storage medium located remotely from the aforementioned processor 1101.
In one embodiment, the program instructions in the computer storage medium may be loaded and executed by the processor 1101 to implement the corresponding steps of the methods in the above embodiments; in concrete implementation, the program instructions in the computer storage medium are loaded by the processor 1101 to execute the following steps:
receiving an audio generation instruction input by a user, the audio generation instruction indicating the two-dimensional image the user wants to embed in the generated target audio;
in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image;
converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; and
using the target spectrogram to generate the target audio corresponding to the target spectrogram.
In one implementation, the program instructions may further be loaded and executed by the processor 1101 to: receive an audio selection instruction input by the user, the audio selection instruction indicating the original audio needed to generate the target audio, and in response to the audio selection instruction, acquire the original spectrogram corresponding to the original audio; when converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain the target spectrogram, the program instructions may be loaded by the processor 1101 to specifically execute: processing the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram.
In another implementation, the grayscale data of the pixels is a grayscale data matrix; when processing the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram, the program instructions may be loaded by the processor 1101 to specifically execute: flipping the grayscale data matrix upside down; and using the flipped grayscale data matrix as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram, obtaining the target spectrogram.
In another implementation, the grayscale data of the pixels is a grayscale data matrix; when processing the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram, the program instructions may be loaded by the processor 1101 to specifically execute: flipping the grayscale data matrix upside down and downsampling the flipped grayscale data matrix; and using the downsampled grayscale data matrix as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram, obtaining the target spectrogram.
In another implementation, the grayscale data of the pixels is a grayscale data matrix; when converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain the target spectrogram, the program instructions may be loaded by the processor 1101 to specifically execute: flipping the grayscale data matrix upside down and using the flipped grayscale data matrix as the frequency domain data of each pixel in a spectrogram, obtaining the target spectrogram.
In another implementation, when using the target spectrogram to generate the corresponding target audio, the program instructions may be loaded by the processor 1101 to specifically execute: flipping each frame of frequency domain data of the target spectrogram upside down and taking the complex conjugate of the flipped frequency domain data; applying an inverse Fourier transform to each conjugated frame of frequency domain data to obtain the corresponding time domain signal; and synthesizing the frames of time domain signal into the target audio.
In another implementation, when acquiring the target grayscale image of the two-dimensional image, the program instructions may be loaded by the processor 1101 to specifically execute: acquiring the original grayscale image of the two-dimensional image and scaling it proportionally, obtaining a proportionally scaled grayscale image; and normalizing the proportionally scaled grayscale image to obtain the target grayscale image of the two-dimensional image.
In another implementation, the two-dimensional image includes a plurality of two-dimensional images capturing changes in user actions; when acquiring the target grayscale image of the two-dimensional image, the program instructions may be loaded by the processor 1101 to specifically execute: computing the grayscale difference values between two-dimensional images with adjacent acquisition times among the plurality of images, obtaining multiple grayscale difference values; and arranging the grayscale difference values by their corresponding acquisition times to obtain the target grayscale image.
In another implementation, the program instructions may be loaded by the processor 1101 to specifically execute: receiving an audio playback instruction input by the user; and in response to the audio playback instruction, playing the target audio and, following its playback progress, displaying the portion of the target spectrogram whose area corresponds to that progress.
Upon receiving an audio generation instruction, the embodiments of this application can, in response to that instruction, acquire the target grayscale image of the two-dimensional image that the user wants to embed in the generated target audio, and convert the grayscale data of each pixel in that target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; that is, the two-dimensional image is associated with the target spectrogram of the target audio. The target spectrogram is then used to generate the corresponding target audio, so that the target audio is generated from the two-dimensional image. The embodiments of this application thus achieve the purpose of embedding image information in audio, so that the image gains a sound-emitting function while the audio carries image information, greatly strengthening the correlation between audio and image.
It will be understood that, for the specific working processes of the audio generation device and apparatus described above, reference may be made to the related descriptions in the foregoing embodiments; they are not repeated here.
A person of ordinary skill in the art will understand that all or part of the flows of the above method embodiments can be accomplished by a computer program instructing the relevant hardware. The program may be stored in a computer storage medium, which may be a computer-readable storage medium, and when executed may include the flows of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is only a part of the embodiments of this application and cannot be used to limit the scope of its claims. A person of ordinary skill in the art will understand all or part of the flows implementing the above embodiments, and equivalent changes made according to the claims of this application still fall within the scope covered by the invention.

Claims (10)

  1. An audio generation method, characterized by comprising:
    receiving an audio generation instruction input by a user, the audio generation instruction indicating a two-dimensional image that the user wants to embed in the generated target audio;
    in response to the audio generation instruction, acquiring a target grayscale image of the two-dimensional image;
    converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram; and
    using the target spectrogram to generate the target audio corresponding to the target spectrogram.
  2. The method according to claim 1, characterized in that the method further comprises:
    receiving an audio selection instruction input by the user, the audio selection instruction indicating the original audio needed to generate the target audio, and in response to the audio selection instruction, acquiring an original spectrogram corresponding to the original audio;
    the converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram comprising:
    processing the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram.
  3. The method according to claim 2, characterized in that the grayscale data of the pixels is a grayscale data matrix, and the processing the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram comprises:
    flipping the grayscale data matrix upside down; and
    using the flipped grayscale data matrix as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram, obtaining the target spectrogram.
  4. The method according to claim 2, characterized in that the grayscale data of the pixels is a grayscale data matrix, and the processing the frequency domain data of each pixel in the original spectrogram with the grayscale data of each pixel in the target grayscale image to obtain the target spectrogram comprises:
    flipping the grayscale data matrix upside down and downsampling the flipped grayscale data matrix; and
    using the downsampled grayscale data matrix as a weighting factor to weight the frequency domain data of each pixel in the original spectrogram, obtaining the target spectrogram.
  5. The method according to claim 1, characterized in that the grayscale data of the pixels is a grayscale data matrix, and the converting the grayscale data of each pixel in the target grayscale image into the frequency domain data of each pixel in a spectrogram to obtain a target spectrogram comprises:
    flipping the grayscale data matrix upside down, and using the flipped grayscale data matrix as the frequency domain data of each pixel in a spectrogram, obtaining the target spectrogram.
  6. The method according to any one of claims 1-5, characterized in that the using the target spectrogram to generate the target audio corresponding to the target spectrogram comprises:
    flipping each frame of frequency domain data of the target spectrogram upside down, and taking the complex conjugate of the flipped frequency domain data; and
    applying an inverse Fourier transform to each conjugated frame of frequency domain data to obtain the time domain signal corresponding to each frame, and synthesizing the frames of time domain signal into the target audio.
  7. The method according to any one of claims 1-5, characterized in that the acquiring a target grayscale image of the two-dimensional image comprises:
    acquiring an original grayscale image of the two-dimensional image and scaling the original grayscale image proportionally, obtaining a proportionally scaled grayscale image; and
    normalizing the proportionally scaled grayscale image to obtain the target grayscale image of the two-dimensional image.
  8. The method according to any one of claims 1-5, characterized in that the two-dimensional image includes a plurality of two-dimensional images used to capture changes in user actions, and the acquiring a target grayscale image of the two-dimensional image comprises:
    computing the grayscale difference values between two-dimensional images with adjacent acquisition times among the plurality of two-dimensional images, obtaining a plurality of grayscale difference values; and
    arranging the plurality of grayscale difference values according to their corresponding acquisition times to obtain the target grayscale image.
  9. The method according to any one of claims 1-5, characterized by further comprising:
    receiving an audio playback instruction input by the user; and
    in response to the audio playback instruction, playing the target audio and, following the playback progress of the target audio, displaying the portion of the target spectrogram whose area corresponds to that playback progress.
  10. An audio generation device, characterized in that the device comprises:
    a processor, a memory, and an input device, wherein the memory stores a computer program comprising program instructions, and the processor is configured to invoke the program instructions and perform the method according to any one of claims 1-9.
PCT/CN2021/138568 2021-02-27 2021-12-15 Audio generation method and device WO2022179264A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/238,184 US20230402054A1 (en) 2021-02-27 2023-08-25 Audio generation method, audio generation device, and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110221372.7 2021-02-27
CN202110221372.7A CN112863481B (zh) 2021-02-27 Audio generation method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/238,184 Continuation US20230402054A1 (en) 2021-02-27 2023-08-25 Audio generation method, audio generation device, and storage medium

Publications (1)

Publication Number Publication Date
WO2022179264A1 true WO2022179264A1 (zh) 2022-09-01

Family

ID=75990375

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/138568 WO2022179264A1 (zh) 2021-02-27 2021-12-15 一种音频生成方法及设备

Country Status (3)

Country Link
US (1) US20230402054A1 (zh)
CN (1) CN112863481B (zh)
WO (1) WO2022179264A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115188364A (zh) * 2022-09-13 2022-10-14 南开大学 基于卷积网络和编码器解码器模型的多语种语音合成方法
CN115470507A (zh) * 2022-10-31 2022-12-13 青岛他坦科技服务有限公司 一种中小企业研发项目数据管理方法

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112863481B (zh) * 2021-02-27 2023-11-03 腾讯音乐娱乐科技(深圳)有限公司 一种音频生成方法及设备
CN114338622A (zh) * 2021-12-28 2022-04-12 歌尔光学科技有限公司 一种音频传输方法、音频播放方法、存储介质及相关设备

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971366A (zh) * 2017-02-08 2017-07-21 北京印刷学院 一种在音频信号中加入及提取水印的方法
CN108615006A (zh) * 2018-04-23 2018-10-02 百度在线网络技术(北京)有限公司 用于输出信息的方法和装置
US20190348062A1 (en) * 2018-05-08 2019-11-14 Gyrfalcon Technology Inc. System and method for encoding data using time shift in an audio/image recognition integrated circuit solution
CN111312287A (zh) * 2020-02-21 2020-06-19 腾讯音乐娱乐科技(深圳)有限公司 一种音频信息的检测方法、装置及存储介质
CN111862932A (zh) * 2020-07-02 2020-10-30 北京科技大学 一种将图像转化为声音的可穿戴助盲系统及方法
CN112188115A (zh) * 2020-09-29 2021-01-05 咪咕文化科技有限公司 一种图像处理方法、电子设备及存储介质
CN112863481A (zh) * 2021-02-27 2021-05-28 腾讯音乐娱乐科技(深圳)有限公司 一种音频生成方法及设备

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE10214431B4 (de) * 2002-03-30 2005-11-10 Ralf Dringenberg Verfahren und Vorrichtung zur Visualisierung von Audiodaten
US20040260540A1 (en) * 2003-06-20 2004-12-23 Tong Zhang System and method for spectrogram analysis of an audio signal
CN112151048B (zh) * 2019-06-11 2024-04-02 李庆成 音视图数据生成以及处理的方法
CN111048071B (zh) * 2019-11-11 2023-05-30 京东科技信息技术有限公司 语音数据处理方法、装置、计算机设备和存储介质
CN111489762B (zh) * 2020-05-13 2023-06-16 广州国音智能科技有限公司 三维语谱图生成方法、装置、终端及存储介质

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971366A (zh) * 2017-02-08 2017-07-21 北京印刷学院 一种在音频信号中加入及提取水印的方法
CN108615006A (zh) * 2018-04-23 2018-10-02 百度在线网络技术(北京)有限公司 用于输出信息的方法和装置
US20190348062A1 (en) * 2018-05-08 2019-11-14 Gyrfalcon Technology Inc. System and method for encoding data using time shift in an audio/image recognition integrated circuit solution
CN111312287A (zh) * 2020-02-21 2020-06-19 腾讯音乐娱乐科技(深圳)有限公司 一种音频信息的检测方法、装置及存储介质
CN111862932A (zh) * 2020-07-02 2020-10-30 北京科技大学 一种将图像转化为声音的可穿戴助盲系统及方法
CN112188115A (zh) * 2020-09-29 2021-01-05 咪咕文化科技有限公司 一种图像处理方法、电子设备及存储介质
CN112863481A (zh) * 2021-02-27 2021-05-28 腾讯音乐娱乐科技(深圳)有限公司 一种音频生成方法及设备

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115188364A (zh) * 2022-09-13 2022-10-14 南开大学 基于卷积网络和编码器解码器模型的多语种语音合成方法
CN115470507A (zh) * 2022-10-31 2022-12-13 青岛他坦科技服务有限公司 一种中小企业研发项目数据管理方法
CN115470507B (zh) * 2022-10-31 2023-02-07 青岛他坦科技服务有限公司 一种中小企业研发项目数据管理方法

Also Published As

Publication number Publication date
US20230402054A1 (en) 2023-12-14
CN112863481B (zh) 2023-11-03
CN112863481A (zh) 2021-05-28

Similar Documents

Publication Publication Date Title
WO2022179264A1 (zh) 一种音频生成方法及设备
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
WO2022116977A1 (zh) 目标对象的动作驱动方法、装置、设备及存储介质及计算机程序产品
US10776977B2 (en) Real-time lip synchronization animation
KR101874895B1 (ko) 증강 현실 제공 방법 및 이를 지원하는 단말기
JP7387890B2 (ja) 動画ファイルの生成方法、装置、端末及び記憶媒体
KR102497549B1 (ko) 오디오 신호 처리 방법 및 장치, 저장 매체
CN111259841B (zh) 一种图像处理方法及相关设备
WO2021213008A1 (zh) 一种视频的音画匹配方法、相关装置以及存储介质
WO2022007565A1 (zh) 增强现实的图像处理方法、装置、电子设备及存储介质
WO2022042290A1 (zh) 一种虚拟模型处理方法、装置、电子设备和存储介质
CN112785670B (zh) 一种形象合成方法、装置、设备及存储介质
WO2022055421A1 (zh) 基于增强现实的显示方法、设备及存储介质
CN113822972B (zh) 基于视频的处理方法、设备和可读介质
CN113923462A (zh) 视频生成、直播处理方法、设备和可读介质
EP4192021A1 (en) Audio data processing method and apparatus, and device and storage medium
CN113313797A (zh) 虚拟形象驱动方法、装置、电子设备和可读存储介质
WO2022218042A1 (zh) 视频处理方法、装置、视频播放器、电子设备及可读介质
WO2023273697A1 (zh) 图像处理方法、模型训练方法、装置、电子设备及介质
CN112785669B (zh) 一种虚拟形象合成方法、装置、设备及存储介质
EP4138381A1 (en) Method and device for video playback
CN116248811B (zh) 视频处理方法、装置及存储介质
JPWO2013008869A1 (ja) 電子機器及びデータ生成方法
CN114063965A (zh) 高解析音频生成方法、电子设备及其训练方法
CN115714888B (zh) 视频生成方法、装置、设备与计算机可读存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21927688

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 11.12.2023)