CN114827886A - Audio generation method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN114827886A
CN114827886A
Authority
CN
China
Prior art keywords
audio
information
signal
dimensional
signals
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210448723.2A
Other languages
Chinese (zh)
Inventor
陈联武
郑羲光
范欣悦
张晨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04S: STEREOPHONIC SYSTEMS
    • H04S7/00: Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30: Control circuits for electronic adaptation of the sound field
    • H04S7/302: Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S2400/00: Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/01: Multi-channel, i.e. more than two input channels, sound reproduction with two speakers wherein the multi-channel information is substantially preserved
    • H04S2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)

Abstract

The present disclosure provides an audio generation method, an audio generation apparatus, an electronic device, and a storage medium. The audio generation method may include: acquiring an audio signal to be processed; obtaining a plurality of audio track signals for a plurality of sound sources by performing audio track separation on the audio signal to be processed; determining user orientation information in a three-dimensional space and generating three-dimensional spatial metadata corresponding to each of the plurality of audio track signals; and generating a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated plurality of audio track signals, and the three-dimensional spatial metadata corresponding to each audio track signal. The method and apparatus can generate three-dimensional audio that is closer to real three-dimensional audio, increasing the sense of immersion and improving the user experience.

Description

Audio generation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to an audio generating method, an audio generating apparatus, an electronic device, a storage medium, and a program product.
Background
Three-dimensional (3D) audio refers to audio whose sound sources are distributed and move dynamically in a three-dimensional space; for example, as music plays, the human voice may move from far to near, while instruments such as drums and bass move among different positions in the space.
However, mainstream audio is currently in a two-channel stereo or single-channel format, and audio content in multi-channel and 3D formats is relatively scarce. Automatically converting conventional stereo audio and the like into 3D audio is therefore an important direction for the development of 3D audio.
Disclosure of Invention
The present disclosure provides an audio generation method, an audio generation apparatus, an electronic device, and a storage medium to solve at least the above-mentioned problems.
According to a first aspect of embodiments of the present disclosure, there is provided an audio generation method, which may include: acquiring an audio signal to be processed; obtaining a plurality of audio track signals for a plurality of sound sources by performing audio track separation on the audio signal to be processed; determining user orientation information in a three-dimensional space, and generating three-dimensional spatial metadata corresponding to each of the plurality of audio track signals, wherein the three-dimensional spatial metadata includes orientation information and sound width information of the corresponding sound source in the three-dimensional space; and generating a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated plurality of audio track signals, and the three-dimensional spatial metadata corresponding to each audio track signal.
As an example, generating three-dimensional spatial metadata corresponding to each soundtrack signal of the plurality of soundtrack signals may comprise: acquiring feature information of the audio signal to be processed, wherein the feature information includes at least one of beat information and structure information of the audio signal to be processed, and the structure information includes type information of each audio clip of the audio signal to be processed; and respectively generating three-dimensional space metadata corresponding to each audio track signal based on the characteristic information.
As an example, generating three-dimensional spatial metadata corresponding to each of the audio track signals respectively based on the feature information may include: for a human voice signal in the plurality of audio track signals, according to the type of each audio clip in the structure information, determining position adjustment information of a sound source corresponding to the human voice signal relative to the user orientation information in a three-dimensional space, and determining three-dimensional space metadata of the human voice signal based on the position adjustment information.
As an example, generating three-dimensional spatial metadata corresponding to each of the audio track signals based on the feature information may include: for an instrument signal among the plurality of audio track signals, determining movement information of a sound source corresponding to the instrument signal in the three-dimensional space according to the beat information and the type of each audio clip in the structure information, and determining the three-dimensional spatial metadata of the instrument signal based on the movement information.
As an example, generating three-dimensional spatial metadata corresponding to each soundtrack signal of the plurality of soundtrack signals may comprise: determining a preset template corresponding to each audio track signal from a plurality of preset templates, wherein the preset template comprises at least one of movement track information, movement speed information and sound width change information of a corresponding sound source in a three-dimensional space; and respectively generating three-dimensional space metadata for each audio track signal by using the determined preset template.
As an example, generating three-dimensional spatial metadata corresponding to each soundtrack signal of the plurality of soundtrack signals may comprise: acquiring setting information input by a user, wherein the setting information comprises at least one of a moving track, a moving speed and a sound width variation value of each sound source corresponding to the plurality of sound track signals in a three-dimensional space; generating three-dimensional spatial metadata for each of the track signals, respectively, based on the setting information.
As an example, generating a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated plurality of soundtrack signals, and the three-dimensional spatial metadata corresponding to each of the soundtrack signals may include: identifying a type of a playback device for playing back the three-dimensional audio signal; obtaining a rendering strategy corresponding to the type of the playing device, and generating a three-dimensional audio signal corresponding to the audio signal to be processed through the rendering strategy based on the user orientation information, the separated plurality of audio track signals and the three-dimensional spatial metadata corresponding to each audio track signal.
As an example, generating a three-dimensional audio signal corresponding to the audio signal to be processed by the rendering strategy may include: when the playback apparatus is an in-ear playback apparatus, for each track signal, generating a three-dimensional audio signal corresponding to the track signal based on orientation information corresponding to each audio frame of the track signal and the user orientation information; adjusting a sound width of a three-dimensional audio signal corresponding to the track signal based on the sound width information of each audio frame.
As an example, generating a three-dimensional audio signal corresponding to the audio signal to be processed by the rendering strategy may include: when the playback apparatus is an external playback apparatus, rendering, for each track signal, the track signal based on azimuth information of a sound source corresponding to the track signal and azimuth information of a plurality of speakers to generate a three-dimensional audio signal corresponding to the track signal; adjusting a sound width of a three-dimensional audio signal corresponding to the track signal based on sound width information of the sound source.
According to a second aspect of embodiments of the present disclosure, there is provided an audio generating apparatus, which may include: an acquisition module configured to acquire an audio signal to be processed; a sound track separation module configured to obtain a plurality of sound track signals for a plurality of sound sources by performing sound track separation on the audio signal to be processed; a metadata generation module configured to determine user orientation information in a three-dimensional space and generate three-dimensional space metadata corresponding to each of the plurality of soundtrack signals, wherein the three-dimensional space metadata includes orientation information and sound width information of a corresponding sound source in the three-dimensional space; a rendering module configured to generate a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated plurality of audio track signals, and the three-dimensional spatial metadata corresponding to each of the audio track signals.
As an example, the metadata generation module may be configured to: acquiring feature information of the audio signal to be processed, wherein the feature information includes at least one of beat information and structure information of the audio signal to be processed, and the structure information includes type information of each audio clip of the audio signal to be processed; and respectively generating three-dimensional space metadata corresponding to each audio track signal based on the characteristic information.
As an example, the metadata generation module may be configured to: for a human voice signal in the plurality of audio track signals, according to the type of each audio clip in the structure information, determining position adjustment information of a sound source corresponding to the human voice signal relative to the user orientation information in a three-dimensional space, and determining three-dimensional space metadata of the human voice signal based on the position adjustment information.
As an example, the metadata generation module may be configured to: for an instrument signal among the plurality of audio track signals, determine movement information of a sound source corresponding to the instrument signal in the three-dimensional space according to the beat information and the type of each audio clip in the structure information, and determine the three-dimensional spatial metadata of the instrument signal based on the movement information.
As an example, the metadata generation module may be configured to: determining a preset template corresponding to each audio track signal from a plurality of preset templates, wherein the preset template comprises at least one of movement track information, movement speed information and sound width change information of a corresponding sound source in a three-dimensional space; and respectively generating three-dimensional space metadata for each audio track signal by using the determined preset template.
As an example, the metadata generation module may be configured to: acquiring setting information input by a user, wherein the setting information comprises at least one of a moving track, a moving speed and a sound width variation value of each sound source corresponding to the plurality of sound track signals in a three-dimensional space; generating three-dimensional spatial metadata for each of the track signals, respectively, based on the setting information.
As one example, the rendering module may be configured to: identifying a type of a playback device for playing back the three-dimensional audio signal; obtaining a rendering strategy corresponding to the type of the playing device, and generating a three-dimensional audio signal corresponding to the audio signal to be processed through the rendering strategy based on the user orientation information, the separated plurality of audio track signals and the three-dimensional spatial metadata corresponding to each audio track signal.
As one example, the rendering module may be configured to: when the playback apparatus is an in-ear playback apparatus, for each track signal, generating a three-dimensional audio signal corresponding to the track signal based on orientation information corresponding to each audio frame of the track signal and the user orientation information; adjusting a sound width of a three-dimensional audio signal corresponding to the track signal based on the sound width information of each audio frame.
As one example, the rendering module may be configured to: when the playback apparatus is an external playback apparatus, rendering, for each track signal, the track signal based on azimuth information of a sound source corresponding to the track signal and azimuth information of a plurality of speakers to generate a three-dimensional audio signal corresponding to the track signal; adjusting a sound width of a three-dimensional audio signal corresponding to the track signal based on sound width information of the sound source.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus, which may include: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the audio generation method as described above.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the audio generation method as described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product, instructions of which are executed by at least one processor in an electronic device to perform the audio generation method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
by separating the audio track signals of a plurality of single sound sources from the audio to be processed and giving the sound sources a variety of different trajectories in three-dimensional space, the audio to be processed is converted into three-dimensional audio that is closer to real three-dimensional audio, increasing the sense of immersion of the three-dimensional audio and improving the user experience.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
Fig. 1 is a flow diagram of an audio generation method according to an embodiment of the present disclosure;
fig. 2 and 3 show schematic diagrams of a virtual three-dimensional space according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow diagram of an audio generation method according to another embodiment of the present disclosure;
fig. 5 is a schematic structural diagram of an audio generating device according to an embodiment of the present disclosure;
fig. 6 is a block diagram of an audio generation apparatus according to an embodiment of the present disclosure;
fig. 7 is a block diagram of an electronic device according to an embodiment of the disclosure.
Throughout the drawings, it should be noted that the same reference numerals are used to designate the same or similar elements, features and structures.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of the embodiments of the disclosure as defined by the claims and their equivalents. Various specific details are included to aid understanding, but these are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
The terms and words used in the following description and claims are not limited to the written meaning, but are used only by the inventors to achieve a clear and consistent understanding of the disclosure. Accordingly, it should be apparent to those skilled in the art that the following descriptions of the various embodiments of the present disclosure are provided for illustration only and not for the purpose of limiting the disclosure as defined by the appended claims and their equivalents.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
In the related art, a direct sound and a background sound are separated from a stereo signal using a conventional signal processing scheme, and the direct sound and the background sound are then processed differently to generate a 3D audio signal. However, a track separation algorithm based on conventional signal processing has limited effectiveness and cannot purposefully separate the track signals of single sound sources such as the human voice and individual musical instruments, so the immersion of the resulting 3D audio falls noticeably short of real 3D audio. In addition, the audio signal can be rotated to a certain degree in the three-dimensional space based on an audio stuck point (beat) detection result to construct a 3D audio effect. However, this approach does not treat each sound element separately on the basis of a track separation algorithm; for example, when the drum sound rotates, the human voice signal must rotate with it, which results in a monotonous 3D audio effect.
The present disclosure combines a series of techniques such as audio stuck point detection, audio track separation, and music structure analysis to give different three-dimensional spatial trajectories to audio tracks such as human voice, drum sound, bass, etc. in audio, convert traditional stereo audio into 3D audio signals, and finally generate immersive 3D audio content through audio space rendering techniques.
Hereinafter, according to various embodiments of the present disclosure, a method and apparatus of the present disclosure will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of an audio generation method according to an embodiment of the present disclosure. The audio generation method of fig. 1 may be used to convert conventional stereo audio, single channel audio, etc. into three-dimensional audio.
The audio generation method according to the present disclosure may be performed by any electronic device. The electronic device may be at least one of a smartphone, a tablet, a laptop computer, a desktop computer, and the like. The electronic device may be installed with a target application for implementing the three-dimensional audio generation method of the present disclosure.
Referring to fig. 1, in step S101, an audio signal to be processed is acquired. Here, the audio signal to be processed may be, for example, a stereo music signal, a single-channel music signal, or the like.
In step S102, a plurality of track signals for a plurality of sound sources are obtained by performing track separation on an audio signal to be processed.
A deep learning based track separation system may be used to perform track separation on the audio signal to be processed. Deep learning based track separation can separate the human voice and each instrument signal with high fidelity. For example, a deep learning based track separation system may consist of an encoder, a separation module, and a decoder. A short-time Fourier transform is first performed on the audio signal to be processed to obtain its spectrum data; the spectrum data is input into the encoder to obtain encoding features of the audio signal; the separation module then extracts target track features from the encoding features; and finally the decoder obtains, based on the target track features, a target masking matrix corresponding to the target track signal. After the spectrum data is multiplied by the target masking matrix, a short-time inverse Fourier transform is performed on the product to obtain the target track signal. Here, the target track signal may include one or more track signals. For example, after the above process is performed on a conventional stereo music signal, a plurality of track signals such as human voice, drums, bass, and other instruments can be obtained.
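As a concrete illustration of the mask-based separation flow described above, the following Python sketch applies an STFT, a hypothetical pre-trained model that predicts one magnitude mask per target track, and an inverse STFT. The function name, the `separation_model` interface, and the frame parameters are assumptions for illustration, not part of the disclosure.

```python
import numpy as np
from scipy.signal import stft, istft

def separate_tracks(audio, sr, separation_model, track_names):
    """Mask-based track separation: STFT -> predicted masks -> ISTFT per track.

    `separation_model` is a hypothetical callable that maps a magnitude
    spectrogram to one mask in [0, 1] per target track (e.g. vocals, drums, bass).
    """
    f, t, spec = stft(audio, fs=sr, nperseg=2048, noverlap=1536)
    magnitude = np.abs(spec)

    masks = separation_model(magnitude)           # dict: track name -> mask, same shape as spec
    tracks = {}
    for name in track_names:
        masked_spec = spec * masks[name]          # apply the target masking matrix
        _, track_audio = istft(masked_spec, fs=sr, nperseg=2048, noverlap=1536)
        tracks[name] = track_audio[: len(audio)]  # trim padding from the inverse transform
    return tracks
```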
In step S103, user orientation information in a three-dimensional space is determined, and three-dimensional space metadata for each of the plurality of separated track signals is generated. The user orientation information may include at least a part of three-dimensional position coordinates, a direction, a speed, a moving path, and the like of the virtual user at each time point or each time period in the three-dimensional space. For example, the user orientation information may be fixed to a central position in three-dimensional space. Three-dimensional spatial metadata for each sound source may be determined with reference to user orientation information. The three-dimensional spatial metadata of one soundtrack signal may be used to indicate how the corresponding sound source changes in three-dimensional space, e.g. may indicate how the sound source moves in three-dimensional space, sound width changes of the sound source in three-dimensional space, etc.
The three-dimensional spatial metadata may include azimuth information and sound width information of the corresponding sound source in a three-dimensional space. Here, the azimuth information of the sound source may include a part or all of three-dimensional position coordinates, a direction, a speed, a moving path, and the like of the corresponding sound source at each time point or each time period in the three-dimensional space.
Fig. 2 and 3 show schematic diagrams of a virtual three-dimensional space according to embodiments of the present disclosure.
Assume that there is a virtual three-dimensional space whose coordinate axes are an X-axis, a Y-axis, and a Z-axis, where the Z-axis represents height and the range of each coordinate axis is set to [0, 1], as shown in fig. 2.
The user/listener may be located at the center of the three-dimensional space, for example at (0.5, 0.5, 0.5). The white dots in fig. 3 may represent the respective sound sources, such as the human voice and instruments; the coordinates of a dot may represent the three-dimensional position of the sound source, and the size of the dot may represent its sound width. However, the above examples are merely exemplary, and the present disclosure is not limited thereto. Further, the sound source position in the three-dimensional spatial metadata may be a position relative to the user position.
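To make the metadata format concrete, the following sketch defines a per-frame metadata record in the [0, 1] coordinate space of fig. 2; the field names and values are illustrative assumptions, not part of the disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SpatialFrame:
    time: float                    # time of the audio frame, in seconds
    position: Tuple[float, float, float]  # (x, y, z) in [0, 1], listener at (0.5, 0.5, 0.5)
    width: float                   # sound width, e.g. 0.05 (point-like) to 0.5 (wide)

# The three-dimensional spatial metadata of one audio track is then a list of frames:
vocal_metadata: List[SpatialFrame] = [
    SpatialFrame(time=0.00, position=(0.5, 0.00, 0.5), width=0.05),
    SpatialFrame(time=0.02, position=(0.5, 0.01, 0.5), width=0.05),
]
```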
According to an embodiment of the present disclosure, three-dimensional spatial metadata for each track signal included in an audio signal to be processed may be generated based on feature information of the audio signal to be processed. The feature information may include at least one of tempo information and structure information of the audio signal to be processed.
Beat information may be obtained by performing audio stuck-point detection on an audio signal to be processed. For example, the audio stuck-point detection may be implemented by a deep learning-based model, which may mainly include a feature extraction module, a deep learning-based probability prediction module, and a global beat position estimation module. The beat position of the audio signal can be accurately judged based on the deep learning stuck point detection.
Feature extraction generally uses frequency-domain features; for example, the mel spectrum of the audio signal to be processed and its first-order difference may be used as input features of the feature extraction module, and the extracted features are then input to the probability prediction module. The probability prediction module can be implemented with a deep network such as a CRNN and learns the local and temporal features of the audio signal to be processed. With the probability prediction module, the probability of being a beat point can be calculated for each frame of audio data. Finally, based on the predicted probabilities, the globally optimal beat positions are obtained using the global beat position estimation module, which may be implemented with a dynamic programming algorithm. The generated beat positions may include both normal beats and downbeats. The above process of audio stuck point detection is merely exemplary, and the present disclosure is not limited thereto.
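The final stage can be illustrated with a simplified dynamic-programming beat picker over per-frame probabilities (assumed to come from the CRNN); the scoring scheme below is a generic simplification of common beat trackers, not the exact algorithm of the disclosure, and the parameter names are assumptions.

```python
import numpy as np

def track_beats(beat_prob, fps, period_frames, tol=3):
    """Pick beat frames from per-frame probabilities with dynamic programming.

    beat_prob:     1-D array of per-frame beat probabilities from the CRNN.
    fps:           number of probability frames per second.
    period_frames: expected beat period in frames (e.g. from a tempo estimate).
    tol:           allowed deviation from the expected period, in frames.
    """
    n = len(beat_prob)
    score = beat_prob.astype(float).copy()
    back = np.full(n, -1)
    for i in range(period_frames - tol, n):
        lo, hi = i - period_frames - tol, i - period_frames + tol
        prev = np.arange(max(lo, 0), max(hi, 1))
        if len(prev):
            best = prev[np.argmax(score[prev])]
            score[i] += score[best]    # chain this candidate beat to the best predecessor
            back[i] = best
    # backtrack from the best-scoring final beat candidate
    beats = [int(np.argmax(score))]
    while back[beats[-1]] >= 0:
        beats.append(int(back[beats[-1]]))
    return np.array(beats[::-1]) / fps   # beat times in seconds
```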
The structure information of the audio signal may be obtained by performing audio structure analysis on the audio signal to be processed. The structure information may include the type information and time information of each audio clip of the audio signal. Audio structure analysis refers to an algorithm that divides an audio signal into different types of segments, for example an introduction, a verse, a refrain, and transition segments. The audio structure analysis process mainly includes segmentation, clustering, and identification. Taking a music signal as an example, the input music signal is first framed, and the spectral features of each audio frame (such as mel-frequency cepstrum coefficients) are extracted. By calculating the feature correlation between frames, a correlation matrix of the music signal can be obtained. The music signal can be segmented into segments according to the correlation matrix, and the segments can be clustered according to the correlations between them. After the segmentation and clustering process, the time points of the music structure and the corresponding segments, for example in an a-b-c-b-c form, can be obtained for the input music signal. Finally, the verse, the refrain, and other parts of the music signal are identified based on the repetition count, volume, brightness, and other acoustic features of each segment. The above process of audio structure analysis is merely exemplary, and the present disclosure is not limited thereto.
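The segmentation step can be sketched as follows: MFCC features are extracted, a frame-to-frame correlation (self-similarity) matrix is built, and boundaries are placed where neighbouring regions are least similar. The clustering and verse/refrain identification steps are omitted, and the function name and parameter values are illustrative assumptions.

```python
import numpy as np
import librosa

def segment_boundaries(audio, sr, hop=512, smooth=16, num_segments=8):
    """Split a music signal into segments using an MFCC self-similarity matrix.

    Returns candidate boundary times (in seconds) where the local similarity
    between neighbouring regions is lowest; clustering of the segments and
    verse/refrain identification would follow as separate steps.
    """
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, hop_length=hop, n_mfcc=20)
    mfcc = mfcc / (np.linalg.norm(mfcc, axis=0, keepdims=True) + 1e-9)
    sim = mfcc.T @ mfcc                        # frame-to-frame correlation matrix

    # novelty: how different the region before each frame is from the region after it
    n = sim.shape[0]
    novelty = np.zeros(n)
    for i in range(smooth, n - smooth):
        before = sim[i - smooth:i, i - smooth:i].mean()
        after = sim[i:i + smooth, i:i + smooth].mean()
        across = sim[i - smooth:i, i:i + smooth].mean()
        novelty[i] = before + after - 2 * across
    boundary_frames = np.argsort(novelty)[-(num_segments - 1):]
    return np.sort(boundary_frames) * hop / sr
```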
After obtaining the beat information of the audio signal to be processed and the structure information of each audio clip, corresponding three-dimensional spatial metadata may be generated for different audio track signals.
For the human voice signal in the separated multiple audio track signals, according to the type of each audio clip in the structure information, the position adjustment information of the sound source corresponding to the human voice signal relative to the user direction information in the three-dimensional space can be determined, and the three-dimensional space metadata of the human voice signal can be determined based on the position adjustment information.
As an example, for a human voice signal in a soundtrack signal, a sound source corresponding to the human voice signal may be set to move in a three-dimensional space to a position where a user is located during a first preset type of audio clip, and information related to the movement may be taken as three-dimensional spatial metadata of the human voice signal.
Taking a music signal as an example, for the human voice signal, the distance between the corresponding sound source and the listener can be changed from far to near in the three-dimensional space during the verse part. For example, the sound source gradually approaches the listener from the coordinates (0.5, 0, 0.5) to (0.5, 0.3, 0.5) in the three-dimensional space of fig. 2, thereby increasing the sense of immersion of the music.
As another example, for a human voice signal in a soundtrack signal, a height coordinate of a sound source corresponding to the human voice signal in a three-dimensional space may be increased to a predetermined height and a sound width of the sound source may be increased to a predetermined sound width during a second preset type of audio clip, and information related to the increase to the predetermined height and the predetermined sound width may be used as three-dimensional spatial metadata of the human voice signal.
Taking a music signal as an example, for the human voice signal, the height of the corresponding sound source in the three-dimensional coordinates and the width of the overall sound can be gradually increased during the transition section between the verse and the refrain, so that the sound source reaches a certain height and sound width at the refrain, enhancing the impact of the refrain. For example, the position coordinates of the sound source change from (0.5, 0, 0.5) to (0.5, 0, 1) in the three-dimensional space of fig. 2, and the sound width value changes from 0.05 to 0.5. The height and width values may change following a preset linear or non-linear function.
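As a sketch of the ramp described above, the snippet below linearly interpolates the vocal source position and sound width over the verse-to-refrain transition; the frame rate, function name, and default values (taken from the example coordinates and widths) are assumptions, and a non-linear easing curve could replace the linear `alpha`.

```python
import numpy as np

def vocal_transition_metadata(t_start, t_end, fps=50,
                              start_pos=(0.5, 0.0, 0.5), end_pos=(0.5, 0.0, 1.0),
                              start_width=0.05, end_width=0.5):
    """Linearly ramp the vocal source height and sound width over a transition segment."""
    times = np.arange(t_start, t_end, 1.0 / fps)
    alpha = (times - t_start) / (t_end - t_start)      # 0 -> 1 over the segment
    positions = (1 - alpha)[:, None] * np.array(start_pos) + alpha[:, None] * np.array(end_pos)
    widths = (1 - alpha) * start_width + alpha * end_width
    return [{"time": float(t), "position": tuple(p), "width": float(w)}
            for t, p, w in zip(times, positions, widths)]
```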
For the instrument signals in the separated multiple track signals, the movement information of the sound source corresponding to the instrument signals in the three-dimensional space can be determined according to the beat information and the types of the audio segments in the structure information, and the three-dimensional space metadata of the instrument signals can be determined based on the movement information.
As an example, for an instrument signal in the track signals, the sound source corresponding to the instrument signal may be set to move periodically in the three-dimensional space along a predetermined trajectory according to the beat information, and information related to the movement may be used as the three-dimensional spatial metadata of the instrument signal. Taking a music signal as an example, the sound sources corresponding to instruments such as drums and bass may change periodically along a specific trajectory in the three-dimensional space following the beats. For example, the instrument sound source can rotate in the three-dimensional space along a specific track and be positioned at the middle of the listener's head on the downbeats, further improving the listening experience.
As yet another example, for an instrument signal in the track signals, the rotation speed of the sound source corresponding to the instrument signal in the three-dimensional space may be increased to a predetermined rotation speed during a second preset type of audio clip, and information related to the increase to the predetermined rotation speed may be used as the three-dimensional spatial metadata of the instrument signal. Taking a music signal as an example, for the sound sources corresponding to instruments such as drums and bass, the spatial rotation speed of the instrument sound sources can be increased in the refrain part according to the characteristics of the verse and the refrain, improving the dynamic feel of the refrain.
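A sketch of beat-synchronized rotation for an instrument track: the source advances around the listener by a fixed fraction of a circle per beat, using beat times assumed to come from the stuck point detection, and the angular rate can be scaled up in refrain segments via `speedup`. The radius, height, frame rate, and width values are illustrative assumptions.

```python
import numpy as np

def drum_rotation_metadata(beat_times, revolutions_per_beat=0.25,
                           radius=0.4, height=0.5, fps=50, speedup=1.0):
    """Rotate an instrument source around the listener, advancing with the beat grid."""
    t_grid = np.arange(beat_times[0], beat_times[-1], 1.0 / fps)
    # fractional number of beats elapsed at each metadata frame
    beats_elapsed = np.interp(t_grid, beat_times, np.arange(len(beat_times)))
    angles = 2 * np.pi * revolutions_per_beat * speedup * beats_elapsed
    frames = []
    for t, a in zip(t_grid, angles):
        x = 0.5 + radius * np.cos(a)      # circle centred on the listener at (0.5, 0.5)
        y = 0.5 + radius * np.sin(a)
        frames.append({"time": float(t), "position": (x, y, height), "width": 0.1})
    return frames
```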
According to another embodiment of the present disclosure, a preset template respectively corresponding to each of the separated audio track signals may be determined from a plurality of preset templates, and three-dimensional spatial metadata for each of the audio track signals may be respectively generated using the determined preset template.
Each preset template may include at least one of movement trajectory information, movement speed information, and sound width variation information of the sound source in the three-dimensional space. The preset template may preset a change process of the sound source in the three-dimensional space, for example, how the sound source moves in the three-dimensional space, a sound width change of the sound source in the three-dimensional space, and the like. After the audio signal to be processed is separated into a plurality of audio track signals, a preset template may be assigned for each separated audio track signal, so that the audio track signals are changed according to information in the corresponding template. For example, a preset template matching the soundtrack signal may be determined from a plurality of preset templates based on the characteristics of each soundtrack signal.
As an example, the preset templates may include templates for moving the distance between the human voice source and the listener from far to near in the verse part, gradually increasing the height of the human voice source in the three-dimensional coordinates and the width of the overall sound in the transition section between the verse and the refrain, periodically changing an instrument sound source along a specific trajectory in the three-dimensional space following the tempo, increasing the spatial rotation speed of an instrument sound source in the refrain part, and the like. However, the above examples are merely exemplary, and the present disclosure is not limited thereto. By applying a corresponding preset template to each audio track signal, three-dimensional spatial metadata that better matches the attributes of each audio track signal is generated, so that a more vivid three-dimensional audio signal can be generated.
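One possible way to express such preset templates is as plain dictionaries keyed by track type, as sketched below; the keys and values are hypothetical and only illustrate the idea of assigning a matching template to each separated track.

```python
# Hypothetical preset templates; each entry describes how the corresponding
# sound source should move and how its sound width should change.
PRESET_TEMPLATES = {
    "vocals": {
        "trajectory": "approach_listener",       # far-to-near movement in the verse
        "transition": {"height": (0.5, 1.0), "width": (0.05, 0.5)},  # ramp into the refrain
    },
    "drums": {
        "trajectory": "circle_on_beats",         # periodic rotation following the beat grid
        "speed": {"verse": 1.0, "refrain": 2.0}, # faster rotation in the refrain
    },
}

def template_for(track_name):
    """Pick the preset template matching a separated track, with a default fallback."""
    return PRESET_TEMPLATES.get(track_name, {"trajectory": "static"})
```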
According to another embodiment of the present disclosure, setting information input by a user may be acquired, and the setting information may include at least one of a movement trajectory, a movement speed, and a sound width variation value of each sound source corresponding to a plurality of track signals in a three-dimensional space, and then three-dimensional spatial metadata for each track signal may be separately generated based on the setting information.
As an example, three-dimensional spatial metadata for each of the separated audio track signals may be separately generated based on user input. The user input may be used to set at least one of a moving trajectory, a moving speed, and a sound width variation value of each sound source corresponding to the plurality of track signals in the three-dimensional space. The user can freely define the variations for each audio track signal according to the understanding of the audio content to be processed.
Taking a music signal as an example, based on his or her understanding of the music content, the user can customize the movement trajectory of the bass sound source in the three-dimensional space while the refrain is playing, and the user-defined trajectory can be applied to the bass sound source when the three-dimensional music signal is generated. However, the above examples are merely exemplary, and the present disclosure is not limited thereto. Users can set how each sound source changes in the three-dimensional space according to their preferences and their understanding of the music content, so as to obtain the music effect they expect, satisfying user needs and improving the user experience.
According to still another example of the present disclosure, the three-dimensional spatial metadata for the separated plurality of track signals may be generated using preset templates based on the beat information of the audio signal to be processed and the type of each audio clip. That is, the sound source corresponding to each track signal can change in the three-dimensional space according to its preset template at the corresponding beat positions and segment parts.
In step S104, a three-dimensional audio signal corresponding to the audio signal to be processed is generated based on the user orientation information, the separated plurality of track signals, and the three-dimensional spatial metadata corresponding to each track signal. The generated three-dimensional audio signal may include separate soundtrack signals and corresponding three-dimensional spatial metadata for each soundtrack signal, and a final 3D audio effect may be achieved through spatial audio rendering techniques.
According to an embodiment of the present disclosure, when generating a three-dimensional audio signal, the three-dimensional audio signal may be rendered in different ways for different playback devices, taking into account the type of playback device used to play the three-dimensional audio signal.
Specifically, the type of a playback apparatus for playing back a three-dimensional audio signal may be identified, a rendering policy corresponding to the type of the playback apparatus may be acquired, and then a three-dimensional audio signal corresponding to an audio signal to be processed may be generated through the acquired rendering policy based on user orientation information, the separated plurality of audio track signals, and three-dimensional spatial metadata corresponding to each audio track signal.
For an in-ear playback apparatus, a three-dimensional audio signal corresponding to each separated track signal may be generated based on the orientation information corresponding to each audio frame of the track signal and the user orientation information, and the sound width of the three-dimensional audio signal corresponding to the track signal may be adjusted based on the sound width information of each audio frame. For example, based on a head related transfer function (HRTF), each audio frame of the track signal may be convolved with the HRTF corresponding to its three-dimensional coordinates to obtain a corresponding binaural audio signal.
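A sketch of this headphone rendering path: each frame of a track is convolved with the left and right head-related impulse responses for its current position and overlap-added into a two-channel output. The `hrtf_lookup` helper, frame length, and output headroom are assumptions (the HRIRs are assumed shorter than 4096 samples); sound width adjustment is omitted.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(track, sr, metadata, hrtf_lookup, frame_len=1024):
    """Convolve each audio frame of a track with the HRTF for its current position.

    `metadata` gives one position record per frame; `hrtf_lookup` is a
    hypothetical function returning (hrir_left, hrir_right) for a position.
    """
    out_len = len(track) + frame_len + 4096   # headroom, assuming HRIRs < 4096 taps
    out = np.zeros((2, out_len))
    for i, meta in enumerate(metadata):
        start = i * frame_len
        frame = track[start:start + frame_len]
        if len(frame) == 0:
            break
        hrir_l, hrir_r = hrtf_lookup(meta["position"])
        out[0, start:start + len(frame) + len(hrir_l) - 1] += fftconvolve(frame, hrir_l)
        out[1, start:start + len(frame) + len(hrir_r) - 1] += fftconvolve(frame, hrir_r)
    return out
```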
For an external (loudspeaker) playback apparatus, each track signal may be rendered based on the azimuth information of the sound source corresponding to the track signal and the azimuth information of the plurality of speakers to generate a three-dimensional audio signal corresponding to the track signal, and the sound width of the three-dimensional audio signal corresponding to the track signal may be adjusted based on the sound width information of the sound source. For example, based on the vector base amplitude panning (VBAP) technique, the unit direction vector of the sound may be expressed as a linear combination of the unit direction vectors of the speakers closest to the sound source direction, and a gain factor for each speaker may be calculated, thereby rendering the 3D music effect over the plurality of speakers.
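A sketch of the VBAP gain computation: the unit vector toward the source is expressed as a linear combination of the unit vectors of the three speakers closest to it, and the solved, normalized coefficients are used as speaker gains. The triplet selection by dot product is a simplification of full VBAP triangulation, and the function name is an assumption.

```python
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """Compute per-speaker gains for one source direction using 3-speaker VBAP.

    source_dir:   vector toward the sound source, shape (3,)
    speaker_dirs: unit vectors toward the speakers, shape (num_speakers, 3)
    """
    source_dir = np.asarray(source_dir, dtype=float)
    source_dir /= np.linalg.norm(source_dir)
    # pick the three speakers closest to the source direction
    closeness = speaker_dirs @ source_dir
    triplet = np.argsort(closeness)[-3:]
    base = speaker_dirs[triplet]               # 3 x 3 matrix of speaker unit vectors
    # source_dir = g1*l1 + g2*l2 + g3*l3  ->  solve base.T @ g = source_dir
    g = np.linalg.solve(base.T, source_dir)
    g = np.clip(g, 0.0, None)                  # negative gains mean the source lies outside the triplet
    g /= np.linalg.norm(g) + 1e-9              # power normalization
    gains = np.zeros(len(speaker_dirs))
    gains[triplet] = g
    return gains
```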
In consideration of the types of the playback devices, three-dimensional audio signals that make the playback effects of the playback devices of the respective types better can be generated.
The 3D audio generation method described above can automatically convert traditional two-channel stereo music into 3D music and increase the sense of immersion of the music.
Fig. 4 is a flowchart illustration of an audio generation method according to another embodiment of the present disclosure. In fig. 4, a description is given taking an example of converting stereo music/mono music into 3D music. However, the system shown in fig. 4 may also be used to convert any form of audio into 3D audio.
The input mono or stereo music is passed through the track separation module to extract a plurality of preset track signals, such as the audio signals of sound sources like human voice, drums, and bass. Meanwhile, the beat point information of the music is extracted from the input mono or stereo music by the audio stuck point detection module, and the structure information of music segments such as the verse and the refrain is extracted by the audio structure analysis module. Based on the beat point information and the music structure information, the three-dimensional spatial metadata of each audio track signal is determined by the three-dimensional metadata generation module according to preset templates or user editing. The spatial audio rendering module may obtain the final 3D music signal from the separated audio track signals and the corresponding three-dimensional spatial metadata using a spatial audio rendering technique.
The audio track separation module may be implemented based on deep learning and may include, for example, an encoder, a decoder, and a separator. The input time-domain music signal is passed through an STFT module to obtain the corresponding music spectrum signal; the spectrum signal is passed through an encoder formed of several convolution layers, the separator then extracts audio track features, and finally a decoder obtains the target masking matrix corresponding to the target audio track signal. After the music spectrum signal is multiplied by the target masking matrix, the target audio track signal, such as human voice, drums, bass, or other instruments, can be obtained through the ISTFT module.
The audio stuck point detection module can be implemented based on deep learning and may include, for example, a feature extraction module, a probability prediction module based on a depth model, and a global beat position estimation module. Feature extraction typically uses frequency-domain features; in one implementation, the mel spectrum and its first-order difference are used as input features. The probability prediction module is usually implemented with a deep network such as a CRNN to learn local and temporal features, and it calculates, for each frame of audio data, the probability of that frame being a beat point. Finally, based on the predicted probabilities, the global beat position estimation module obtains the globally optimal beat positions using a dynamic programming algorithm. The generated beat positions may include both normal beats and downbeats.
The music structure analysis module can divide the music signal into different segments through an algorithm, for example an introduction, a verse, a refrain, and transition segments. The music structure analysis process mainly includes segmentation, clustering, and identification. The music signal is first framed, and the spectral features of each audio frame, such as mel-frequency cepstrum coefficients (MFCC), are extracted. By calculating the feature correlation between frames, a correlation matrix of the music signal can be obtained. The music signal can be segmented into segments according to the correlation matrix, and the segments can be clustered according to the correlations between them. After the segmentation and clustering process, a music structure similar to the a-b-c-b-c form and the corresponding segment time points can be obtained for the music signal to be processed. Finally, based on the repetition count of the segments and acoustic features such as volume and brightness, parts of the music such as the verse and the refrain can be identified.
The three-dimensional metadata generation module can generate corresponding three-dimensional spatial metadata for different audio track signals based on the music beat information and the music structure information. The three-dimensional space metadata corresponds to information of the audio track signal in a three-dimensional space, and specifically mainly includes three-dimensional position coordinates, sound width and the like.
The template of each sound source can be customized by a user according to the understanding of the user on the music content, and the three-dimensional spatial metadata of each sound source can be automatically generated through some preset templates.
For example, the preset templates may include templates for moving the distance between the vocal sound source and the listener from far to near in the verse portion, gradually increasing the height of the vocal sound source in the three-dimensional coordinates and the width of the overall sound in the transition section between the verse and the refrain, periodically changing an instrument sound source along a specific trajectory in the three-dimensional space following the tempo, increasing the spatial rotation speed of an instrument sound source in the refrain portion, and the like. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
The 3D music signal generated by the spatial audio rendering module may include the separated audio track signals and the three-dimensional metadata corresponding to each audio track, and the final 3D music effect may be obtained through a spatial audio rendering technique. For example, for a headphone playback device, each audio frame of an input audio track may be convolved with the head related transfer function (HRTF) corresponding to its three-dimensional coordinates, resulting in a corresponding binaural audio signal. For a multi-speaker playback device, based on the vector base amplitude panning (VBAP) technique, the unit direction vector of the sound may be expressed as a linear combination of the unit direction vectors of the speakers closest to the sound source direction, and a gain factor for each speaker may be calculated, thereby rendering the 3D music effect over the plurality of speakers.
Fig. 5 is a schematic structural diagram of an audio generation device of a hardware operating environment according to an embodiment of the present disclosure.
As shown in fig. 5, the audio generating apparatus 500 may include: a processing component 501, a communication bus 502, a network interface 503, an input-output interface 504, a memory 505, and a power component 506. Wherein a communication bus 502 is used to enable connective communication between these components. The input-output interface 504 may include a video display (such as a liquid crystal display), a microphone and speakers, and a user-interaction interface (such as a keyboard, mouse, touch-input device, etc.), and optionally, the input-output interface 504 may also include a standard wired interface, a wireless interface. The network interface 503 may optionally include a standard wired interface, a wireless interface (e.g., a wireless fidelity interface). The memory 505 may be a high speed random access memory or may be a stable non-volatile memory. The memory 505 may alternatively be a storage device separate from the processing component 501 described previously.
Those skilled in the art will appreciate that the configuration shown in fig. 5 does not constitute a limitation of the audio generating device 500, and may include more or fewer components than shown, or some components may be combined, or a different arrangement of components.
As shown in fig. 5, the memory 505, which is a kind of storage medium, may include therein an operating system (such as a MAC operating system), a data storage module, a network communication module, a user interface module, a program, and a database.
In the audio generating apparatus 500 shown in fig. 5, the network interface 503 is mainly used for data communication with an external electronic apparatus/terminal; the input/output interface 504 is mainly used for data interaction with a user; the processing component 501 and the memory 505 in the audio generating apparatus 500 may be provided in the audio generating apparatus 500, and the audio generating apparatus 500 performs the audio generating method provided by the embodiment of the present disclosure by the processing component 501 calling the program stored in the memory 505 and various APIs provided by the operating system.
The processing component 501 may include at least one processor, with a set of computer-executable instructions stored in the memory 505 that, when executed by the at least one processor, perform an audio generation method in accordance with embodiments of the present disclosure. However, the above examples are merely exemplary, and the present disclosure is not limited thereto.
The processing component 501 may implement control of the components included in the audio generating device 500 by executing a program.
By way of example, the audio generating device 500 may be a PC computer, tablet device, personal digital assistant, smart phone, or other device capable of executing the set of instructions described above. Here, the audio-generating device 500 need not be a single electronic device, but can be any collection of devices or circuits that can execute the above-described instructions (or sets of instructions), either individually or in combination. The audio generating device 500 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote (e.g., via wireless transmission).
In the audio generation apparatus 500, the processing component 501 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a dedicated processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processing component 501 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processing component 501 may execute instructions or code stored in a memory, wherein the memory 505 may also store data. Instructions and data may also be sent and received over a network via the network interface 503, where the network interface 503 may employ any known transmission protocol.
The memory 505 may be integrated with the processing component 501, for example, by having RAM or flash memory disposed within an integrated circuit microprocessor or the like. The memory 505 may also comprise a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 505 and the processing component 501 may be operatively coupled, or may communicate with each other, for example through I/O ports or network connections, so that the processing component 501 can read data stored in the memory 505.
Fig. 6 is a block diagram of an audio generation apparatus according to an embodiment of the present disclosure.
Referring to fig. 6, the audio generating apparatus 600 may include an acquisition module 601, a track separation module 602, a metadata generation module 603, and a rendering module 604. Each module in the audio generating apparatus 600 may be implemented by one or more modules, and names of the corresponding modules may vary according to types of the modules. In various embodiments, some modules in the audio generation apparatus 600 may be omitted, or additional modules may also be included. Furthermore, modules/elements according to various embodiments of the present disclosure may be combined to form a single entity, and thus may equivalently perform the functions of the respective modules/elements prior to combination.
The obtaining module 601 may obtain an audio signal to be processed.
The track separation module 602 may obtain a plurality of track signals for a plurality of sound sources by performing track separation on the audio signal to be processed.
The metadata generation module 603 may determine user orientation information in a three-dimensional space and generate three-dimensional space metadata corresponding to each of the separated plurality of soundtrack signals, wherein the three-dimensional space metadata may include orientation information and sound width information of a corresponding sound source in the three-dimensional space.
Optionally, the metadata generating module 603 may obtain feature information of the audio signal to be processed, where the feature information may include at least one of beat information of the audio signal to be processed and structure information of each audio segment; three-dimensional spatial metadata for each of the audio track signals is generated based on the feature information.
Alternatively, the metadata generating module 603 may obtain the beat information by performing beat-point detection on the audio signal to be processed, and obtain the structure information of each audio clip by performing audio structure analysis on the audio signal to be processed.
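By way of illustration and not limitation, the Python sketch below obtains beat information with the librosa beat tracker and coarse segment boundaries from agglomerative clustering of chroma features. The segment types are left as placeholders, since classifying clips (e.g., into verse or chorus) would require an additional model not shown here.

```python
# Hedged sketch of beat and structure analysis using librosa.
import librosa

def analyze_audio(path, n_segments=6):
    y, sr = librosa.load(path, sr=None, mono=True)
    # Beat information: tempo estimate and beat times in seconds.
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_times = librosa.frames_to_time(beat_frames, sr=sr)
    # Structure information: coarse segment boundaries from chroma features.
    chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
    bound_frames = librosa.segment.agglomerative(chroma, n_segments)
    bound_times = librosa.frames_to_time(bound_frames, sr=sr)
    segments = [{"start": float(t), "type": "unlabeled"} for t in bound_times]
    return {"tempo": float(tempo), "beats": beat_times, "segments": segments}
```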
Alternatively, the metadata generating module 603 may determine, for a human voice signal in the plurality of audio track signals, position adjustment information of a sound source corresponding to the human voice signal with respect to user orientation information in a three-dimensional space according to the type of each audio clip in the structure information, and determine three-dimensional spatial metadata of the human voice signal based on the position adjustment information.
For example, the metadata generation module 603 may set, for a human voice signal among the plurality of track signals, a sound source corresponding to the human voice signal to move in a three-dimensional space toward a position where the user is located during the first preset type of audio clip, and take information related to the movement as three-dimensional spatial metadata of the human voice signal.
For another example, for a human voice signal among the plurality of track signals, the metadata generation module 603 may increase the height coordinate of the sound source corresponding to the human voice signal in the three-dimensional space to a predetermined height and increase the sound width of the sound source to a predetermined sound width during the second preset type of audio clip, and take the information related to this increase as the three-dimensional spatial metadata of the human voice signal.
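By way of illustration and not limitation, the sketch below derives per-frame three-dimensional metadata for a vocal track along the lines described above. The segment-type names ("approach", "lift"), the step sizes, and the numeric limits are illustrative assumptions, not values prescribed by the method.

```python
import numpy as np

def vocal_metadata(frame_times, segments, user_pos=(0.0, 0.0, 0.0)):
    """Return a list of {"position", "width"} dicts, one per audio frame."""
    meta = []
    pos = np.array([0.0, 2.0, 0.0])        # start 2 m in front of the listener
    width = 0.3
    user = np.array(user_pos, dtype=float)
    for t in frame_times:
        seg_type = segment_type_at(segments, t)
        if seg_type == "approach":
            # Move the vocal source toward the listener position.
            pos = pos + 0.02 * (user - pos)
        elif seg_type == "lift":
            # Raise the source and widen it during this segment type.
            pos[2] = min(pos[2] + 0.01, 1.5)
            width = min(width + 0.005, 1.0)
        meta.append({"position": pos.copy(), "width": width})
    return meta

def segment_type_at(segments, t):
    """Look up the type of the segment containing time t (assumed helper)."""
    current = "other"
    for seg in segments:
        if seg["start"] <= t:
            current = seg["type"]
    return current
```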
Alternatively, for an instrument signal among the plurality of track signals, the metadata generation module 603 may determine movement information of the sound source corresponding to the instrument signal in the three-dimensional space according to the beat information and the type of each audio clip in the structure information, and determine the three-dimensional spatial metadata of the instrument signal based on the movement information.
For example, for an instrument signal among the plurality of track signals, the metadata generation module 603 may set the sound source corresponding to the instrument signal to move periodically in the three-dimensional space along a predetermined trajectory according to the beat information, and take the information related to the movement as the three-dimensional spatial metadata of the instrument signal.
For another example, for an instrument signal among the plurality of track signals, the metadata generation module 603 may increase the rotation speed of the sound source corresponding to the instrument signal in the three-dimensional space to a predetermined rotation speed during the second preset type of audio clip, and take the information related to this increase as the three-dimensional spatial metadata of the instrument signal.
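By way of illustration and not limitation, the sketch below moves an instrument source along a circular trajectory whose angular speed is tied to the detected tempo and doubled inside designated segments. The one-turn-per-eight-beats rate, the radius, and the doubling factor are illustrative assumptions.

```python
import numpy as np

def instrument_metadata(frame_times, tempo_bpm, lift_intervals=(), radius=2.0):
    """lift_intervals: (start, end) times of the faster-rotation segments."""
    meta = []
    angle = 0.0
    base_step = 2.0 * np.pi * (tempo_bpm / 60.0) / 8.0  # one full turn per 8 beats
    prev_t = frame_times[0] if len(frame_times) else 0.0
    for t in frame_times:
        speed = base_step
        if any(start <= t < end for start, end in lift_intervals):
            speed *= 2.0                      # faster rotation in these segments
        angle += speed * (t - prev_t)
        prev_t = t
        pos = np.array([radius * np.cos(angle), radius * np.sin(angle), 0.0])
        meta.append({"position": pos, "width": 0.3})
    return meta
```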
Alternatively, the metadata generation module 603 may determine a preset template corresponding to each of the audio track signals, respectively, from a plurality of preset templates, where the preset template includes at least one of movement trajectory information, movement speed information, and sound width variation information of the sound source in the three-dimensional space; and respectively generating three-dimensional space metadata for each audio track signal by using the determined preset template.
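By way of illustration and not limitation, such preset templates could be held in a simple lookup table; the field names, track names, and values below are illustrative assumptions about what a template might contain.

```python
# Hedged sketch of a preset-template table keyed by track type.
PRESET_TEMPLATES = {
    "vocals": {
        "trajectory": "approach_listener",
        "speed_m_per_s": 0.2,
        "width_change": {"start": 0.3, "end": 0.8},
    },
    "drums": {
        "trajectory": "circle",
        "speed_m_per_s": 1.0,
        "width_change": {"start": 0.4, "end": 0.4},
    },
    "bass": {
        "trajectory": "static_front",
        "speed_m_per_s": 0.0,
        "width_change": {"start": 0.6, "end": 0.6},
    },
}

def template_for(track_name):
    # Fall back to a neutral static template for unknown track types.
    return PRESET_TEMPLATES.get(track_name, PRESET_TEMPLATES["bass"])
```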
Alternatively, the metadata generation module 603 may generate three-dimensional spatial metadata for a plurality of audio track signals according to a predetermined template based on the feature information of the audio signal to be processed.
Alternatively, the metadata generation module 603 may acquire setting information input by the user, wherein the setting information includes at least one of a movement trajectory, a movement speed, and a sound width variation value of each sound source corresponding to a plurality of track signals in a three-dimensional space, and generate three-dimensional space metadata for each track signal, respectively, based on the setting information.
For example, the metadata generation module 603 may receive a user input for setting at least one of a movement trajectory, a movement speed, and a sound width variation value of each sound source corresponding to a plurality of track signals in a three-dimensional space; three-dimensional spatial metadata for a plurality of audio track signals is respectively generated based on a user input.
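By way of illustration and not limitation, the sketch below turns user-entered settings into per-frame metadata for one track. The settings keys and the linear-interpolation scheme are illustrative assumptions; a movement speed could additionally be applied by time-scaling the trajectory, which is omitted here for brevity.

```python
import numpy as np

def metadata_from_settings(settings, n_frames):
    """settings example: {"trajectory": [(x, y, z), ...],
    "width": {"start": 0.3, "end": 0.8}} for one track."""
    points = np.asarray(settings["trajectory"], dtype=float)
    widths = np.linspace(settings["width"]["start"],
                         settings["width"]["end"], n_frames)
    # Resample the user-drawn trajectory to one position per audio frame.
    t_in = np.linspace(0.0, 1.0, len(points))
    t_out = np.linspace(0.0, 1.0, n_frames)
    pos = np.stack([np.interp(t_out, t_in, points[:, d]) for d in range(3)],
                   axis=1)
    return [{"position": pos[i], "width": float(widths[i])}
            for i in range(n_frames)]
```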
The rendering module 604 may generate a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated plurality of audio track signals, and the three-dimensional spatial metadata corresponding to each audio track signal.
The rendering module 604 may identify a type of a playback apparatus for playing the three-dimensional audio signal, acquire a rendering policy corresponding to the type of the playback apparatus, and generate a three-dimensional audio signal corresponding to the audio signal to be processed through the acquired rendering policy based on the user orientation information, the separated plurality of audio track signals, and the three-dimensional spatial metadata corresponding to each audio track signal.
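By way of illustration and not limitation, the sketch below dispatches each separated track to a renderer chosen by playback-device type and mixes the results. The device-type strings and the render_binaural / render_to_speakers functions it refers to are assumptions, corresponding to the renderer sketches that follow.

```python
def render_three_dimensional(tracks, metadata, user_orientation, device_type):
    """tracks: {name: signal}; metadata: {name: per-frame metadata list}."""
    if device_type in ("headphones", "earbuds"):
        renderer = render_binaural          # in-ear: binaural rendering
    else:
        renderer = render_to_speakers       # external: loudspeaker panning
    mixed = None
    for name, signal in tracks.items():
        rendered = renderer(signal, metadata[name], user_orientation)
        mixed = rendered if mixed is None else mixed + rendered
    return mixed
```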
For an in-ear playback device, for each of the plurality of audio track signals, the rendering module 604 may generate a three-dimensional audio signal corresponding to the audio track signal based on the orientation information of the corresponding sound source in each audio frame of the audio track signal and the user orientation information, and adjust the sound width of the three-dimensional audio signal corresponding to the audio track signal based on the sound width information of each audio frame.
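By way of illustration and not limitation, the sketch below is a minimal binaural renderer that approximates the frame-wise source direction with interaural time and level differences only; a production system would more likely convolve each frame with measured HRTFs, and the sound-width adjustment (e.g., by mixing in a decorrelated, widened copy) is omitted for brevity.

```python
import numpy as np

def render_binaural(signal, meta, user_orientation, sr=48000,
                    frame_len=1024, head_radius=0.0875, c=343.0):
    """meta: one {"position": (x, y, z), "width": w} dict per frame of `signal`."""
    out = np.zeros((2, len(signal) + sr))    # extra tail room for the ITD delay
    yaw = user_orientation.get("yaw", 0.0)   # listener yaw in radians
    for i, m in enumerate(meta):
        start = i * frame_len
        frame = np.asarray(signal[start:start + frame_len], dtype=float)
        if frame.size == 0:
            break
        x, y, _ = m["position"]
        azimuth = np.arctan2(x, y) - yaw      # angle relative to the look direction
        # Woodworth-style interaural time difference and a constant-power level pan.
        itd = head_radius / c * (azimuth + np.sin(azimuth))
        delay = int(round(abs(itd) * sr))
        gain_l = np.sqrt(0.5 * (1.0 - np.sin(azimuth)))
        gain_r = np.sqrt(0.5 * (1.0 + np.sin(azimuth)))
        if itd >= 0:                          # source on the right: left ear delayed
            out[0, start + delay:start + delay + frame.size] += frame * gain_l
            out[1, start:start + frame.size] += frame * gain_r
        else:
            out[0, start:start + frame.size] += frame * gain_l
            out[1, start + delay:start + delay + frame.size] += frame * gain_r
    return out[:, :len(signal)]
```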
For an external playback device, for each of the plurality of audio track signals, the rendering module 604 may render the audio track signal based on the orientation information of the sound source corresponding to the audio track signal and the orientation information of the plurality of speakers to generate a three-dimensional audio signal corresponding to the audio track signal, and adjust the sound width of the three-dimensional audio signal corresponding to the audio track signal based on the sound width information of the sound source.
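By way of illustration and not limitation, the sketch below pans each frame between the two loudspeakers whose azimuths bracket the source direction, using constant-power gains. The four-speaker layout and the assumption that the listener is centred in it are illustrative, not fixed by the method.

```python
import numpy as np

def render_to_speakers(signal, meta, user_orientation=None, frame_len=1024,
                       speaker_azimuths=(-3 * np.pi / 4, -np.pi / 4,
                                         np.pi / 4, 3 * np.pi / 4)):
    """Pairwise constant-power panning over a horizontal loudspeaker ring
    (the listener is assumed centred, so user_orientation is not used here)."""
    az = np.asarray(speaker_azimuths)
    order = np.argsort(az)
    n_spk = len(az)
    out = np.zeros((n_spk, len(signal)))
    for i, m in enumerate(meta):
        start = i * frame_len
        frame = np.asarray(signal[start:start + frame_len], dtype=float)
        if frame.size == 0:
            break
        x, y, _ = m["position"]
        src_az = np.arctan2(x, y)
        # Find the adjacent speaker pair whose azimuths bracket the source.
        k = int(np.searchsorted(az[order], src_az))
        lo, hi = order[(k - 1) % n_spk], order[k % n_spk]
        span = (az[hi] - az[lo]) % (2 * np.pi) or 2 * np.pi
        frac = ((src_az - az[lo]) % (2 * np.pi)) / span
        # Constant-power pan between the two bracketing speakers.
        out[lo, start:start + frame.size] += frame * np.cos(frac * np.pi / 2)
        out[hi, start:start + frame.size] += frame * np.sin(frac * np.pi / 2)
    return out
```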
The manner of converting a conventional stereo signal into a three-dimensional audio signal has been described in detail above with reference to Figs. 1 to 4, and will not be repeated here.
According to an embodiment of the present disclosure, an electronic device may be provided. Fig. 7 is a block diagram of an electronic device 700 according to an embodiment of the present disclosure. The electronic device 700 may include at least one memory 702 and at least one processor 701, the at least one memory 702 storing a set of computer-executable instructions that, when executed by the at least one processor 701, perform the audio generation method according to an embodiment of the present disclosure.
The processor 701 may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 701 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, or the like.
The memory 702, which is a storage medium, may include an operating system, a data storage module, a network communication module, a user interface module, a program for performing the audio generation method of the present disclosure, and a database.
The memory 702 may be integrated with the processor 701, for example, RAM or flash memory may be disposed within an integrated circuit microprocessor or the like. Further, memory 702 may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The memory 702 and the processor 701 may be operatively coupled or may communicate with each other, such as through I/O ports, network connections, etc., so that the processor 701 can read files stored in the memory 702.
In addition, the electronic device 700 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 700 may be connected to each other via a bus and/or a network.
Those skilled in the art will appreciate that the configuration shown in Fig. 7 is not intended to be limiting; the electronic device 700 may include more or fewer components than those shown, some components may be combined, or a different arrangement of components may be used.
According to an embodiment of the present disclosure, there may also be provided a computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform an audio generation method according to the present disclosure. Examples of the computer-readable storage medium herein include: read-only memory (ROM), random-access programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk memory, hard disk drive (HDD), solid-state drive (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an extreme digital (XD) card), magnetic tape, a floppy disk, a magneto-optical data storage device, an optical data storage device, a hard disk, a solid-state disk, and any other device configured to store and provide a computer program and any associated data, data files, and data structures to a processor or computer in a non-transitory manner such that the processor or computer can execute the computer program. The computer program in the computer-readable storage medium described above can be run in an environment deployed in a computer apparatus, such as a client, a host, a proxy device, a server, and the like. Further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an embodiment of the present disclosure, there may also be provided a computer program product, in which instructions are executable by a processor of a computer device to perform the above-described audio generation method.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (10)

1. A method of audio generation, the method comprising:
acquiring an audio signal to be processed;
obtaining a plurality of soundtrack signals for a plurality of sound sources by soundtrack separation of the audio signal to be processed;
determining user orientation information in a three-dimensional space, and generating three-dimensional space metadata corresponding to each of the plurality of soundtrack signals, wherein the three-dimensional space metadata includes orientation information and sound width information of a corresponding sound source in the three-dimensional space;
generating a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated plurality of audio track signals, and the three-dimensional spatial metadata corresponding to each of the audio track signals.
2. The method of claim 1, wherein generating three-dimensional spatial metadata corresponding to each soundtrack signal of the plurality of soundtrack signals comprises:
acquiring feature information of the audio signal to be processed, wherein the feature information includes at least one of beat information and structure information of the audio signal to be processed, and the structure information includes type information of each audio clip of the audio signal to be processed;
and respectively generating three-dimensional space metadata corresponding to each audio track signal based on the characteristic information.
3. The method according to claim 2, wherein generating three-dimensional spatial metadata corresponding to each of the audio track signals respectively based on the feature information comprises:
for a human voice signal in the plurality of audio track signals, according to the type of each audio clip in the structure information, determining position adjustment information of a sound source corresponding to the human voice signal relative to the user orientation information in a three-dimensional space, and determining three-dimensional space metadata of the human voice signal based on the position adjustment information.
4. The method according to claim 2, wherein generating three-dimensional spatial metadata corresponding to each of the audio track signals respectively based on the feature information comprises:
and for an instrument signal among the plurality of soundtrack signals, determining movement information of a sound source corresponding to the instrument signal in the three-dimensional space according to the beat information and the type of each audio clip in the structure information, and determining three-dimensional spatial metadata of the instrument signal based on the movement information.
5. The method of claim 1, wherein generating three-dimensional spatial metadata corresponding to each soundtrack signal of the plurality of soundtrack signals comprises:
determining a preset template corresponding to each audio track signal from a plurality of preset templates, wherein the preset template comprises at least one of movement track information, movement speed information and sound width change information of a corresponding sound source in a three-dimensional space;
and respectively generating three-dimensional space metadata for each audio track signal by using the determined preset template.
6. The method of claim 1, wherein generating three-dimensional spatial metadata corresponding to each soundtrack signal of the plurality of soundtrack signals comprises:
acquiring setting information input by a user, wherein the setting information comprises at least one of a moving track, a moving speed and a sound width variation value of each sound source corresponding to the plurality of sound track signals in a three-dimensional space;
generating three-dimensional spatial metadata for each of the track signals, respectively, based on the setting information.
7. An apparatus for audio generation, the apparatus comprising:
an acquisition module configured to acquire an audio signal to be processed;
a sound track separation module configured to obtain a plurality of sound track signals for a plurality of sound sources by performing sound track separation on the audio signal to be processed;
a metadata generation module configured to determine user orientation information in a three-dimensional space and generate three-dimensional space metadata corresponding to each of the plurality of soundtrack signals, wherein the three-dimensional space metadata includes orientation information and sound width information of a corresponding sound source in the three-dimensional space;
a rendering module configured to generate a three-dimensional audio signal corresponding to the audio signal to be processed based on the user orientation information, the separated plurality of audio track signals, and the three-dimensional spatial metadata corresponding to each of the audio track signals.
8. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the audio generation method of any of claims 1 to 6.
9. A computer-readable storage medium storing instructions that, when executed by at least one processor, cause the at least one processor to perform the audio generation method of any of claims 1 to 6.
10. A computer program product comprising instructions that, when executed by at least one processor of an electronic device, cause the at least one processor to perform the audio generation method of any one of claims 1 to 6.
CN202210448723.2A 2022-04-26 2022-04-26 Audio generation method and device, electronic equipment and storage medium Pending CN114827886A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210448723.2A CN114827886A (en) 2022-04-26 2022-04-26 Audio generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114827886A true CN114827886A (en) 2022-07-29

Family

ID=82508081

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210448723.2A Pending CN114827886A (en) 2022-04-26 2022-04-26 Audio generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114827886A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination