CN116312425A - Audio adjustment method, computer device and program product - Google Patents


Info

Publication number
CN116312425A
Authority
CN
China
Prior art keywords
audio
information
template
target
pitch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211090743.3A
Other languages
Chinese (zh)
Inventor
张超鹏
吴逸龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd filed Critical Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority to CN202211090743.3A priority Critical patent/CN116312425A/en
Publication of CN116312425A publication Critical patent/CN116312425A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0033 Recording/reproducing or transmission of music for electrophonic musical instruments
    • G10H1/02 Means for controlling the tone frequencies, e.g. attack or decay; means for producing special musical effects, e.g. vibratos or glissandos
    • G10H1/06 Circuits for establishing the harmonic content of tones, or other arrangements for changing the tone colour
    • G10H1/36 Accompaniment arrangements
    • G10H1/361 Recording/reproducing of accompaniment for use with an external source, e.g. karaoke systems
    • G10H1/40 Rhythm
    • G10H1/42 Rhythm comprising tone forming circuits

Abstract

The application relates to an audio adjustment method, a computer device and a computer program product. Human voice audio is recorded after an audio template is selected; the duration of the fundamental frequency information corresponding to the voice audio is adjusted according to the duration information of the audio portion corresponding to each text word in the audio template, yielding target fundamental frequency information matched to those durations; pitch adjustment is performed on the target fundamental frequency information according to the template pitch information; the adjusted target voice audio is determined based on the pitch-adjusted target fundamental frequency information; and the target voice audio is then fused with the template accompaniment to obtain the adjusted target audio. Compared with the traditional approach of manually clipping audio over multiple time periods, performing duration adjustment, pitch adjustment, fusion and related processing on the human voice audio based on an audio template improves the adjustment effect when producing guichu (ghost-style remix) audio.

Description

Audio adjustment method, computer device and program product
Technical Field
The present invention relates to the field of audio processing technology, and in particular, to an audio adjustment method, an apparatus, a computer device, a storage medium, and a computer program product.
Background
With the development of computer technology, audio can be listened to and processed on a variety of terminal devices, for example listening to and editing songs. Audio editing technology has in turn driven the growth of guichu (ghost-style remix) audio and video, in which sounds from source material (speech, sound effects) are clipped, spliced and re-tuned, then combined with a song accompaniment to form a complete audio/video work. When a user wants to produce guichu audio, the source audio therefore has to be adjusted. The current way of adjusting audio to generate guichu audio is to manually clip the audio over multiple time periods. However, adjusting audio through per-period clips can only achieve a simple repetition effect.
Therefore, current audio adjustment methods suffer from a limited adjustment effect.
Disclosure of Invention
In view of the foregoing, it is desirable to provide an audio adjustment method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve the adjustment effect.
In a first aspect, the present application provides an audio adjustment method, the method comprising:
selecting an audio template, and recording human voice audio corresponding to the audio template;
acquiring fundamental frequency information corresponding to the voice audio, and identifying the text and the timestamp corresponding to each text word in the voice audio;
determining the fundamental frequency information corresponding to each text word based on the fundamental frequency information and the timestamp corresponding to the text word;
adjusting the fundamental frequency duration of each text word in the voice audio based on the duration information of the audio portion corresponding to each text word in the audio template, to obtain target fundamental frequency information;
performing pitch adjustment on the target fundamental frequency information according to the pitch information corresponding to each text word of the audio template and the pitch variation trend of the audio template, and determining the adjusted target voice audio;
and acquiring the template accompaniment corresponding to the audio template, and fusing the target voice audio with the template accompaniment to obtain the adjusted target audio.
In one embodiment, the selecting an audio template and recording voice audio corresponding to the audio template includes:
displaying at least one candidate audio template;
receiving a selection instruction for the at least one candidate audio template, and determining the selected audio template;
and displaying the text corresponding to the audio template, and recording the voice audio input by the user based on that text.
In one embodiment, the identifying the text and the timestamp corresponding to each text word in the voice audio includes:
identifying the original text corresponding to the voice audio according to the fundamental frequency information;
modifying the original text according to the result of matching the original text against the text corresponding to the audio template, to obtain the text corresponding to the voice audio, in which each text word matches a text word in the text corresponding to the audio template;
and determining the timestamp of each text word in the voice audio according to the duration of the audio corresponding to that text word in the voice audio.
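As a rough illustration of this matching step (not the patent's actual algorithm), character-level alignment between the recognized text and the template text can be sketched with Python's `difflib`; `align_to_template` and its return convention are hypothetical:

```python
import difflib

def align_to_template(recognized: str, template: str):
    """For each character of the template text, return the index of the
    matching character in the ASR output, or None where the user
    mis-spoke or skipped a character (so it must be corrected or
    supplied from the template)."""
    matcher = difflib.SequenceMatcher(a=recognized, b=template)
    mapping = [None] * len(template)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            # these template characters were spoken correctly
            for k in range(i2 - i1):
                mapping[j1 + k] = i1 + k
    return mapping
```

Template characters mapped to `None` have no reliable timestamp of their own and would be filled in from the template or from neighbouring words.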
In one embodiment, the acquiring the fundamental frequency information corresponding to the voice audio includes:
and acquiring pitch information, tone information and sounding characteristic information corresponding to the voice audio as fundamental frequency information.
In one embodiment, the adjusting the fundamental frequency duration of each text word in the voice audio based on the duration information of the audio portion corresponding to each text word in the audio template, to obtain the target fundamental frequency information, includes:
acquiring the pitch information, tone information and sounding characteristic information of each text word in the voice audio and of the audio portion of the corresponding text word in the audio template, as the fundamental frequency information corresponding to each text word in the voice audio;
performing one-dimensional linear interpolation on the pitch information, tone information and sounding characteristic information of each text word of the voice audio according to the duration information of the audio portion corresponding to that text word in the audio template, so that the duration of the pitch information, the tone information and the sounding characteristic information each match the duration information of the corresponding audio portion;
and taking the interpolated target pitch information, target tone information and target sounding characteristic information as the target fundamental frequency information.
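The time-domain stretching described above amounts to one-dimensional linear interpolation of each frame-level track. A minimal pure-Python sketch (the real system would resample the pitch, tone and sounding-characteristic tracks the same way; `stretch` is a hypothetical helper):

```python
def stretch(track, target_len):
    """Linearly resample a per-frame feature track so its length
    matches the duration (in frames) of the corresponding template
    word."""
    n = len(track)
    if target_len == 1 or n == 1:
        return [track[0]] * target_len
    out = []
    for i in range(target_len):
        pos = i * (n - 1) / (target_len - 1)  # position in the source track
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(track[lo] * (1 - frac) + track[hi] * frac)
    return out
```

Both shortening (user sang too slowly) and lengthening (too quickly) reduce to the same resampling.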
In one embodiment, the adjusting the pitch of the target fundamental frequency information according to the pitch information corresponding to each text word of the audio template and the pitch variation trend of the audio template includes:
acquiring target pitch information in the target fundamental frequency information, and acquiring partial target pitch information corresponding to each text word in the voice audio in the target pitch information;
for each piece of partial target pitch information, acquiring a preset number of adjacent frames for each frame in that partial target pitch information;
obtaining average pitch information corresponding to the partial target pitch information of each frame according to the mean of that frame's partial target pitch information and the partial target pitch information of its adjacent frames, the average pitch information characterizing the pitch variation trend of the partial target pitch information of each frame;
and performing pitch adjustment on the partial target pitch information in the target audio features according to the average pitch information and the pitch information corresponding to each text word in the audio template.
In one embodiment, the audio template further comprises: a reference pitch and a reference frequency corresponding to the reference pitch; the obtaining the pitch information, tone information and sounding characteristic information corresponding to the voice audio comprises the following steps:
acquiring the text word in the text of the voice audio corresponding to each character of the audio template;
acquiring the partial voice audio corresponding to each text word in the voice audio;
for the partial voice audio corresponding to each text word, acquiring the partial fundamental frequency corresponding to that partial voice audio; determining a pitch offset value of the pitch information corresponding to the partial voice audio according to that partial fundamental frequency and the reference frequency, and determining the pitch information corresponding to the partial voice audio according to the reference pitch and the pitch offset value;
constructing an envelope matrix from the envelope vectors of a preset number of frequency points in the partial voice audio, to obtain the tone information;
and obtaining the sounding characteristic information from the aperiodicity information in a preset number of frequency bands of the partial voice audio.
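The pitch-offset computation can be sketched as a semitone offset relative to the template's reference frequency. The 12·log2 mapping and the example reference values (pitch 65 ≈ F4 ≈ 349.23 Hz) are assumptions consistent with the MIDI convention, not values taken from the patent:

```python
import math

def pitch_from_f0(f0_hz, ref_pitch=65.0, ref_freq=349.23):
    """Pitch of a voiced frame: the reference pitch plus the semitone
    offset of its fundamental frequency from the reference frequency."""
    offset = 12.0 * math.log2(f0_hz / ref_freq)  # pitch offset value
    return ref_pitch + offset
```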
In one embodiment, the determining the adjusted target voice audio includes:
determining a frequency adjustment value of the target pitch information after pitch adjustment according to the target pitch information after pitch adjustment and the reference pitch of the audio template;
according to the reference frequency of the audio template and the frequency adjustment value, determining an adjusted target frequency corresponding to the voice audio;
and determining the adjusted target voice audio according to the target frequency, the target tone information and the target sounding characteristic information.
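Conversely, the adjusted target frequency can be recovered from the adjusted pitch by inverting the same semitone mapping (again an assumed MIDI-style convention, with hypothetical reference values):

```python
def f0_from_pitch(pitch, ref_pitch=65.0, ref_freq=349.23):
    """Target frequency for an adjusted pitch value, relative to the
    template's reference pitch and reference frequency."""
    # frequency adjustment value: 2^(semitone offset / 12)
    return ref_freq * 2.0 ** ((pitch - ref_pitch) / 12.0)
```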
In one embodiment, the fusing the target voice audio and the template accompaniment to obtain the adjusted target audio includes:
acquiring a template beat of the template accompaniment;
matching the target voice audio with the template accompaniment according to the template beat;
and mixing the matched audio to obtain the adjusted target audio.
In one embodiment, the obtaining the template beat of the template accompaniment includes:
acquiring the audio energy value at each time point in the template accompaniment, and taking the time points whose audio energy value is greater than a preset energy threshold as the downbeat timestamps of the template accompaniment;
and determining the template beat of the template accompaniment from the plurality of downbeat timestamps.
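A minimal sketch of this energy-threshold beat estimate; the threshold, frame length and median-gap beat period are all assumed choices:

```python
def downbeat_times(energies, frame_s=0.05, threshold=0.5):
    """Time points whose frame energy exceeds the threshold are taken
    as downbeat timestamps of the accompaniment."""
    return [i * frame_s for i, e in enumerate(energies) if e > threshold]

def beat_period(times):
    """Template beat as the median gap between consecutive downbeats,
    which is robust to one or two missed/spurious detections."""
    gaps = sorted(b - a for a, b in zip(times, times[1:]))
    return gaps[len(gaps) // 2]
```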
In one embodiment, the mixing processing is performed on the matched audio to obtain the adjusted target audio, including:
according to the audio energy of the template accompaniment in the matched audio, adjusting the audio energy of the target voice audio in the matched audio so that the audio energy of the target voice audio is smaller than the audio energy of the template accompaniment;
and carrying out superposition and mixing processing on the target voice audio after the audio energy adjustment and the template accompaniment to obtain adjusted target audio.
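The energy-balanced mix can be sketched as scaling the vocal by RMS energy so it sits below the accompaniment before sample-wise superposition (the 0.8 ratio is an assumed value):

```python
def mix(vocal, accomp, vocal_ratio=0.8):
    """Scale the target vocal so its RMS energy is below the
    accompaniment's, then superpose the two signals sample-wise."""
    rms = lambda x: (sum(s * s for s in x) / len(x)) ** 0.5
    gain = vocal_ratio * rms(accomp) / max(rms(vocal), 1e-12)
    return [v * gain + a for v, a in zip(vocal, accomp)]
```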
In a second aspect, the present application provides an audio adjustment apparatus, the apparatus comprising:
the recording module is used for selecting an audio template and recording voice audio corresponding to the audio template;
the recognition module is used for acquiring the fundamental frequency information corresponding to the voice audio and recognizing the text and the time stamp corresponding to the text word in the voice audio;
the determining module is used for determining the fundamental frequency information corresponding to each text word based on the fundamental frequency information and the timestamp corresponding to the text word;
the matching module is used for adjusting the fundamental frequency duration of each text word in the voice audio based on the duration information of the audio portion corresponding to each text word in the audio template, to obtain target fundamental frequency information;
the adjusting module is used for adjusting the pitch of the target fundamental frequency information according to the pitch information corresponding to each text word of the audio template and the pitch change trend of the audio template, and determining adjusted target voice audio;
and the fusion module is used for acquiring the template accompaniment corresponding to the audio template, and carrying out fusion processing on the target voice audio and the template accompaniment to obtain the adjusted target audio.
In a third aspect, the present application provides a computer device comprising a memory storing a computer program and a processor implementing the steps of the method described above when the processor executes the computer program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of the method described above.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method described above.
According to the audio adjustment method, apparatus, computer device, storage medium and computer program product described above, voice audio is recorded after an audio template is selected; the fundamental frequency information corresponding to the voice audio is acquired and the text in the voice audio is identified; the fundamental frequency information corresponding to each text word is determined based on the fundamental frequency information and the timestamp corresponding to that text word; the fundamental frequency duration of each text word in the voice audio is adjusted based on the duration information of the audio portion corresponding to each text word in the audio template, giving the target fundamental frequency information; pitch adjustment is performed on the target fundamental frequency information according to the pitch information and pitch variation trend of each text word, and the adjusted target voice audio is determined; the target voice audio is then fused with the template accompaniment to obtain the adjusted target audio. Compared with the traditional approach of manually clipping audio over multiple time periods, performing duration adjustment, pitch adjustment, fusion and related processing on the human voice audio based on an audio template improves the adjustment effect when producing guichu (ghost-style remix) audio.
Drawings
FIG. 1 is a flow chart of an audio adjustment method according to an embodiment;
FIG. 2 is a flowchart illustrating a human voice audio adjustment process according to one embodiment;
FIG. 3 is a flow chart of an interpolation step in one embodiment;
FIG. 4 is a schematic diagram of a pitch interpolation step in one embodiment;
FIG. 5 is a schematic diagram illustrating interpolation steps of tone and voicing characteristics in an embodiment;
FIG. 6 is a schematic diagram of a pitch adjustment step in one embodiment;
FIG. 7 is a flow chart of an audio adjustment method according to another embodiment;
FIG. 8 is a flow chart of an audio adjustment method according to another embodiment;
FIG. 9 is a block diagram of an audio adjustment device in one embodiment;
fig. 10 is an internal structural view of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
In one embodiment, as shown in fig. 1, an audio adjustment method is provided. The method is described here as applied to a terminal; it is understood that the method may also be applied to a server, or to a system including the terminal and the server and implemented through interaction between them. The method includes the following steps:
Step S202, selecting an audio template and recording voice audio corresponding to the audio template.
The audio template may be a template used to perform audio adjustment on voice audio, and may be constructed in advance. The audio template includes the text corresponding to the template, and the text includes a plurality of text words. The audio template further includes template pitch information corresponding to each character in the text of the template, duration information of the audio portion corresponding to each text word in the text of the template, and the like. The template pitch information may include the pitch information of each character in the text of the audio template, and the pitch information of each character may be determined from the numbered musical notation corresponding to that character in the audio template. The voice audio may be speech recorded by the user through the terminal. For example, the user may tap the corresponding button in the terminal to open the guichu (ghost-style remix) recording function; the terminal then responds by displaying at least one candidate audio template. After the user selects a template, the terminal receives the selection instruction for the candidate audio templates, determines the selected audio template, and displays the text corresponding to it; the user can record the corresponding voice audio according to that text, and the terminal records the voice audio input by the user based on the text corresponding to the audio template.
Step S204, the fundamental frequency information corresponding to the voice audio is acquired, and the text and the timestamp corresponding to each text word in the voice audio are identified.
The voice audio can be the voice information recorded by the user and is the audio to be adjusted; the terminal can recognize the text information contained in it. That is, the terminal can perform speech recognition on the voice audio and identify the text in it. The voice audio may also contain non-speech noise, so the terminal can obtain the text in the voice audio by extracting the fundamental frequency information from it and performing recognition on that information.
In addition, in some embodiments, the user may input the voice audio first, and the terminal searches for a corresponding audio template based on it. For example, after the user inputs the voice audio, the terminal may query a template library and, on detecting an audio template in the library whose text corresponds to the voice audio, output that audio template as the query result. The text of the voice input by the user may match an audio template in the library either in both the number of characters and the specific pronunciation of each character, or in the number of characters alone; the terminal can decide whether to query templates by character count plus character form, or by character count only, according to the input.
In addition, the audio template may further include a template tone and a template beat. The template tone may be the reference tone of the audio template, for example one of "CDEFGAB", and each template tone may correspond to a reference pitch. The audio template can be a guichu (ghost-style remix) template composed mainly of a beat and a fundamental frequency sequence; a guichu work clips, splices and re-tunes sounds from source material (speech, sound effects) and combines them with a song accompaniment to obtain a complete audio/video work, which may be a pop song, electronic track, rap and so on. Each audio template may be regarded as the information of a preset song. Each preset song has a tone, and the pitch value corresponding to the text in the audio template may be the pitch value corresponding to the tone of that song; for example, the pitch value corresponding to tone F is 65. The template beat is the beat information of the preset song, the text of the audio template is the text of part or all of the lyrics of the preset song, and the duration information of the audio portion corresponding to each text word in the text of the audio template may be the duration of that audio portion in the original template audio. For example, the terminal may match the lyrics in the text of the audio template against the corresponding preset song and determine the duration of the audio portion corresponding to the text, thereby obtaining the duration information of the audio portion corresponding to each text word in the text of the audio template.
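The pitch value 65 for tone F mentioned above matches the standard MIDI numbering, where F4 is note 65; a tiny illustrative converter (natural notes only, octave 4 by default — an assumption for illustration):

```python
def note_to_midi(name, octave=4):
    """MIDI pitch number of a natural note, e.g. F4 -> 65, A4 -> 69."""
    semitones = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
    return 12 * (octave + 1) + semitones[name]
```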
When the terminal acquires the text in the voice audio, it can do so by recognizing each character in the voice audio. For example, after acquiring the voice audio, the terminal may first extract its fundamental frequency information and identify the original text corresponding to the voice audio from it. The original text is text whose correctness has not yet been checked: when inputting the voice audio the user may mis-speak or omit words, for example swapping two words or forgetting to say one, so the terminal can check the original text for correctness. For instance, the terminal may match the original text against the text corresponding to the audio template to obtain a matching result, then modify the original text according to that result, correcting wrong text words and supplementing missing ones, thereby obtaining the text corresponding to the voice audio, in which each text word matches a text word of the text corresponding to the audio template. In addition, the text corresponding to the voice audio may include a plurality of text words, each occupying a certain amount of audio in the voice audio; after identifying the text, the terminal can determine the timestamp of each text word according to the duration of its corresponding audio within the voice audio. The text of the voice audio thus comprises at least one text word and a timestamp for each of those text words, the timestamp being the start time and end time of that text word in the voice audio.
Specifically, the voice audio can be a recording made by the user that contains the lyric information sung by the user, and the terminal can recognize and segment the lyric information in it. The terminal can identify the start timestamp and end timestamp of each character in the voice audio input by the user through a speech recognition tool, and extract the fundamental frequency information in the user's voice audio using open-source tools such as pYIN or CREPE. Specifically, the terminal can determine the audio portion of each character in the voice audio based on its start and end timestamps, and then extract the fundamental frequency of each character's audio portion to obtain the fundamental frequency information corresponding to each character. When inputting the voice audio, the user can speak according to the text of the chosen audio template; for example, if the text of the audio template is the five-character phrase "good thinking about playing", the user can input speech containing that phrase as the voice audio, and the terminal can recognize the character timestamp of each of the five characters through the speech recognition tool, determine the audio portion corresponding to each character, and then extract the fundamental frequency information of each of those characters.
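Segmenting a frame-level fundamental-frequency track by the recognized per-character timestamps can be sketched as below; the 10 ms frame hop is an assumed value and `slice_f0` a hypothetical helper:

```python
def slice_f0(f0, char_stamps, frame_s=0.01):
    """Split the f0 track into per-character segments, one per
    (start_time, end_time) pair in seconds."""
    return [f0[round(s / frame_s): round(e / frame_s)]
            for s, e in char_stamps]
```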
The terminal can thus obtain, for the voice audio, the fundamental frequency information corresponding to each character together with the character timestamp of each piece of fundamental frequency information. After performing lyric recognition, segmentation and fundamental frequency extraction on the voice audio, the terminal obtains the melody curve of the user's dry vocal and the lyric timestamp positions; by acquiring the corresponding audio template, it can also determine the target melody curve and the target lyric positions. The target melody curve can be determined from the pitch information in the audio template, and the target lyric positions from the duration information of the audio portion corresponding to each text word in the text of the audio template. The terminal can then adjust the human voice audio input by the user to be consistent with the melody in the audio template, thereby realizing the audio adjustment for guichu (ghost-style remix) audio.
Step S206, the fundamental frequency information corresponding to each text word is determined based on the fundamental frequency information and the timestamp corresponding to the text word.
The fundamental frequency information here may be the overall fundamental frequency information of the voice audio; the voice audio has a corresponding text containing a plurality of text words, and the terminal can determine the fundamental frequency information corresponding to each text word based on the overall fundamental frequency information and the identified timestamp of each text word. For example, the terminal may take the fundamental frequency information of the corresponding time span as the fundamental frequency information of that text word. The terminal can then adjust the guichu audio by adjusting the fundamental frequency information of each text word.
Step S208, the fundamental frequency duration of each text word in the voice audio is adjusted based on the duration information of the audio portion corresponding to each text word in the audio template, to obtain the target fundamental frequency information.
The duration information may be the duration information of the audio portion corresponding to each text word in the text of the audio template, and may be determined based on the timestamp of each text word. Because the duration of the voice audio input by the user is not necessarily consistent with the duration information in the audio template, the terminal needs to adjust the duration of the relevant feature information of the voice audio. The terminal can acquire the duration information of the audio portion corresponding to the text in the audio template, and acquire the fundamental frequency information of the audio portion in the voice audio corresponding to each text word in the template text; the terminal can then adjust the fundamental frequency information of the audio portion corresponding to each text word in the voice audio according to that duration information, obtaining target fundamental frequency information whose duration matches the duration of the text of the audio template. That is, the terminal needs to stretch or compress the fundamental frequency information of the voice audio in the time domain.
The fundamental frequency information of the voice audio can comprise various kinds of characteristic information. For example, the terminal may acquire pitch information, tone information, and sounding characteristic information corresponding to the above-described voice audio as the fundamental frequency information, and the terminal can adjust the fundamental frequency duration based on these kinds of characteristic information to obtain the target fundamental frequency information. For example, in one embodiment, adjusting the fundamental frequency duration of the text words in the voice audio based on the duration information of the audio portion corresponding to each text word in the audio template, to obtain the target fundamental frequency information, includes: acquiring pitch information, tone information, and sounding characteristic information of each text word in the voice audio and of the audio portion of the corresponding text word in the audio template, as the fundamental frequency information corresponding to each text word in the voice audio; respectively performing one-dimensional linear interpolation processing on the pitch information, tone information, and sounding characteristic information of each text word of the voice audio according to the duration information of the audio portion corresponding to each text word in the audio template, so that the duration of the pitch information, the duration of the tone information, and the duration of the sounding characteristic information each match the duration information of the audio portion corresponding to each text word in the audio template; and taking the interpolated target pitch information, target tone information, and target sounding characteristic information as the target fundamental frequency information.
In this embodiment, the voice audio may include the fundamental frequency information and a character timestamp corresponding to each character in the fundamental frequency information. Fig. 2 is a flowchart of the audio adjustment step in one embodiment. The terminal may acquire, from the dry vocal segment, that is, the above-mentioned voice audio, the pitch information, tone information, and sounding characteristic information of the audio portion corresponding to each text word in the text of the audio template, as the fundamental frequency information. The pitch information may be the pitch of the user's voice; the tone information may be the timbre of the user's voice; and the sounding characteristic information may be information that is actually present in the user's voice but cannot be shown in the fundamental frequency. For example, when the user produces an unvoiced sound, no fundamental frequency for that sound appears in the fundamental frequency information, yet the sound still needs to be represented, so the terminal needs to acquire the sounding characteristics. Since the user exhibits sounding characteristics with every sound produced, the sounding characteristic information may exist throughout the entire voice audio.
The pitch information can be obtained from the fundamental frequency of the voice audio; that is, the terminal can determine the pitch information based on fundamental frequency extraction from the voice audio. The tone information can be obtained from an envelope matrix of the voice audio, and the sounding characteristic information can be obtained from the aperiodic information. After the terminal obtains the pitch information, tone information, and sounding characteristic information, it can adjust them according to the duration information of the audio portion corresponding to each text word in the text of the audio template, obtaining target pitch information, target tone information, and target sounding characteristic information matched with the duration information of the text of the audio template. The terminal may use the target pitch information, target tone information, and target sounding characteristic information as the target fundamental frequency information. The pitch information, tone information, and sounding characteristic information can each take the form of a sequence, and the terminal can perform time-domain expansion and contraction on them, thereby obtaining target pitch information, target tone information, and target sounding characteristic information matched with the duration information of the text of the audio template.
Specifically, the terminal may match the durations of the pitch information, tone information, and sounding characteristic information corresponding to the voice audio with the duration information of the audio template. The terminal can respectively perform one-dimensional linear interpolation processing on the pitch information, tone information, and sounding characteristic information according to the duration information of the audio portion corresponding to each text word in the text of the audio template, so that the duration of the pitch information, the duration of the tone information, and the duration of the sounding characteristic information each match the duration information of the audio portion corresponding to each text word in the text of the audio template. That is, the terminal needs to perform expansion and contraction in the time domain on the pitch information, tone information, and sounding characteristic information, after which it can use the interpolated target pitch information, target tone information, and target sounding characteristic information as the target fundamental frequency information. Specifically, as shown in fig. 2, the terminal may perform vector interpolation on the pitch information corresponding to the above fundamental frequency information based on the duration information of the audio portion corresponding to each text word in the text of the audio template, and take the interpolated pitch information sequence as the target pitch information. The interpolated pitch sequence may be expressed as: c(m) = interp1(c(n)), where c(m) represents the pitch information sequence of length M after interpolation processing, m = 0, 1, 2, …, M−1; interp1() represents one-dimensional interpolation processing, and the interpolation curve may be spline, cubic, linear, or the like.
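A minimal sketch of the one-dimensional linear interpolation c(m) = interp1(c(n)) using NumPy (function and variable names are illustrative; a spline or cubic interpolation curve would require e.g. scipy.interpolate instead of np.interp):

```python
import numpy as np

def stretch_sequence(c, target_len):
    """Linearly interpolate a length-N sequence onto target_len samples,
    i.e. time-stretch c(n) to c(m) with a linear interpolation curve."""
    n = np.arange(len(c))                       # original frame indices 0..N-1
    m = np.linspace(0, len(c) - 1, target_len)  # resampled (fractional) indices
    return np.interp(m, n, c)

c = np.array([60.0, 62.0, 64.0])   # toy 3-frame pitch sequence in semitones
c_m = stretch_sequence(c, 5)       # stretch to 5 frames
```

The same routine covers both expansion (target length larger than N) and contraction (target length smaller than N).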
As shown in fig. 2, the terminal needs to perform matrix interpolation on the tone information and the sounding characteristic information, but only needs to expand and contract them in the time domain, so the terminal can obtain the matrix interpolation result by linear interpolation of the signal in each dimension. Specifically, the terminal may express the interpolation result of the tone information by the following formula: e′(m, k0) = interp1(e(n, k0)), where e′(m, k0) represents the interpolation result of the envelope signal at the k0-th frequency point of the m-th output frame. Specifically, as shown in fig. 3, fig. 3 is a schematic flow chart of the interpolation step in one embodiment. Through the interpolation processing, the duration of the tone information can be changed from duration N to duration M, realizing time expansion and contraction so that the tone information matches the duration in the audio template. The sounding characteristic interpolation may use the formula: a′(m, i0) = interp1(a(n, i0)), where a′(m, i0) represents the interpolated sounding characteristic information sequence on the i0-th frequency band of the m-th output frame. By performing linear interpolation processing on each kind of fundamental frequency information, the terminal obtains target fundamental frequency information matched with the duration of the audio template, so that the terminal can perform audio adjustment based on the target fundamental frequency information, improving the adjustment effect when adjusting guichu audio.
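The per-dimension matrix interpolation e′(m, k0) = interp1(e(n, k0)) can be sketched as follows, stretching each frequency bin's time series independently (all names are illustrative):

```python
import numpy as np

def stretch_matrix(E, target_len):
    """Stretch an (N, K) envelope matrix to (target_len, K) by linear
    interpolation along the time axis, one frequency bin at a time."""
    n_frames, n_bins = E.shape
    n = np.arange(n_frames)
    m = np.linspace(0, n_frames - 1, target_len)
    out = np.empty((target_len, n_bins))
    for k in range(n_bins):
        # interp1 applied to the k-th column: only the time axis changes
        out[:, k] = np.interp(m, n, E[:, k])
    return out

E = np.array([[1.0, 10.0],
              [3.0, 30.0]])        # toy envelope: 2 frames, 2 frequency bins
E2 = stretch_matrix(E, 3)          # stretched to 3 frames, bins untouched
```

The same function applies unchanged to the band-aperiodicity matrix a(n, i), since both only need time-axis expansion and contraction.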
Specifically, as shown in fig. 4, fig. 4 is a schematic diagram illustrating the pitch interpolation step in one embodiment. Taking an audio template whose text is "good thinking to play" as an example, fig. 4 may show a comparison of the pitch curves of the "good" word before and after interpolation. The abscissa is time information and the ordinate is the pitch value. Curve 401 is the original pitch of the fundamental frequency information corresponding to the "good" word before interpolation, within the fundamental frequency information corresponding to the voice audio; its duration does not match the duration of the corresponding character in the text of the audio template, and the pitch curve 403, which matches the duration of the corresponding character in the text of the audio template, can be obtained after interpolation processing.
In addition, the terminal can also perform interpolation processing in the time domain on the tone information and the sounding characteristic information in the fundamental frequency information. FIG. 5 is a schematic diagram of the interpolation steps for the tone and sounding characteristics in one embodiment, and may show a comparison of the tone and sounding characteristics of the "good" word before and after interpolation. In each coordinate axis in fig. 5, the abscissa represents time, the ordinate of the tone characteristic represents frequency information, and the ordinate of the sounding characteristic represents frequency band information. 601 is the original tone characteristic corresponding to the "good" word in the above fundamental frequency information; its duration does not match the duration of the corresponding character in the text of the audio template, and after interpolation processing, which expands and contracts only in the time domain, 603 is obtained, whose duration matches the duration of the corresponding character in the text of the audio template. 605 may be the original sounding characteristic corresponding to the "good" word in the above fundamental frequency information; likewise its duration does not match the duration of the corresponding character, and after interpolation processing, again expanding and contracting only in the time domain, 607 is obtained, whose duration matches the duration of the corresponding character in the text of the audio template.
Step S210, pitch adjustment is performed on the target fundamental frequency information according to the pitch information corresponding to each text word of the audio template and the pitch variation trend of the audio template, and the adjusted target voice audio is determined.
The pitch information in the audio template may be the pitch information of the text in the audio template that corresponds to the voice audio. The text in the audio template may include at least one character, each character may be a text word, and the text may take the form of a sequence containing at least one character; the pitch information of the audio template may then be a pitch sequence containing the pitch information corresponding to each character in the text of the audio template. The pitch information of each character may be determined according to the pitch information of the audio portion corresponding to that character in the preset song corresponding to the audio template. The pitch information corresponding to each character in the pitch information of the audio template can be used to determine the pitch level of the audio portion corresponding to each text word in the user's voice audio; that is, each item of the pitch information of the above audio template can be regarded as a target value. The pitch information of the audio template is also referred to as template pitch information. The terminal can adjust the pitch of the target feature information according to the pitch information corresponding to each character in the template pitch information and the pitch variation trend of the pitch information, obtaining pitch-adjusted target fundamental frequency information. The pitch variation trend may be the pattern of increase or decrease between the items of pitch information in the above audio template. The terminal may then determine the adjusted target voice audio based on the pitch-adjusted target fundamental frequency information.
For example, the terminal may combine the various kinds of fundamental frequency information in the pitch-adjusted target fundamental frequency information, such as the pitch, tone, and sounding characteristics, to obtain the adjusted target voice audio. The target voice audio may be audio information containing only the user's voice, obtained by adjusting the pronunciation of each character in the user's audio based on the pitch levels of the audio template.
Specifically, the pitch adjustment may be performed on the fundamental frequency information corresponding to each character of the text of the audio template, and the audio template may be a guichu (remix) template. That is, the terminal may apply a word-by-word pitch shift to the fundamental frequency information according to the target melody curve formed by the template pitch information in the audio template, and re-splice the pitch-shifted lyrics to obtain the target voice audio. The terminal can segment the lyrics of the voice audio to obtain the voice audio corresponding to each character, and, based on the preset song corresponding to the audio template and the pitch information of the audio portion corresponding to each text word in the text of the audio template, perform word-by-word pitch adjustment on the voice audio, including time-stretching and pitch-shifting. The terminal may implement the above pitch adjustment by a vocoder method. The time-stretching and pitch-shifting processing can be realized by signal processing means such as TSM (time-scale modification) and the WORLD vocoder, and can also be realized by neural network means such as WaveNet and LPCNet. WaveNet is a sequence generation model that can be used to model speech generation; LPCNet combines digital signal processing with a neural network for vocoder duty in speech synthesis, and can synthesize high-quality speech in real time on an ordinary CPU.
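As a greatly simplified, hypothetical stand-in for the word-by-word pitch shift (a real system would resynthesize audio with a vocoder such as WORLD, or use a TSM algorithm; here we only move pitch contours, not audio), each word's pitch contour can be shifted so its mean lands on the template's target pitch, and the words then spliced back together:

```python
import numpy as np

def word_by_word_shift(word_pitches, template_pitches):
    """Shift each word's pitch sequence (in semitones) so that its mean
    lands on the template's target pitch for that word, then splice the
    shifted words back into one sequence."""
    out = []
    for c, target in zip(word_pitches, template_pitches):
        c = np.asarray(c, dtype=float)
        out.append(c + (target - c.mean()))  # per-word constant pitch shift
    return np.concatenate(out)

# two toy words with two frames each, and two template target pitches
shifted = word_by_word_shift([[60, 61], [65, 67]], [69, 64])
```

A constant per-word shift preserves each word's internal pitch contour while moving it onto the template melody, which mirrors the "word-by-word variable pitch" idea above.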
Step S212, the template accompaniment corresponding to the audio template is obtained, and the target voice audio and the template accompaniment are fused to obtain the adjusted target audio.
The audio template has a corresponding preset song, and the preset song has corresponding accompaniment information, which can be used as the template accompaniment of the audio template. The accompaniment may be percussion-based: for example, it may include, but is not limited to, relatively low sounds such as kick drums and table knocks, which may serve as downbeat information, and relatively crisp sounds such as snare drums, claps, and clappers, which may serve as upbeat information. In addition, in some embodiments, the terminal may also guide the user to record instrument sounds as the template accompaniment information, and the accompaniment recorded by the user may likewise be percussion-based.
After the terminal obtains the template accompaniment corresponding to the audio template, it can fuse the target voice audio with the template accompaniment to obtain the adjusted target audio. The target voice audio may be the user's voice audio after processing that includes time-stretching and pitch-shifting, and when the terminal fuses the target voice audio with the template accompaniment, operations such as beat matching and mixing may be included. The terminal can thus obtain the adjusted target audio after performing beat matching against the template beats and mixing on the target voice audio. Specifically, since the target voice audio may consist of the voice audio corresponding to each character, the terminal also needs to splice the target voice audio of the individual characters into target voice audio containing the complete text, so that the fusion processing can be performed on the target voice audio of the complete text.
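A minimal sketch of the fusion step, assuming plain gain-and-sum mixing of two mono signals in the range [-1, 1] (real beat matching and mixing are considerably more involved; all names here are illustrative):

```python
import numpy as np

def mix(vocal, accomp, vocal_gain=1.0, accomp_gain=1.0):
    """Mix two mono signals of possibly different lengths: the shorter one
    is implicitly zero-padded, and the sum is clipped to [-1, 1]."""
    n = max(len(vocal), len(accomp))
    out = np.zeros(n)
    out[:len(vocal)] += vocal_gain * np.asarray(vocal, dtype=float)
    out[:len(accomp)] += accomp_gain * np.asarray(accomp, dtype=float)
    return np.clip(out, -1.0, 1.0)   # hard clip; a limiter would be gentler

mixed = mix([0.5, 0.5, 0.9], [0.2, 0.6])
```

Hard clipping is used only to keep the sketch short; a production mixer would normalize gains or apply a limiter instead.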
According to the above audio adjustment method, the voice audio is recorded after the audio template is selected; the fundamental frequency information corresponding to the voice audio is obtained and the text in the voice audio is identified; the fundamental frequency information corresponding to each text word is determined based on the fundamental frequency and the timestamp corresponding to that text word; the fundamental frequency duration of the text words in the voice audio is adjusted based on the duration information of the audio portion corresponding to each text word in the audio template, obtaining the target fundamental frequency information; pitch adjustment is performed on the target fundamental frequency information according to the pitch information and variation trend of each text word, and the adjusted target voice audio is determined; and the target voice audio is then fused with the template accompaniment to obtain the adjusted target audio. Compared with the traditional approach of manually editing many short time segments of audio, performing processing that includes duration adjustment, pitch adjustment, and fusion on the voice audio based on the audio template improves the adjustment effect when adjusting guichu audio.
In one embodiment, obtaining the pitch information, tone information, and sounding characteristic information corresponding to the voice audio includes: acquiring, in the text contained in the voice audio, the text word corresponding to each character of the audio template; acquiring the partial voice audio corresponding to each text word in the voice audio; for the partial voice audio corresponding to each text word, acquiring the partial fundamental frequency corresponding to that partial voice audio; determining a pitch offset value of the pitch information corresponding to the partial voice audio according to the partial fundamental frequency and the reference frequency, and determining the pitch information corresponding to the partial voice audio according to the reference pitch and the pitch offset value; constructing an envelope matrix from the envelope vectors of a preset number of frequency points in the partial voice audio to obtain the tone information; and obtaining the sounding characteristic information from the aperiodic information in a preset number of frequency bands of the partial voice audio.
In this embodiment, the terminal may obtain the pitch information, tone information, and sounding characteristic information in the voice audio as the fundamental frequency information. The terminal can acquire the different kinds of characteristic information in the fundamental frequency information from the voice audio in different ways, and can acquire the fundamental frequency information from the partial voice audio corresponding to each character in the voice audio; the partial voice audio may be the voice audio corresponding to one text word. The audio template may include a reference pitch and a reference frequency corresponding to the reference pitch, and the terminal may first acquire the reference pitch in the audio template and the reference frequency corresponding to it. The reference pitch may be the pitch corresponding to the key of the audio template, for example 65 for key F and 69 for key A. The reference pitch may be stored in the audio template in advance, or the terminal may obtain the reference pitch of the audio template by querying a preset key-pitch mapping table. The terminal may also obtain the reference frequency corresponding to the reference pitch. The reference frequency may be the frequency of the reference pitch of the key, and the terminal may obtain it through a preset pitch-frequency mapping table, which may include the correspondence between a plurality of pitches and frequencies. For example, key A corresponds to a reference pitch of 69, and reference pitch 69 corresponds to a reference frequency of 440 Hz.
The partial voice audio may be the voice audio corresponding to each text word in the voice audio, where a text word of the voice audio is the text corresponding to a character of the text of the audio template within the text contained in the voice audio. For the partial voice audio of the text word corresponding to each character, the terminal may acquire the fundamental frequency corresponding to that partial voice audio, called the partial fundamental frequency, which may be frequency information. To better match human auditory perception, the terminal may convert the frequency information into pitch information; for example, the terminal may determine the pitch offset value of the pitch information corresponding to the partial voice audio according to the partial fundamental frequency and the reference frequency, and determine the pitch information corresponding to the voice audio according to the reference pitch and the pitch offset value. Specifically, the terminal may extract the fundamental frequency of the voice audio to obtain the dry-vocal fundamental frequency information f(n), where n denotes the frame index and each frame corresponds to a preset duration, for example 5 ms. If the number of signal frames of the current voice audio is N, the sequence length of the voice audio is N × 5 ms, and n ranges over 0, 1, 2, …, N−1. The pitch information may be denoted c(n), and the terminal may determine the pitch information corresponding to the fundamental frequency information of the voice audio based on the following formula:
c(n) = 69 + 12 · log2(f(n) / 440)

This formula takes the template key A as an example: the reference pitch corresponding to key A is 69, and the corresponding reference frequency is 440 Hz. In the formula, the term 12 · log2(f(n) / 440) represents the above pitch offset value, where 12 is the number of semitones in an octave. The pitch information may be in units of semitones; one semitone corresponds to 100 cents, and one octave is 1200 cents.
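The frequency-to-pitch conversion (reference pitch 69 at 440 Hz, 12 semitones per octave) can be sketched directly; the function names are illustrative:

```python
import math

def hz_to_pitch(f, ref_pitch=69, ref_freq=440.0):
    """Convert a fundamental frequency in Hz to a (fractional) MIDI-style
    pitch: c = ref_pitch + 12 * log2(f / ref_freq)."""
    return ref_pitch + 12 * math.log2(f / ref_freq)

def pitch_to_hz(p, ref_pitch=69, ref_freq=440.0):
    """Inverse mapping, as a pitch-frequency table would encode it:
    f = ref_freq * 2 ** ((p - ref_pitch) / 12)."""
    return ref_freq * 2.0 ** ((p - ref_pitch) / 12.0)

p = hz_to_pitch(440.0)   # the reference frequency itself -> pitch 69
q = hz_to_pitch(880.0)   # one octave up -> pitch 81
```

Multiplying the fractional part of a pitch by 100 gives the offset in cents.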
The terminal can also construct a matrix from the envelope vectors of a preset number of frequency points in the partial voice audio to obtain the tone information; that is, the terminal can obtain the user's tone information through the envelope matrix. The partial voice audio can comprise a plurality of frequency points, each representing one frequency, and the envelope matrix can be constructed from the envelope vectors in the spectrogram of the user's voice audio. Specifically, the terminal may denote by e(n, k) the envelope value at the k-th frequency point of the n-th frame signal of the above voice audio, where k = 0, 1, 2, …, K, and K represents the spectral width, with common values such as 1024, 2048, and 4096.
The terminal can obtain the sounding characteristic information from the aperiodic information in a preset number of frequency bands of the partial voice audio. The partial voice audio can have a preset number of frequency bands, each band being an interval of frequencies within a preset range and containing a plurality of frequency points. As described above, the sounding characteristic information captures sound that the user actually produces but that cannot be shown in the fundamental frequency: for an unvoiced sound, for example, no fundamental frequency appears in the fundamental frequency information, yet the sound still needs to be represented, so the terminal needs to acquire the sounding characteristics, which may exist throughout the entire voice audio. For voiced speech in which fundamental frequency information exists, the signal is usually periodic, whereas for sounds that carry sounding characteristic information but no fundamental frequency, the signal is aperiodic, so the terminal may describe the sounding characteristics by the aperiodic component. Specifically, the terminal may denote by a(n, i) the aperiodic information on the i-th frequency band of the n-th frame signal, where i = 0, 1, …, I−1 and I represents the number of frequency bands, usually 4 or 5; that is, the terminal may divide the voice audio into I frequency bands.
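As a hypothetical illustration of collapsing per-bin aperiodicity into the coarse per-band form a(n, i) (a real vocoder such as WORLD's D4C estimates band aperiodicity differently; simple per-band averaging is only a stand-in here):

```python
import numpy as np

def band_aperiodicity(ap_bins, n_bands=4):
    """Collapse per-frequency-bin aperiodicity of shape (frames, bins)
    into a coarse (frames, n_bands) matrix by averaging each of
    n_bands equal-width bands."""
    frames, bins_ = ap_bins.shape
    edges = np.linspace(0, bins_, n_bands + 1).astype(int)  # band boundaries
    return np.stack(
        [ap_bins[:, edges[i]:edges[i + 1]].mean(axis=1) for i in range(n_bands)],
        axis=1,
    )

ap = np.ones((2, 8))            # 2 frames, 8 bins, fully aperiodic everywhere
a = band_aperiodicity(ap, 4)    # coarse form a(n, i) with I = 4 bands
```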
After the terminal obtains the pitch information, tone information, and sounding characteristic information, it has the fundamental frequency information corresponding to the voice audio, which may specifically be represented in the following form: P_ana = [f(n), e(n, k), a(n, i)]. The terminal can perform audio adjustment on this fundamental frequency information through the vocoder, and can reconstruct the adjusted audio based on the adjusted voice audio, thereby realizing adjustment of the user's voice audio.
Through this embodiment, the terminal can obtain multiple kinds of fundamental frequency information for audio adjustment from the different characteristic information in the voice audio, so that performing audio adjustment based on these multiple kinds of fundamental frequency information can improve the adjustment effect.
In one embodiment, performing pitch adjustment on the target fundamental frequency information according to the pitch information corresponding to each text word of the audio template and the pitch variation trend of the audio template includes: acquiring the target pitch information in the target fundamental frequency information, and acquiring, within the target pitch information, the partial target pitch information corresponding to each text word in the voice audio; for each piece of partial target pitch information, acquiring a preset number of adjacent frames adjacent to each frame in the partial target pitch information; obtaining the average pitch information corresponding to the partial target pitch information of each frame from the mean of the partial target pitch information of that frame and the adjacent partial target pitch information of its adjacent frames, where the average pitch information characterizes the pitch variation trend of the partial target pitch information of each frame; and performing pitch adjustment on the partial target pitch information in the target audio feature according to the average pitch information and the pitch information corresponding to each text word in the audio template.
In this embodiment, after the terminal performs interpolation processing on the fundamental frequency information to obtain the target fundamental frequency information matched with the duration of the text in the audio template, pitch adjustment may be performed on the target fundamental frequency information. The target fundamental frequency information includes the target pitch information, target tone information, and target sounding characteristic information, and the pitch adjustment may adjust the target pitch information within it. The terminal can adjust the target pitch information in the target fundamental frequency information according to the pitch information in the audio template. The pitch information of the audio template may be sequence information containing the pitch information corresponding to each text word in the text of the audio template, and may serve as the target value for the target pitch information. The terminal may obtain the pitch information in the audio template corresponding to the character represented by the target fundamental frequency information, and perform pitch adjustment on that character's target pitch information based on the pitch information in the audio template and the pitch variation trend of the audio template.
Specifically, as shown in fig. 2, the pitch adjustment of the target pitch information in the target fundamental frequency information includes dry-vocal pitch trend estimation and calculation of the target pitch, where the target pitch can be determined based on the pitch in the audio template. The dry-vocal pitch trend characterizes the variation of the target pitch information over its time span, and through it the terminal can obtain a smoother pitch information sequence. The voice audio may include the voice audio corresponding to a plurality of characters, and the terminal may perform audio adjustment character by character, so that each character corresponds to its own target fundamental frequency information. For each character in the text of the audio template, the terminal may obtain the target pitch information in the target fundamental frequency information, and obtain the partial target pitch information corresponding to the corresponding text word in the voice audio; here a text word in the voice audio is the text in the voice audio corresponding to a character in the text of the audio template. For each piece of partial target pitch information, the terminal may acquire a preset number of adjacent frames adjacent to each frame in the partial target pitch information, and take the target pitch information of these adjacent frames as the adjacent partial target pitch information. Having acquired the adjacent partial target pitch information of each frame, the terminal can obtain the average pitch information corresponding to the partial target pitch information of each frame from the mean of the partial target pitch information of that frame and the adjacent partial target pitch information of its adjacent frames.
The average pitch information represents the pitch variation trend of the partial target pitch information of each frame, and based on the average pitch information of the frames the terminal can obtain a smoother pitch variation curve for the character. After obtaining the average pitch information, the terminal can adjust the pitch of the portion of the pitch information corresponding to the character in the target audio feature, based on the average pitch information and the pitch information of the audio template.
Specifically, the terminal may first obtain the smoothed pitch information from the target pitch information in the target fundamental frequency information described above. The terminal may estimate the average pitch information using mean smoothing; the calculation formula (reconstructed here from the surrounding description, as the original equation image is unavailable) is:

$$\bar{c}(n)=\frac{1}{2L+1}\sum_{l=-L}^{L}c(n+l)$$

wherein $\bar{c}(n)$ represents the smoothed pitch information sequence corresponding to each character; $c(n+l)$ represents the adjacent target pitch information of the n-th frame among the target pitch information of each character; and $L=0.3\times N$, where 0.3 is an empirical value that can be set according to practical situations, and $N$ is the number of frames of the target pitch information. The terminal can adjust the pitch of the pitch information portion corresponding to the character in the target audio feature according to the average pitch information and the pitch information of the audio template; the adjustment formula (likewise reconstructed) is:

$$\hat{c}(n)=c(n)-\bar{c}(n)+c_{ref}(n)$$

wherein $c(n)$ represents the pitch information sequence corresponding to each character in the voice audio, $c_{ref}(n)$ represents the template pitch information sequence of the audio template, and $\hat{c}(n)$ represents the resulting pitch-adjusted target pitch information sequence. That is, the terminal can determine the offset of the pitch information corresponding to each character based on the above formula, and then determine the specific pitch of the adjusted pitch information from the template pitch information corresponding to each character and the offset, thereby realizing the pitch adjustment of the pitch information of each character.
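The mean-smoothing and offset-based adjustment described above can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation: the function names and the edge-clamping of the averaging window are assumptions.

```python
def smooth_pitch(c, ratio=0.3):
    """Estimate the pitch trend of one character: for each frame n, average
    the frames within +/-L of n, with L = ratio * N (0.3 is the empirical
    value given in the text).  The window is clamped at the sequence edges
    (an assumption; the text does not specify boundary handling)."""
    n_frames = len(c)
    half = int(ratio * n_frames)
    trend = []
    for n in range(n_frames):
        lo, hi = max(0, n - half), min(n_frames, n + half + 1)
        window = c[lo:hi]
        trend.append(sum(window) / len(window))
    return trend

def adjust_pitch(c, c_ref):
    """Keep each frame's offset from its own smoothed trend, but move the
    trend onto the template pitch: c_hat(n) = c(n) - trend(n) + c_ref(n)."""
    trend = smooth_pitch(c)
    return [cn - tn + rn for cn, tn, rn in zip(c, trend, c_ref)]
```

For a flat vocal pitch of 60 semitones against a flat template pitch of 64, the output is simply shifted to 64, while frame-to-frame fluctuations around the vocal's own trend would be preserved.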
Specifically, as shown in fig. 6, fig. 6 is a schematic diagram illustrating the pitch adjustment step in one embodiment. In fig. 6, the abscissa indicates time and the ordinate indicates pitch value. Taking an audio template whose text is "good to play" as an example, fig. 6 shows the pitch sequence of the fundamental frequency information corresponding to the word "good" before adjustment. In fig. 6, curve 701 is the pitch change corresponding to the pronunciation of the word "good" in the fundamental frequency information before adjustment; the terminal performs pitch trend estimation on each pitch of the word "good" to obtain curve 703. Curve 705 is the pitch information of the word "good" in the audio template. The terminal may adjust the pitch of the fundamental frequency information of the voice audio according to the above method, for example adjusting the fundamental frequency pitch corresponding to the word "good", to obtain curve 707, i.e., the pitch curve after pitch adjustment. In this way the adjusted pitch curve follows the pitch in the audio template as closely as possible without damaging the original sound characteristics of the voice audio.
After the terminal adjusts the target pitch information of each character, the target pitch information after pitch adjustment can be obtained, so that the terminal can obtain the target fundamental frequency information after pitch adjustment according to the target pitch information after pitch adjustment, the target tone information and the target sounding characteristic information.
Through this embodiment, by performing pitch adjustment, including pitch trend estimation, on the target pitch information in the target fundamental frequency information, the terminal can obtain target pitch information that matches the pitch of each character in the audio template, thereby improving the adjustment effect when performing guichu (remix-meme) audio adjustment.
In one embodiment, determining the adjusted target human voice audio includes: determining a frequency adjustment value of the target pitch information after pitch adjustment according to the target pitch information after pitch adjustment and the reference pitch of the audio template; according to the reference frequency and the frequency adjustment value of the audio template, the adjusted target frequency corresponding to the voice audio is determined; and determining the adjusted target voice audio according to the target frequency, the target tone information and the target sounding characteristic information.
In this embodiment, to better match human auditory perception, the terminal may convert the frequency information of the fundamental frequency into pitch information for adjustment; after the terminal finishes the pitch adjustment, it may restore the pitch-adjusted target pitch information to frequency information, so that it can determine the specific sound effect in the human voice audio based on the frequency information. For example, the terminal may determine a frequency adjustment value of the pitch-adjusted target pitch information according to the pitch-adjusted target pitch information and the reference pitch of the audio template; the frequency adjustment value determines the offset of the frequency information corresponding to each target pitch value from the reference frequency of the tone. The terminal can determine the adjusted target frequency corresponding to the voice audio according to the reference frequency of the audio template and the frequency adjustment value. The terminal can determine the target frequency of the voice audio corresponding to each character, and then splice the target frequencies of the characters to obtain the target frequency sequence of the whole voice audio. The terminal can determine the adjusted target voice audio according to the target frequency, the target tone information and the target sounding characteristic information. Specifically, taking the reference tone A, reference pitch 69 and reference frequency 440 Hz as an example, the target frequency formula (reconstructed here from the surrounding description, as the original equation image is unavailable) may be:

$$f^{*}(n)=440\times 2^{\frac{\hat{c}(n)-69}{12}}$$

wherein $\hat{c}(n)$ represents the pitch-adjusted target pitch information sequence, $f^{*}(n)$ represents the target frequency, and 12 represents the number of semitones in an octave. After the terminal obtains the target frequency through this formula, the adjusted target voice audio can be determined from the target frequency, the target tone information and the target sounding characteristic information together. Specifically, the target parameter sequence constituting the target human voice audio may be $P_{ana}=[f^{*}(n)\;\, e^{*}(m,k)\;\, a^{*}(m,i)]$. The terminal can obtain the target melody corresponding to each character in the audio adjustment through synthesis processing of this parameter sequence.
Through this embodiment, the terminal obtains the target voice audio by converting the pitch information back to frequency and combining the converted target frequency, the interpolated target tone information and the target sounding characteristic information, improving the adjustment effect of guichu (remix-meme) audio adjustment.
In one embodiment, the fusing processing is performed on the target voice audio and the template accompaniment to obtain the adjusted target audio, which includes: acquiring a template beat of a template accompaniment; matching the target voice audio with the template accompaniment according to the template beat; and mixing the matched audio to obtain the adjusted target audio.
In this embodiment, after obtaining the adjusted target voice audio, the terminal may perform fusion processing on the target voice audio and the template accompaniment, where the fusion processing includes processing such as beat matching and mixing. The template accompaniment can be a musical instrument accompaniment, the terminal can acquire the template beat of the template accompaniment and match the target voice audio with the template accompaniment according to the template beat to obtain matched target voice audio and template accompaniment, and the terminal can also mix the matched audio to obtain adjusted target audio.
The target voice audio may be obtained by splicing the adjusted target voice audio corresponding to each character based on the character timestamps. When performing beat matching, the terminal may match by downbeat. For example, in one embodiment, obtaining the template beat of the template accompaniment includes: acquiring the audio energy value at each time point in the template accompaniment, and taking the time points whose audio energy value is greater than a preset energy threshold as the downbeat timestamps of the template accompaniment; and determining the template beat of the template accompaniment according to the plurality of downbeat timestamps. In this embodiment, the terminal may obtain the audio energy value at each time point in the template accompaniment and take the time points whose audio energy exceeds the preset energy threshold as the downbeat timestamps of the template accompaniment, i.e., these time points are the downbeats of the template accompaniment, and the terminal may obtain the template beat of the template accompaniment from the multiple downbeat timestamps. Specifically, the template accompaniment may be a percussion track tapped out by the user; the terminal may estimate the downbeat timestamps from the energy-maximum positions, may further estimate the timestamp position of each beat using information such as the time signature and bar lines in the template, and may align the target voice audio to the time positions of the corresponding downbeats. The terminal may also fill beat positions where no vocal is present with the user's taps.
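The energy-threshold downbeat estimation described above can be sketched as follows (illustrative only; the names and the per-time-point energy representation are assumptions):

```python
def downbeat_timestamps(energies, times, threshold):
    """Treat every time point whose audio energy exceeds the preset
    threshold as a downbeat of the template accompaniment, and return
    those time points as the downbeat timestamps."""
    return [t for e, t in zip(energies, times) if e > threshold]
```

The resulting timestamp list can then be used to align the target voice audio of each character to the nearest downbeat.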
After the terminal determines the template beat and performs beat matching, it can also perform mixing. For example, in one embodiment, mixing the matched audio to obtain the adjusted target audio includes: adjusting the audio energy of the target voice audio in the matched audio according to the audio energy of the template accompaniment, so that the audio energy of the target voice audio is smaller than that of the template accompaniment; and superimposing and mixing the energy-adjusted target voice audio with the template accompaniment to obtain the adjusted target audio. In this embodiment, the terminal may adjust the audio energy of the target voice audio in the matched audio according to the audio energy of the template accompaniment, so that the vocal energy is smaller than the accompaniment energy, and may then superimpose and mix the energy-adjusted target voice audio with the template accompaniment to obtain the adjusted target audio. Specifically, the terminal may perform weighted superposition of the adjusted and matched human voice audio and the template accompaniment corresponding to the audio template. To prevent the dry vocal energy from excessively covering the accompaniment, the loudness of the human voice audio may be set a preset value, for example 3 dB, below the accompaniment loudness; after the terminal adjusts the energy of the human voice audio, it may superimpose the adjusted human voice audio and the template accompaniment to obtain the final adjusted target audio for output.
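The weighted superposition with the vocal held a fixed number of dB below the accompaniment can be sketched as follows. This is illustrative: RMS is used here as the loudness proxy, which is an assumption — the text only says "loudness".

```python
import math

def mix_tracks(vocal, accomp, headroom_db=3.0):
    """Scale the vocal so its RMS sits headroom_db below the accompaniment's
    RMS, then superimpose the two tracks sample by sample."""
    def rms(x):
        return math.sqrt(sum(s * s for s in x) / len(x))
    target = rms(accomp) * 10.0 ** (-headroom_db / 20.0)
    gain = target / rms(vocal) if rms(vocal) > 0 else 0.0
    return [gain * v + a for v, a in zip(vocal, accomp)]
```

With the default 3 dB headroom, the vocal ends up at roughly 71% of the accompaniment's RMS before summation.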
Through this embodiment, the terminal can perform beat matching between the pitch-adjusted target voice audio and the template accompaniment, and perform energy-based mixing of the adjusted target voice audio and the template accompaniment, thereby improving the audio adjustment effect for guichu (remix-meme) audio.
In one embodiment, as shown in fig. 7, fig. 7 is a flow chart of an audio adjustment method in another embodiment. In this embodiment, the audio adjustment may be guichu (remix-meme) audio adjustment. The method comprises the following steps: the terminal constructs a guichu template, i.e., the audio template, by the method described above. The terminal obtains the user's dry vocal material, such as vocal audio recorded by the user. Specifically, the terminal may display the audio templates; the user may select one from the plurality of audio templates displayed and record vocal audio according to the text corresponding to that template, so that the terminal records the vocal audio input by the user as the dry vocal material. The terminal may segment the lyrics of the dry vocal material and perform audio adjustment based on the guichu template for each lyric line, thereby obtaining the dry vocal melody information, i.e., the adjusted target vocal audio. The terminal may further obtain the accompaniment material and its beat information, match the dry vocal melody information to the beat through the guichu template, and mix the matched accompaniment with the user's dry vocal melody to obtain the guichu audio, i.e., the adjusted target audio, which is then output.
Through this embodiment, the terminal performs processing including duration adjustment, pitch adjustment and fusion on the voice audio based on the audio template, improving the audio adjustment effect when adjusting guichu (remix-meme) audio.
In addition, in some embodiments, as shown in fig. 8, fig. 8 is a flow chart of an audio adjustment method in yet another embodiment. In this embodiment, the terminal may obtain, as the target melody, any melody that the user designates a song or a song clip, or that the user randomly hums. The terminal estimates the target melody through an automatic fundamental frequency extraction and beat detection tool, takes the target melody as an audio template, and carries out the audio adjustment process based on the audio template.
Through this embodiment, the terminal can obtain an audio template based on melody information chosen autonomously by the user, improving the flexibility of audio adjustment. The terminal can then perform processing including duration adjustment, pitch adjustment and fusion on the voice audio based on this audio template, improving the audio adjustment effect when adjusting guichu (remix-meme) audio.
It should be understood that, although the steps in the flowcharts of the above embodiments are shown sequentially as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in these flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with at least some of the other steps or stages.
Based on the same inventive concept, the embodiment of the application also provides an audio adjusting device for realizing the above-mentioned audio adjusting method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of the embodiment of one or more audio adjusting devices provided below may be referred to the limitation of the audio adjusting method hereinabove, and will not be repeated here.
In one embodiment, as shown in fig. 9, there is provided an audio adjusting apparatus including: recording module 500, identification module 502, determination module 504, matching module 506, adjustment module 508, and fusion module 510, wherein:
the recording module 500 is used for selecting an audio template and recording the voice audio corresponding to the audio template;
the identifying module 502 is configured to obtain fundamental frequency information corresponding to the voice audio and identify a text and a timestamp corresponding to a text word in the voice audio;
a determining module 504, configured to determine the baseband information corresponding to the text word based on the baseband information and the timestamp corresponding to the text word;
and the matching module 506 is configured to adjust the fundamental frequency duration of each text word in the voice audio based on the duration information of the audio portion corresponding to each text word in the audio template, so as to obtain target fundamental frequency information.
And the adjusting module 508 is configured to adjust the pitch of the target fundamental frequency information according to the pitch information corresponding to each text word of the audio template and the pitch variation trend of the audio template, and determine the adjusted target voice audio.
And the fusion module 510 is configured to obtain a template accompaniment corresponding to the audio template, and fuse the target voice audio with the template accompaniment to obtain the adjusted target audio.
In one embodiment, the recording module 500 is specifically configured to display at least one audio template to be selected; receiving a selection instruction of at least one audio template to be selected, and determining the selected audio template; and displaying the text corresponding to the audio template, and recording the voice audio input by the user based on the text corresponding to the audio template.
In one embodiment, the identifying module 502 is specifically configured to identify an original text corresponding to the voice audio according to the fundamental frequency information; modifying the original text according to a matching result of the original text and the text corresponding to the audio template to obtain the text corresponding to the voice audio; each text word in the text corresponding to the voice audio is matched with each text word of the text corresponding to the audio template; and determining the time stamp of each text word in the voice audio according to the duration time of the corresponding audio frequency of each text word in the text corresponding to the voice audio in the voice audio.
In one embodiment, the identification module 502 is specifically configured to obtain pitch information, timbre information, and sounding characteristic information of an audio portion corresponding to each text word in the text of the audio template in the voice audio as the fundamental frequency information.
In one embodiment, the matching module 506 is specifically configured to obtain pitch information, tone information, and sounding characteristic information of each text word in the voice audio and an audio portion of a corresponding text word in the audio template, as fundamental frequency information corresponding to each text word in the voice audio; respectively performing one-dimensional linear interpolation processing on pitch information, tone information and sounding characteristic information of each text word of the voice audio according to the duration information of the audio part corresponding to each text word in the audio template, so that the duration of the pitch information, the duration of the tone information and the duration of the sounding characteristic information are respectively matched with the duration information of the audio part corresponding to each text word in the audio template; and taking the target pitch information, the target tone information and the target sounding characteristic information after interpolation processing as target fundamental frequency information.
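The one-dimensional linear interpolation used to stretch each per-character feature sequence to the template's duration can be sketched as follows (illustrative; frame counts stand in for durations, and the function name is an assumption):

```python
def linear_stretch(seq, target_len):
    """Resample a feature sequence (pitch, timbre or sounding-characteristic
    frames of one character) to target_len frames by 1-D linear
    interpolation between neighbouring source frames."""
    if target_len < 1:
        return []
    if len(seq) == 1 or target_len == 1:
        return [seq[0]] * target_len
    out = []
    for i in range(target_len):
        pos = i * (len(seq) - 1) / (target_len - 1)  # fractional source index
        j = int(pos)
        frac = pos - j
        if j + 1 < len(seq):
            out.append(seq[j] * (1 - frac) + seq[j + 1] * frac)
        else:
            out.append(seq[j])
    return out
```

Applying this separately to the pitch, timbre and sounding-characteristic sequences of a character makes each of their durations match the corresponding audio portion of the template.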
In one embodiment, the identifying module 502 is specifically configured to obtain text words corresponding to each character of the audio template from text of the voice audio; acquiring partial voice audio corresponding to each text word in the voice audio; aiming at a part of voice audio corresponding to each text word, acquiring a part of fundamental frequency corresponding to the part of voice audio; according to the partial fundamental frequency and the reference frequency corresponding to the partial voice frequency, determining a pitch offset value of pitch information corresponding to the partial voice frequency, and according to the reference pitch and the pitch offset value, determining the pitch information corresponding to the partial voice frequency; constructing an envelope matrix according to envelope vectors of a preset number of frequency points in the part of voice audio to obtain tone information; and obtaining sounding characteristic information according to the non-periodic information in the frequency bands of the preset number in the part of the voice audios.
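The pitch-offset computation from a partial fundamental frequency and the reference frequency described above can be sketched as the inverse of the equal-temperament mapping, assuming the A-440 reference (reference pitch 69) stated elsewhere in the text:

```python
import math

def freq_to_pitch(freq, ref_pitch=69, ref_freq=440.0):
    """Pitch offset in semitones from the reference frequency, added to the
    reference pitch: pitch = 69 + 12 * log2(f / 440)."""
    return ref_pitch + 12.0 * math.log2(freq / ref_freq)
```

A fundamental of 440 Hz yields pitch 69, and doubling the frequency adds exactly one octave (12 semitones).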
In one embodiment, the adjusting module 508 is specifically configured to obtain target pitch information in the target fundamental frequency information, and obtain a portion of target pitch information corresponding to each text word in the voice audio in the target pitch information; aiming at each part of target pitch information, acquiring a preset number of adjacent frames adjacent to each frame in the part of target pitch information; according to the average value of the partial target pitch information of each frame and the adjacent partial target pitch information of the adjacent frame corresponding to each frame, obtaining average pitch information corresponding to the partial target pitch information of each frame; the average pitch information characterizes the pitch variation trend of part of the target pitch information of each frame; and according to the average pitch information and the pitch information corresponding to each text word in the audio template, performing pitch adjustment on the target pitch information of the part in the target audio feature.
In one embodiment, the adjusting module 508 is specifically configured to determine a frequency adjustment value of the target pitch information after the pitch adjustment according to the target pitch information after the pitch adjustment and the reference pitch of the audio template; according to the reference frequency and the frequency adjustment value of the audio template, the adjusted target frequency corresponding to the voice audio is determined; and determining the adjusted target voice audio according to the target frequency, the target tone information and the target sounding characteristic information.
In one embodiment, the fusion module 510 is specifically configured to obtain a template beat of a template accompaniment; matching the target voice audio with the template accompaniment according to the template beat; and mixing the matched audio to obtain the adjusted target audio.
In one embodiment, the fusion module 510 is specifically configured to obtain the audio energy value at each time point in the template accompaniment, and take the time points whose audio energy value is greater than a preset energy threshold as downbeat timestamps of the template accompaniment; and determine the template beat of the template accompaniment according to the plurality of downbeat timestamps.
In one embodiment, the fusion module 510 is specifically configured to adjust the audio energy of the target voice audio in the matched audio according to the audio energy of the template accompaniment in the matched audio, so that the audio energy of the target voice audio is smaller than the audio energy of the template accompaniment; and carrying out superposition and mixing processing on the target voice audio after the audio energy adjustment and the template accompaniment to obtain adjusted target audio.
In one embodiment, the obtaining module 500 is specifically configured to perform speech recognition on the voice audio, and determine an audio text corresponding to the voice audio; the audio text includes at least one character and a character time stamp for each character; extracting fundamental frequency from the audio frequency part corresponding to each character in the audio frequency text to obtain fundamental frequency information corresponding to each character; and obtaining the voice audio according to the fundamental frequency information corresponding to each character and the character time stamp corresponding to each fundamental frequency information.
The respective modules in the above-described audio adjusting apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and an internal structure diagram thereof may be as shown in fig. 10. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement an audio adjustment method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 10 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided that includes a memory having a computer program stored therein and a processor that implements the above-described audio adjustment method when the computer program is executed.
In one embodiment, a computer readable storage medium is provided having a computer program stored thereon, which when executed by a processor implements the above-described audio adjustment method.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-described audio adjustment method.
It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (13)

1. A method of audio conditioning, the method comprising:
selecting an audio template, and recording voice audio corresponding to the audio template;
acquiring fundamental frequency information corresponding to the voice audio and identifying a text and a time stamp corresponding to a text word in the voice audio;
determining the fundamental frequency information corresponding to the text word based on the fundamental frequency information and the timestamp corresponding to the text word;
adjusting the fundamental frequency duration of each text word in the voice audio based on the duration information of the audio part corresponding to each text word in the audio template to obtain target fundamental frequency information;
according to the pitch information corresponding to each text word of the audio template and the pitch change trend of the audio template, performing pitch adjustment on the target fundamental frequency information, and determining adjusted target voice audio;
and acquiring a template accompaniment corresponding to the audio template, and carrying out fusion processing on the target voice audio and the template accompaniment to obtain adjusted target audio.
2. The method of claim 1, wherein selecting an audio template and recording voice audio corresponding to the audio template comprises:
displaying at least one audio template to be selected;
receiving a selection instruction of the at least one audio template to be selected, and determining the selected audio template;
and displaying the text corresponding to the audio template, and recording the voice audio input by the user based on the text corresponding to the audio template.
3. The method of claim 1, wherein the identifying the corresponding time stamps for text and text words in the voice audio comprises:
identifying an original text corresponding to the voice audio according to the fundamental frequency information;
modifying the original text according to a matching result of the original text and the text corresponding to the audio template to obtain the text corresponding to the voice audio; each text word in the text corresponding to the voice audio is matched with each text word in the text corresponding to the audio template;
and determining the timestamp of each text word in the voice audio according to the duration of the audio corresponding to each text word in the text corresponding to the voice audio.
4. The method according to claim 1, wherein the acquiring the fundamental frequency information corresponding to the voice audio includes:
and acquiring pitch information, timbre information and sounding characteristic information corresponding to the voice audio as the fundamental frequency information.
5. The method of claim 4, wherein the adjusting the duration of the fundamental frequency information of each text word in the voice audio based on the duration information of the audio part corresponding to each text word in the audio template to obtain the target fundamental frequency information comprises:
acquiring pitch information, timbre information and sounding characteristic information of each text word in the voice audio and of the audio part of the corresponding text word in the audio template, and taking the pitch information, the timbre information and the sounding characteristic information as the fundamental frequency information corresponding to each text word in the voice audio;
respectively performing one-dimensional linear interpolation processing on the pitch information, timbre information and sounding characteristic information of each text word of the voice audio according to the duration information of the audio part corresponding to each text word in the audio template, so that the duration of the pitch information, the duration of the timbre information and the duration of the sounding characteristic information each match the duration information of the audio part corresponding to each text word in the audio template;
and taking the interpolated target pitch information, target timbre information and target sounding characteristic information as the target fundamental frequency information.
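The duration matching of claim 5 can be sketched as one-dimensional linear interpolation along the time axis of each feature. The function name and array layout below are illustrative assumptions rather than the patent's implementation: pitch is a 1-D per-frame sequence, while timbre and sounding-characteristic (aperiodicity) features are frame-by-bin matrices whose columns are interpolated independently, as in WORLD-style vocoder features.

```python
import numpy as np

def stretch_features(pitch, timbre, aperiodicity, target_len):
    """Resample a word's frame sequences to the template word's frame
    count via one-dimensional linear interpolation (claim 5, step 2)."""
    src = np.linspace(0.0, 1.0, num=len(pitch))
    dst = np.linspace(0.0, 1.0, num=target_len)
    # Pitch: a single 1-D curve over frames.
    stretched_pitch = np.interp(dst, src, pitch)
    # Timbre and aperiodicity: per-frame spectral matrices; interpolate
    # each frequency bin independently along the time axis.
    stretched_timbre = np.stack(
        [np.interp(dst, src, timbre[:, k]) for k in range(timbre.shape[1])],
        axis=1)
    stretched_ap = np.stack(
        [np.interp(dst, src, aperiodicity[:, k]) for k in range(aperiodicity.shape[1])],
        axis=1)
    return stretched_pitch, stretched_timbre, stretched_ap
```

After this step, each word's features have exactly as many frames as the template word's audio part, so the later pitch adjustment can operate frame-by-frame.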
6. The method of claim 5, wherein performing pitch adjustment on the target fundamental frequency information according to the pitch information corresponding to each text word of the audio template and the pitch variation trend of the audio template comprises:
acquiring target pitch information in the target fundamental frequency information, and acquiring, from the target pitch information, partial target pitch information corresponding to each text word in the voice audio;
for each piece of partial target pitch information, acquiring a preset number of adjacent frames adjacent to each frame in the partial target pitch information;
obtaining average pitch information corresponding to the partial target pitch information of each frame according to the average value of the partial target pitch information of each frame and the partial target pitch information of the adjacent frames corresponding to each frame; the average pitch information characterizes the pitch variation trend of the partial target pitch information of each frame;
and performing pitch adjustment on the partial target pitch information in the target fundamental frequency information according to the average pitch information and the pitch information corresponding to each text word in the audio template.
7. The method of claim 4, wherein the audio template further comprises a reference pitch and a reference frequency corresponding to the reference pitch, and the acquiring pitch information, timbre information and sounding characteristic information corresponding to the voice audio comprises:
acquiring the text word in the text of the voice audio corresponding to each character of the audio template;
acquiring the partial voice audio corresponding to each text word in the voice audio;
for the partial voice audio corresponding to each text word, acquiring a partial fundamental frequency corresponding to the partial voice audio; determining a pitch offset value of the pitch information corresponding to the partial voice audio according to the partial fundamental frequency and the reference frequency, and determining the pitch information corresponding to the partial voice audio according to the reference pitch and the pitch offset value;
constructing an envelope matrix according to envelope vectors of a preset number of frequency points in the partial voice audio, to obtain the timbre information;
and obtaining the sounding characteristic information according to aperiodic information in a preset number of frequency bands in the partial voice audio.
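Claim 7's pitch offset value can be read as the semitone distance between a word's fundamental frequency and the template's reference frequency, assuming standard 12-tone equal temperament; the MIDI-style note numbering below is an illustrative assumption, not stated in the patent.

```python
import numpy as np

def pitch_offset_semitones(f0, reference_freq):
    """Offset of a word's fundamental frequency from the template's
    reference frequency, in 12-TET semitones (claim 7)."""
    return 12.0 * np.log2(f0 / reference_freq)

def word_pitch(f0, reference_pitch, reference_freq):
    """Pitch information for the word: the template's reference pitch
    (e.g. a MIDI note number) plus the semitone offset."""
    return reference_pitch + pitch_offset_semitones(f0, reference_freq)
```

For example, with reference pitch 69 (A4) at a reference frequency of 440 Hz, a word sung at 880 Hz lies one octave (12 semitones) above the reference.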
8. The method of claim 7, wherein the determining the adjusted target voice audio comprises:
determining a frequency adjustment value of the target pitch information after pitch adjustment according to the target pitch information after pitch adjustment and the reference pitch of the audio template;
according to the reference frequency of the audio template and the frequency adjustment value, determining an adjusted target frequency corresponding to the voice audio;
and determining the adjusted target voice audio according to the target frequency, the target timbre information and the target sounding characteristic information.
9. The method of claim 1, wherein the fusing the target vocal audio with the template accompaniment to obtain the adjusted target audio comprises:
acquiring a template beat of the template accompaniment;
matching the target voice audio with the template accompaniment according to the template beat;
and mixing the matched audio to obtain the adjusted target audio.
10. The method of claim 9, wherein the obtaining the template beat of the template accompaniment comprises:
acquiring audio energy values at each time point in the template accompaniment, and taking the time points whose audio energy values are greater than a preset energy threshold as downbeat timestamps of the template accompaniment;
and determining the template beat of the template accompaniment according to the plurality of downbeat timestamps.
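A minimal sketch of claim 10's energy-threshold downbeat detection, assuming frame-wise mean-square energy and a fixed threshold; the frame length and threshold are placeholder parameters, and real accompaniment would call for level normalization and a more robust onset detector.

```python
import numpy as np

def downbeat_timestamps(samples, sr, frame_len=1024, energy_threshold=0.5):
    """Claim 10: frames whose energy exceeds a preset threshold are
    treated as downbeats; return their start times in seconds."""
    n_frames = len(samples) // frame_len
    stamps = []
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))  # mean-square frame energy
        if energy > energy_threshold:
            stamps.append(i * frame_len / sr)
    return stamps

def beat_period(stamps):
    """Template beat estimated as the median interval between
    consecutive downbeat timestamps."""
    return float(np.median(np.diff(stamps)))
```

The resulting beat grid is what claim 9 uses to time-align the target vocal with the accompaniment before mixing.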
11. The method of claim 9, wherein mixing the matched audio to obtain the adjusted target audio comprises:
adjusting the audio energy of the target voice audio in the matched audio according to the audio energy of the template accompaniment in the matched audio, so that the audio energy of the target voice audio is less than the audio energy of the template accompaniment;
and carrying out superposition and mixing processing on the target voice audio after the audio energy adjustment and the template accompaniment to obtain adjusted target audio.
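Claim 11's mixing step can be sketched as RMS-based gain matching: the vocal is scaled to a fraction of the accompaniment's energy and then superimposed. The `vocal_ratio` parameter and the clipping guard are illustrative assumptions, not specified by the patent.

```python
import numpy as np

def mix(vocal, accompaniment, vocal_ratio=0.8):
    """Scale the vocal's RMS energy below the accompaniment's, then
    superimpose the two tracks (claim 11)."""
    acc_rms = np.sqrt(np.mean(accompaniment ** 2))
    voc_rms = np.sqrt(np.mean(vocal ** 2))
    if voc_rms > 0:
        # vocal_ratio < 1 keeps the vocal quieter than the accompaniment.
        vocal = vocal * (vocal_ratio * acc_rms / voc_rms)
    mixed = vocal + accompaniment
    # Guard against clipping after superposition.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```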
12. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 11 when the computer program is executed.
13. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 11.
CN202211090743.3A 2022-09-07 2022-09-07 Audio adjustment method, computer device and program product Pending CN116312425A (en)

Priority Applications (1)

CN202211090743.3A — priority/filing date 2022-09-07 — Audio adjustment method, computer device and program product


Publications (1)

CN116312425A — published 2023-06-23

Family ID: 86776733


Legal Events

PB01 — Publication
SE01 — Entry into force of request for substantive examination