WO2022089097A1 - Audio processing method and apparatus, electronic device, and computer-readable storage medium - Google Patents

Audio processing method and apparatus, electronic device, and computer-readable storage medium

Info

Publication number
WO2022089097A1
WO2022089097A1 · PCT/CN2021/119539 · CN2021119539W
Authority
WO
WIPO (PCT)
Prior art keywords
audio
harmony
span
cent
dry
Prior art date
Application number
PCT/CN2021/119539
Other languages
English (en)
French (fr)
Inventor
Xu Dong (徐东)
Original Assignee
Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. (腾讯音乐娱乐科技(深圳)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology (Shenzhen) Co., Ltd.
Priority to US18/034,207 priority Critical patent/US20230402047A1/en
Publication of WO2022089097A1 publication Critical patent/WO2022089097A1/zh

Links

Images

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/02 — Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/003 — Changing voice quality, e.g. pitch or formants
    • G10L13/02 — Methods for producing synthetic speech; speech synthesisers
    • G10L2013/021 — Overlap-add techniques
    • G10L25/18 — Extracted parameters being spectral information of each sub-band
    • G10L25/87 — Detection of discrete points within a voice signal
    • G10L25/90 — Pitch determination of speech signals
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/683, G06F16/685 — Retrieval of audio data characterised by metadata automatically derived from the content, e.g. automatically derived transcripts of audio data such as lyrics
    • G10H — ELECTROPHONIC MUSICAL INSTRUMENTS
    • G10H1/10 — Circuits for establishing the harmonic content of tones by combining tones, for obtaining chorus, celeste or ensemble effects
    • G10H1/38 — Accompaniment arrangements; chord
    • G10H2210/066 — Musical analysis for pitch analysis as part of wider musical processing; pitch recognition, e.g. in polyphonic sounds
    • G10H2210/245 — Ensemble, i.e. adding one or more voices, also instrumental voices
    • G10H2210/261 — Duet, i.e. automatic generation of a second voice by a single-voice harmonizer or automatic composition algorithm

Definitions

  • the present application relates to the technical field of audio processing, and more particularly, to an audio processing method, apparatus, electronic device, and computer-readable storage medium.
  • the dry-vocal audio recorded by the user is collected directly by an audio collection device. Since most users have not undergone professional singing training, they have little command of vocal, oral, or chest resonance when singing, so the dry-vocal audio they record directly has a poor auditory effect. Thus, in the process of implementing the present invention, the inventor found at least the following problem in the related art: dry-vocal audio has a poor auditory effect.
  • the purpose of this application is to provide an audio processing method, an apparatus, an electronic device, and a computer-readable storage medium, so as to improve the auditory effect of dry audio.
  • a first aspect of the present application provides an audio processing method, including:
  • performing rising-tone processing with the corresponding first semitone span and with a plurality of different second semitone spans on each of the lyric words, respectively, to obtain a first harmony and a plurality of different second harmonies;
  • the first semitone span is a positive integer number of semitones (keys);
  • each of the plurality of different second semitone spans is the sum of the first semitone span and one of a plurality of different third semitone spans;
  • the first semitone span differs from the third semitone span by an order of magnitude;
  • the first harmony and the plurality of different second harmonies are synthesized to form a multi-track harmony, and the multi-track harmony and the target dry-vocal audio are mixed to obtain synthesized dry-vocal audio.
  • a second aspect of the present application provides an audio processing device, comprising:
  • an acquisition module, configured to obtain the target dry-vocal audio and determine the start and end time of each lyric word in the target dry-vocal audio;
  • a detection module, configured to detect the key shift of the target dry-vocal audio and the fundamental frequency within each word's start and end time, and to determine the pitch name of each of the lyric words based on the fundamental frequency and the key shift;
  • a rising-tone module, configured to perform rising-tone processing with the corresponding first semitone span and with a plurality of different second semitone spans on each of the lyric words, respectively, to obtain a first harmony and a plurality of different second harmonies; wherein the first semitone span is a positive integer number of semitones, each second semitone span is the sum of the first semitone span and one of a plurality of different third semitone spans, and the first semitone span differs from the third semitone span by an order of magnitude;
  • a synthesis module, configured to synthesize the first harmony and the plurality of second harmonies to form a multi-track harmony; and
  • a mixing module configured to mix the multi-track harmony and the target dry audio to obtain a synthesized dry audio.
  • a third aspect of the present application provides an electronic device, including:
  • a memory for storing a computer program; and a processor configured to implement the steps of the above audio processing method when executing the computer program.
  • a fourth aspect of the present application provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the audio processing method described above are implemented.
  • an audio processing method includes: acquiring target dry-vocal audio and determining the start and end time of each lyric word in it; detecting the key shift of the target dry-vocal audio and the fundamental frequency within each word's start and end time, and determining the current pitch name of each lyric word based on the fundamental frequency and the key shift; performing rising-tone processing with the corresponding first semitone span and with a plurality of different second semitone spans on each lyric word, respectively, to obtain a first harmony and a plurality of different second harmonies;
  • the first semitone span is a positive integer number of semitones, and each of the plurality of different
  • second semitone spans is the sum of the first semitone span and one of a plurality of different third semitone spans, the first semitone span differing from the third semitone span by an order of magnitude;
  • the first harmony and the plurality of different second harmonies are synthesized to form a multi-track harmony, and the multi-track harmony and the target dry-vocal audio are mixed to obtain synthesized dry-vocal audio.
  • the target dry-vocal audio input by the user is subjected to rising-tone processing with a first semitone span of an integer number of semitones, which makes the raised first harmony more musical and better matched to the listening habits of the human ear.
  • multiple different second harmonies are generated by a perturbation method, and the multi-track harmony formed by the first harmony and the multiple different second harmonies simulates multiple recordings of singers in a real scene, avoiding the thin auditory effect of a single-track harmony.
  • the multi-track harmony and the target dry audio are mixed to obtain a synthetic dry audio that is more suitable for human hearing, which improves the layering of the dry audio. It can be seen that the audio processing method provided by the present application improves the auditory effect of dry audio.
  • the present application also discloses an audio processing device, an electronic device, and a computer-readable storage medium, which can also achieve the above technical effects.
  • FIG. 1 is an architectural diagram of an audio processing system provided by an embodiment of the present application.
  • FIG. 2 is a flowchart of a first audio processing method provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a second audio processing method provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of a third audio processing method provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a fourth audio processing method provided by an embodiment of the present application.
  • FIG. 6 is a structural diagram of an audio processing apparatus provided by an embodiment of the present application.
  • FIG. 7 is a structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 shows an architecture diagram of an audio processing system provided by an embodiment of the present application. As shown in FIG. 1 , it includes an audio collection device 10 and a server 20 .
  • the audio collection device 10 collects the target dry-vocal audio recorded by the user, and the server 20 raises its pitch to obtain a multi-track harmony and mixes the multi-track harmony with the target dry-vocal audio to obtain synthesized dry-vocal audio better suited to human hearing.
  • the audio processing system may further include a client 30, which may be a fixed terminal such as a PC (Personal Computer) or a mobile terminal such as a mobile phone; a speaker may be provided on the client 30 for outputting the synthesized dry-vocal audio or songs based on the synthesized dry-vocal audio.
  • the embodiment of the present application discloses an audio processing method, which improves the hearing effect of dry audio.
  • FIG. 2 is a flowchart of the first audio processing method provided by an embodiment of the present application; as shown in FIG. 2, the method includes:
  • S101: Obtain the target dry-vocal audio and determine the start and end time of each lyric word in the target dry-vocal audio;
  • the execution subject of this embodiment is the server in the audio processing system provided by the above-mentioned embodiments, and the purpose is to process the target dry sound audio recorded by the user to obtain synthesized dry sound audio that is more suitable for human hearing.
  • the audio collection device collects the target dry-vocal audio recorded by the user and sends it to the server.
  • the target dry-vocal audio is a dry-vocal waveform file recorded by the user; this embodiment does not limit its audio format, which may be MP3, WAV (Waveform Audio File Format), FLAC (Free Lossless Audio Codec), OGG (Ogg Vorbis), or another format.
  • lossless encoding formats such as FLAC and WAV may be preferred.
  • the server first obtains the lyric text corresponding to the target dry-vocal audio: it may directly obtain a lyric file corresponding to the audio, or recognize the lyric text directly from the dry-vocal audio itself;
  • which of these is used is not specifically limited here. It is understandable that the target dry-vocal audio recorded by the user may contain noise, which can make lyric recognition inaccurate, so the target dry-vocal audio may be denoised before the lyric text is recognized.
  • each lyric word in the target dry-vocal audio is obtained from the lyric text.
  • lyrics are generally stored as the lyric words together with each word's start and end time.
  • for example, a lyric line may take the form: 太[0,1000]阳[1000,1500]当[1500,3000]空[3000,3300]…
  • the content in brackets is the start and end time of each lyric word in milliseconds: "太" starts at 0 ms and ends at 1000 ms, "阳" starts at 1000 ms and ends at 1500 ms, and so on; the lyric text extracted accordingly is "太阳当空照" ("the sun shines in the sky").
  • the lyrics may also be in other languages. Taking English as an example, the extracted lyric text could be "the, sun, is, rising". Finally, the phonetic notation of each lyric word is determined according to its text type: if the word is Chinese, the notation is its pinyin (for the lyric text "太阳当空照", the corresponding pinyin is "tai yang dang kong zhao"); if the word is English, the notation is its English phonetic symbols.
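The per-word timing format described above can be parsed mechanically. The following sketch (the regex and function name are illustrative, not from the patent) extracts (word, start_ms, end_ms) triples:

```python
import re

# Each lyric word is followed by its [start, end] time in milliseconds,
# e.g. "太[0,1000]阳[1000,1500]当[1500,3000]空[3000,3300]".
LYRIC_RE = re.compile(r"([^\[\]]+)\[(\d+),(\d+)\]")

def parse_lyrics(line: str):
    """Return a list of (word, start_ms, end_ms) tuples."""
    return [(w, int(s), int(e)) for w, s, e in LYRIC_RE.findall(line)]

words = parse_lyrics("太[0,1000]阳[1000,1500]当[1500,3000]空[3000,3300]")
# words[0] == ("太", 0, 1000)
```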
  • S102: Detect the key shift of the target dry-vocal audio and the fundamental frequency within each word's start and end time, and determine the current pitch name of each of the lyric words based on the fundamental frequency and the key shift;
  • the key shift of the input target dry-vocal audio is detected, the fundamental frequency within each word's start and end time is determined, and the current pitch name of each lyric word is obtained by analyzing that fundamental frequency in combination with the key shift. For example, suppose a lyric word "you" is sung during the time span (t1, t2): since the dry vocal has been key-shifted, extracting the fundamental frequency of the sound in (t1, t2) yields the pitch name corresponding to that lyric word.
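The patent does not specify how the fundamental frequency maps to a pitch name; a common choice, assumed here, is equal temperament with A4 = 440 Hz via MIDI note numbers:

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_note_name(f0_hz: float) -> str:
    """Map a fundamental frequency to the nearest equal-tempered pitch name
    (A4 = 440 Hz = MIDI note 69)."""
    midi = round(69 + 12 * math.log2(f0_hz / 440.0))
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

# hz_to_note_name(440.0) == "A4"; hz_to_note_name(261.63) == "C4"
```

In practice the f0 used for a word would be a robust statistic (e.g. the median) of the f0 track over that word's (t1, t2) span.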
  • S103: Perform rising-tone processing with the corresponding first semitone span and with a plurality of different second semitone spans on each of the lyric words, respectively, to obtain a first harmony and a plurality of different second harmonies; wherein the first semitone span is a positive integer number of semitones, each second semitone span is the sum of the first semitone span and one of a plurality of different third semitone spans, and the first semitone span differs from the third semitone span by an order of magnitude;
  • each lyric word in the target dry-vocal audio is subjected to rising-tone processing with the corresponding first semitone span and with a plurality of different second semitone spans, respectively, to obtain the first harmony and the plurality of different second harmonies.
  • the first semitone span is a positive integer number of semitones;
  • here one unit of span corresponds to one key, i.e. one semitone;
  • the semitone span refers to the difference in semitones between the raised target pitch and the current pitch;
  • the first harmony is thus equivalent to a chordal pitch-raise of the target dry-vocal audio.
  • each second semitone span is the sum of the first semitone span and one of a plurality of different third semitone spans;
  • the third semitone span is an order of magnitude smaller than the first semitone span; that is, each second harmony is equivalent to a fine-tuned (slightly detuned) copy of the first harmony.
  • the first semitone span is determined according to the music theory of major and minor triads. That is, performing the rising-tone processing of the first semitone span and of a plurality of different second semitone spans on each lyric word to obtain the first harmony and the plurality of different second harmonies may include: determining a preset pitch-name span and performing a rising tone of that preset pitch-name span on each lyric word to
  • obtain the first harmony; wherein adjacent pitch names differ by one or two semitones; and performing rising tones of a plurality of different third semitone spans on the first harmony to
  • obtain the plurality of different second harmonies.
  • each lyric word in the target dry-vocal audio is subjected to rising-tone processing of a preset pitch-name span to obtain the first harmony.
  • the first harmony is subjected to rising tones of a plurality of different third semitone spans to obtain the plurality of different second harmonies.
  • the preset pitch-name span refers to the difference in pitch names between the target pitch name after the rising tone and the current pitch name. Raising by a full cycle of pitch names (an octave) is equivalent to raising 12 semitones (keys):
  • the frequency is doubled, for example from 440 Hz to 880 Hz; if 3 keys are raised, the frequency is multiplied by 2 to the power 3/12 (about 1.189), for example from 440 Hz to about 523 Hz.
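The frequency arithmetic above follows directly from equal temperament and can be checked with a one-line helper (the function name is illustrative):

```python
def shifted_freq(f_hz: float, n_semitones: float) -> float:
    """Frequency after raising by n semitones ("keys"): f * 2**(n/12)."""
    return f_hz * 2 ** (n_semitones / 12)

# shifted_freq(440, 12) -> 880.0 (one octave, frequency doubled)
# shifted_freq(440, 3)  -> ~523.25 (the ~1.189 ratio mentioned above)
```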
  • the preset sound name span is not specifically limited here, and those skilled in the art can flexibly select it according to the actual situation, generally not more than 7, and preferably 2.
  • the semitone span between adjacent pitch names can be 1 key or 2 keys.
  • in Table 1, "+key" denotes the semitone span between adjacent pitch names.
  • performing the rising-tone processing of the preset pitch-name span on each of the lyric words to obtain the first harmony may include: determining the target pitch name of each lyric word after the rising tone according to its current pitch name and the preset pitch-name span; determining the number of first semitone spans corresponding to each lyric word based on the semitone span between its target pitch name and its current pitch name; and performing rising-tone processing of that number of first semitone spans on each lyric word to obtain the first harmony.
  • in other words, once the number of first semitone spans for each lyric word is determined,
  • raising each word by that many first semitone spans yields the first harmony.
  • for example, if the current pitch name of a lyric word is E, the target pitch name after raising by 2 pitch names is G;
  • the semitone span between the target pitch name and the current pitch name is 3, that is, the pitch is actually raised by 3 keys.
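Assuming the natural pitch names C through B with the usual 1- or 2-semitone steps between adjacent names (consistent with the E-to-G example above), the pitch-name-span to semitone-span conversion can be sketched as:

```python
# Natural pitch names and the semitone step from each name to the next
# (C->D 2, D->E 2, E->F 1, F->G 2, G->A 2, A->B 2, B->C 1).
NAMES = ["C", "D", "E", "F", "G", "A", "B"]
STEPS = [2, 2, 1, 2, 2, 2, 1]

def raise_by_note_names(current: str, name_span: int):
    """Return (target_name, semitone_span) after raising `name_span` pitch names."""
    i = NAMES.index(current)
    semis = sum(STEPS[(i + k) % 7] for k in range(name_span))
    return NAMES[(i + name_span) % 7], semis

# raise_by_note_names("E", 2) -> ("G", 3), matching the example above
```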
  • the above-mentioned rising tone processing method is based on the music theory of major and minor chords. This processing method can make the raised tone more musical and more in line with the listening habits of the human ear.
  • each lyric word is subjected to the corresponding rising-tone processing to obtain the rising-tone result of the target dry-vocal audio, that is, the first harmony after the chordal raise, which is a single-track harmony.
  • the pitch-raising manner in this embodiment increases the fundamental frequency of the sound to obtain a sound that is perceptually higher in pitch.
  • the above single-track harmony is then slightly pitch-shifted, that is, raised by the third semitone spans, to obtain the multi-track harmony.
  • the third semitone span is not specifically limited here; those skilled in the art can select it flexibly according to the actual situation, and it generally does not exceed 1 key.
  • each second harmony is raised by a different preset span relative to the first harmony, for example 0.05 key, 0.1 key, 0.15 key, 0.2 key, and so on.
  • the number of second-harmony tracks is likewise not limited; it can be, for example, 3, 5, or 7 tracks, corresponding to 3, 5, or 7 preset semitone spans respectively.
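One minimal way to realize this perturbation step is sketched below, using naive resampling as the pitch raise (an assumption: a production system would more likely use a duration-preserving pitch shifter such as a phase vocoder or PSOLA):

```python
import numpy as np

def detune(x: np.ndarray, semitones: float) -> np.ndarray:
    """Naive pitch perturbation by resampling: raises pitch by `semitones`.
    Also shortens the clip slightly, which is negligible for spans <= 0.2 key."""
    ratio = 2 ** (semitones / 12)           # playback-rate ratio
    idx = np.arange(0, len(x) - 1, ratio)   # fractional sample positions
    return np.interp(idx, np.arange(len(x)), x)

def make_harmony_tracks(first_harmony, detunes=(0.05, 0.10, 0.15, 0.20)):
    """The plural "second harmonies": small, differing detunes of the
    single-track first harmony, simulating several takes by one singer."""
    return [detune(first_harmony, d) for d in detunes]
```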
  • S104: Synthesize the first harmony and the plurality of different second harmonies to form a multi-track harmony, and mix the multi-track harmony with the target dry-vocal audio to obtain synthesized dry-vocal audio.
  • synthesizing the first harmony and the plurality of different second harmonies to form the multi-track harmony may include: determining the volume and time delay corresponding to the first harmony and to each second harmony; and mixing the first harmony and each second harmony according to their corresponding volumes and time delays.
  • that is, the volume and delay of each track are first determined for mixing.
  • the multi-track harmony SH can be expressed as: SH(t) = Σ_{i=1}^{m} a_i · SH_i(t − delay_i), where a_i is the volume coefficient of the i-th harmony track, SH_i is the i-th harmony track, delay_i is the delay coefficient of the i-th track (generally between 1 and 30 milliseconds, though other values are possible), and m is the total number of tracks of the multi-track harmony.
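The mixing formula above can be sketched as follows (the function name and the specific volumes/delays in the usage are illustrative):

```python
import numpy as np

def mix_tracks(tracks, volumes, delays_ms, sr):
    """Mix harmony tracks as sum of a_i * SH_i delayed by delay_i.
    Small per-track delays (a few ms) thicken the ensemble effect."""
    delays = [int(sr * d / 1000) for d in delays_ms]
    out_len = max(len(t) + d for t, d in zip(tracks, delays))
    out = np.zeros(out_len)
    for t, a, d in zip(tracks, volumes, delays):
        out[d:d + len(t)] += a * t
    return out
```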
  • the target dry-vocal audio input by the user is subjected to rising-tone processing with a first semitone span of an integer number of semitones, so that the raised first harmony is more musical and better matches the listening habits of the human ear.
  • multiple different second harmonies are generated by a perturbation method, and the multi-track harmony formed by the first harmony and the multiple different second harmonies simulates multiple recordings of singers in a real scene, avoiding the thin auditory effect of a single-track harmony.
  • the multi-track harmony and the target dry audio are mixed to obtain a synthetic dry audio that is more suitable for human hearing, which improves the layering of the dry audio. It can be seen that the audio processing method provided by the embodiment of the present application improves the hearing effect of dry audio.
  • the method further includes: adding sound effects to the synthesized dry-vocal audio using an audio effect device; and obtaining the accompaniment audio corresponding to the synthesized dry-vocal audio and superimposing the accompaniment audio and the effect-processed synthesized dry-vocal audio according to a preset method to obtain the synthesized audio.
  • the synthesized target dry-vocal audio can be combined with the accompaniment to generate a final song; the synthesized song can be stored in the server back end, output to the client, or played through a speaker.
  • the synthesized target dry sound audio may be processed by sound effects devices such as a reverberator and an equalizer to obtain dry sound audio with a certain sound effect.
  • there are many options for the sound effect device here, such as sound effect plug-ins or sound effect algorithms, which are not specifically limited. Since the target dry-vocal audio is pure vocal audio without instrumental accompaniment, it differs from common songs: for example, during a prelude with no vocals, a song without accompaniment would be silent. Therefore, the effect-processed target dry-vocal audio and the accompaniment audio are superimposed according to a preset method to obtain the synthesized audio, that is, a song.
  • the specific superposition manner is not limited here, and those skilled in the art can select it flexibly according to the actual situation.
  • superimposing the accompaniment audio and the effect-processed target dry-vocal audio according to a preset method to obtain the synthesized audio may include: performing power normalization on the accompaniment audio and the effect-processed target dry-vocal audio to obtain intermediate accompaniment audio and intermediate dry-vocal audio; and superimposing the intermediate accompaniment audio and the intermediate dry-vocal audio according to a preset energy ratio to obtain the synthesized audio.
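A minimal sketch of the power-normalization and energy-ratio superposition, with `vocal_ratio` as an assumed stand-in for the preset energy ratio (the patent does not give a value):

```python
import numpy as np

def rms_normalize(x: np.ndarray) -> np.ndarray:
    """Scale a signal to unit RMS power."""
    return x / (np.sqrt(np.mean(x ** 2)) + 1e-12)

def superimpose(accompaniment, vocals, vocal_ratio=0.5):
    """Power-normalize both signals, then mix at a preset energy ratio."""
    n = min(len(accompaniment), len(vocals))
    acc = rms_normalize(accompaniment[:n])
    voc = rms_normalize(vocals[:n])
    return (1 - vocal_ratio) * acc + vocal_ratio * voc
```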
  • the corresponding harmony is obtained by processing the original dry vocal published by the user, and the harmony is then mixed with the user's original dry vocal.
  • the processed song has a more pleasant listening quality; that is, the musical appeal of the works published by the user is improved, which helps improve user satisfaction.
  • it also helps content providers of the singing platform gain greater influence and competitiveness.
  • the embodiment of the present application discloses an audio processing method. Compared with the previous embodiment, this embodiment further describes and optimizes the technical solution. Specifically:
  • FIG. 3 is a flowchart of the second audio processing method provided by an embodiment of the present application; as shown in FIG. 3, the method includes:
  • S201: Obtain the target dry-vocal audio and determine the start and end time of each lyric word in the target dry-vocal audio;
  • S202: Extract the audio features of the target dry-vocal audio; wherein the audio features include fundamental frequency features and spectral information;
  • the purpose of this step is to extract the audio features of the target dry-vocal audio; the audio features are closely related to its vocalization characteristics and sound quality.
  • the audio features here may include fundamental frequency features and spectral information.
  • the fundamental frequency feature refers to the lowest vibration frequency of a piece of dry audio, which reflects the pitch of the dry audio. The larger the value of the fundamental frequency, the higher the pitch of the dry audio.
  • the spectral information refers to the frequency-spectrum distribution curve of the target dry-vocal audio.
  • S203: Input the audio features into a key-shift classifier to obtain the key shift of the target dry-vocal audio;
  • the key-shift classifier here may be a common Hidden Markov Model (HMM), a Support Vector Machine (SVM), a deep learning model, etc., which is not specifically limited here.
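As an illustration only, a key-shift classifier of the SVM variety mentioned above could be wired up as follows with scikit-learn; the feature dimensionality (16) and the label range (−2 to +2 keys) are placeholders, not values from the patent:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 16))      # stand-in f0/spectral feature vectors
y_train = rng.integers(-2, 3, size=100)   # stand-in key-shift labels, in keys

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# Predict the key shift of one new clip's feature vector
key_shift = int(clf.predict(rng.normal(size=(1, 16)))[0])
```

In a real system the training labels would come from dry vocals with known transpositions, and the features from the f0 and sub-band spectral extraction of the previous step.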
  • S204: Detect the fundamental frequency within the start and end time of each word, and determine the current pitch name of each of the lyric words based on the fundamental frequency and the key shift;
  • S205 Determine a preset pitch-name span, perform a rising tone process on each of the lyric words by the preset pitch-name span to obtain a first harmony, and perform a plurality of different third pitches on the first harmony
  • the rising tone processing of the sub-spans obtains a plurality of different second harmonics; wherein, adjacent phonetic names differ by one or two of the first pitch-spans;
  • S206 Synthesize the first harmonies and a plurality of different second harmonies to form multi-track harmonies, and mix the multi-track harmonies with the target dry sound audio to obtain synthetic dry sound audio.
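As a rough illustration of S204, the sketch below maps a detected fundamental frequency to the nearest note name of the scale. The function name, the use of middle C (261.63 Hz) as a stand-in for the detected key, and the nearest-key rounding are assumptions made for illustration, not details given in the application.

```python
import math

# Note names of the major scale and their cumulative key (semitone) offsets,
# matching the +2 +2 +1 +2 +2 +2 +1 step pattern described later in the text.
NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def current_note_name(f0_hz, tonic_hz=261.63):
    """Map a detected fundamental frequency to the nearest note name.

    `tonic_hz` stands in for the detected key (pitch height) of the dry
    vocal; 261.63 Hz (middle C) is only an illustrative default.
    """
    # Number of keys (semitones) above the tonic, rounded to the nearest key,
    # folded into one octave.
    keys = round(12 * math.log2(f0_hz / tonic_hz)) % 12
    # Pick the scale note whose offset is closest to the detected key count.
    return min(NOTE_OFFSETS, key=lambda n: abs(NOTE_OFFSETS[n] - keys))
```

For example, with the default tonic, 329.63 Hz lands four keys above C and maps to E, consistent with the C-to-E example in the description.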
  • In this embodiment, the key of the target dry vocal audio is obtained by inputting its audio features into the key classifier, which improves the accuracy of key detection.
  • The embodiment of the present application discloses an audio processing method. Compared with the first embodiment, this embodiment further describes and optimizes the technical solution. Specifically:
  • FIG. 4 is a flowchart of a third audio processing method provided by an embodiment of the present application; as shown in FIG. 4, the method includes:
  • S301: Obtain the target dry vocal audio, and determine the start and end times of each lyric word in the target dry vocal audio;
  • S302: Detect the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and determine the current note name of each lyric word based on the fundamental frequency and the key;
  • S303: Determine a preset note-name span, raise each lyric word by the preset note-name span to obtain a first harmony, raise the first harmony by a plurality of different third cent spans to obtain a plurality of different second harmonies, and raise the target dry vocal audio by the third cent span to obtain a third harmony; wherein adjacent note names differ by one or two first cent spans;
  • S304: Synthesize the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony, and mix the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
  • In this embodiment, to preserve the singing characteristics of different users, the target dry vocal audio itself may be directly raised by a small amount; that is, each lyric word in the target dry vocal audio is raised by a preset cent span to obtain a third harmony, and the raised third harmony is added to the multi-track harmony. Obtaining a harmony by raising the dry vocal in this way gives the user's original dry vocal a more pleasant sound and improves the quality of the user's published works.
  • Synthesizing the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony includes: determining the volume and delay corresponding to the third harmony, the first harmony, and each second harmony; and synthesizing the third harmony, the first harmony, and the plurality of second harmonies into the multi-track harmony according to those volumes and delays.
  • The embodiment of the present application discloses an audio processing method. Compared with the first embodiment, this embodiment further describes and optimizes the technical solution. Specifically:
  • FIG. 5 is a flowchart of a fourth audio processing method provided by an embodiment of the present application; as shown in FIG. 5, the method includes:
  • S401: Obtain the target dry vocal audio, and determine the start and end times of each lyric word in the target dry vocal audio;
  • S402: Extract the audio features of the target dry vocal audio; wherein the audio features include fundamental-frequency features and spectral information;
  • S403: Input the audio features into a key classifier to obtain the key of the target dry vocal audio;
  • S404: Detect the fundamental frequency within each start-end interval, and determine the current note name of each lyric word based on the fundamental frequency and the key;
  • S405: Determine a preset note-name span, raise each lyric word by the preset note-name span to obtain a first harmony, raise the first harmony by a plurality of different third cent spans to obtain a plurality of different second harmonies, and raise the target dry vocal audio by the third cent span to obtain a third harmony; wherein adjacent note names differ by one or two first cent spans;
  • S406: Synthesize the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony, and mix the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
  • In this embodiment, the key of the target dry vocal audio is obtained by inputting its audio features into the key classifier, which improves the accuracy of key detection.
  • By processing the dry vocal recorded by the user, a multi-track harmony with more layering and fullness is obtained, and a mixed single-track harmony is obtained through organic mixing, which improves the layering of the dry vocal audio, makes it more pleasant to the ear, and improves its listening quality.
  • In addition, this embodiment can be processed in the background of a computer or in the cloud, with high processing efficiency and fast running speed.
  • For ease of understanding, an application scenario of the present application is described. In a karaoke scenario, the user records dry vocal audio through the audio collection device of the karaoke client, and the server performs audio processing on the dry vocal audio, which may include the following steps:
  • Step 1: Chord raising. First, the key of the input dry vocal audio is detected. Then, the start and end times of each lyric word are obtained from the lyric timing, and the fundamental frequency of the sound within that interval is analyzed to obtain the pitch of the lyric word. Finally, the sound within the interval is raised. Raising each lyric word accordingly yields the raised result of the dry vocal, i.e. the harmony after the chord raise. The raising is performed by increasing the fundamental frequency of the sound to obtain a sound that is perceptibly higher in pitch. Since there is only one track of harmony, it is referred to here as a single-track harmony and denoted harmony B.
  • Step 2: Perturbation pitch shifting. The dry vocal is raised by +0.1 key to obtain harmony A; harmony B is raised by +0.1 key, +0.15 key, and +0.2 key to obtain harmonies C, D, and E; these are gathered into a 5-track harmony SH = [A, B, C, D, E].
  • Step 3: Multi-track mixing. The volume and delay of each track in the mix are determined first; each harmony track, processed with its volume and delay, is then superimposed to obtain a mixed single-track harmony.
  • Step 4: Add accompaniment and reverb to obtain the finished song.
  • Step 5: Output. The finished song is output, e.g. to a mobile terminal, to background storage, or played through the terminal's speaker.
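The "increase the fundamental frequency" operation in Step 1 rests on the semitone relation stated in the description: raising 12 keys doubles the frequency, and raising k keys multiplies it by 2^(k/12). A minimal sketch, with assumed function names:

```python
def key_shift_ratio(keys):
    """Frequency ratio for raising a sound by `keys` semitones (keys).

    Raising 12 keys doubles the frequency; raising 3 keys multiplies it
    by 2 ** (3/12), about 1.189, as in the 440 Hz -> ~523 Hz example.
    """
    return 2.0 ** (keys / 12.0)

def raise_fundamental(f0_hz, keys):
    # The pitch-up in this method works by scaling the fundamental
    # frequency by the corresponding semitone ratio.
    return f0_hz * key_shift_ratio(keys)
```

With these definitions, `raise_fundamental(440.0, 3)` reproduces the description's 440 Hz to roughly 523 Hz example.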
  • An audio processing apparatus provided by an embodiment of the present application is introduced below.
  • The audio processing apparatus described below and the audio processing method described above may be referred to each other.
  • FIG. 6 is a structural diagram of an audio processing apparatus provided by an embodiment of the present application; as shown in FIG. 6, the apparatus includes:
  • an obtaining module 100, configured to obtain the target dry vocal audio and determine the start and end times of each lyric word in the target dry vocal audio;
  • a detection module 200, configured to detect the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and to determine the note name of each lyric word based on the fundamental frequency and the key;
  • a raising module 300, configured to raise each lyric word by a corresponding first cent span and by a plurality of different second cent spans to obtain a first harmony and a plurality of different second harmonies respectively; wherein the first cent span is a positive integer number of cents, each of the plurality of different second cent spans is the sum of the first cent span and one of a plurality of different third cent spans, and the first cent span differs from the third cent span by an order of magnitude;
  • a synthesis module 400, configured to synthesize the first harmony and the plurality of different second harmonies into a multi-track harmony;
  • a mixing module 500, configured to mix the multi-track harmony and the target dry vocal audio to obtain synthesized dry vocal audio.
  • In the audio processing apparatus provided by the embodiment of the present application, the target dry vocal audio input by the user is first raised, based on chord music theory, by a first cent span of an integer number of cents, so that the raised first harmony is more musical and better matches the listening habits of the human ear.
  • Next, a plurality of different second harmonies are generated by the perturbation pitch-shifting method; the multi-track harmony formed by the first harmony and the plurality of different second harmonies simulates a singer recording multiple times in a real scenario, avoiding the thin sound of a single-track harmony.
  • Finally, the multi-track harmony and the target dry vocal audio are mixed to obtain synthesized dry vocal audio better suited to human hearing, which improves the layering of the dry vocal audio. It can be seen that the audio processing apparatus provided by the embodiment of the present application improves the listening quality of dry vocal audio.
  • The detection module 200 includes:
  • an extraction unit, configured to extract the audio features of the target dry vocal audio; wherein the audio features include fundamental-frequency features and spectral information;
  • an input unit, configured to input the audio features into a key classifier to obtain the key of the target dry vocal audio;
  • a first determining unit, configured to detect the fundamental frequency within each start-end interval and determine the current note name of each lyric word based on the fundamental frequency and the key.
  • The raising module 300 is specifically a module that raises each lyric word by a preset note-name span to obtain a first harmony, raises the first harmony by a plurality of preset cent spans to obtain a plurality of second harmonies, and raises the target dry vocal audio by the third cent span to obtain a third harmony;
  • correspondingly, the synthesis module 400 is specifically a module that synthesizes the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony and mixes the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
  • The synthesis module 400 includes:
  • a second determining unit, configured to determine the volume and delay corresponding to the third harmony, the first harmony, and each second harmony;
  • a synthesis unit, configured to synthesize the third harmony, the first harmony, and the plurality of second harmonies into a multi-track harmony according to the volumes and delays corresponding to the third harmony, the first harmony, and each second harmony;
  • a mixing unit, configured to mix the multi-track harmony and the target dry vocal audio to obtain synthesized dry vocal audio.
  • an adding module, configured to add sound effects to the synthesized dry vocal audio using a sound-effect device;
  • a superimposing module, configured to obtain the accompaniment audio corresponding to the synthesized dry vocal audio and superimpose the accompaniment audio and the effect-enhanced synthesized dry vocal audio in a preset manner to obtain synthesized audio.
  • The superimposing module includes:
  • an acquisition unit, configured to acquire the accompaniment audio corresponding to the synthesized dry vocal audio;
  • a normalization processing unit, configured to power-normalize the accompaniment audio and the effect-enhanced synthesized dry vocal audio to obtain intermediate accompaniment audio and intermediate dry vocal audio;
  • a superimposing unit, configured to superimpose the intermediate accompaniment audio and the intermediate dry vocal audio according to a preset energy ratio to obtain the synthesized audio.
  • The raising module 300 includes:
  • a first raising unit, configured to determine a preset note-name span and raise each lyric word by the preset note-name span to obtain a first harmony; wherein adjacent note names differ by one or two first cent spans;
  • a second raising unit, configured to raise the first harmony by a plurality of different third cent spans to obtain a plurality of different second harmonies.
  • The first raising unit includes:
  • a first determining subunit, configured to determine the preset note-name span and, according to the current note name of each lyric word and the preset note-name span, determine the raised target note name of each lyric word;
  • a second determining subunit, configured to determine the number of first cent spans corresponding to each lyric word based on the cent span between the target note name and the current note name of that lyric word;
  • a raising subunit, configured to raise each lyric word by the corresponding number of first cent spans to obtain the first harmony.
  • The present application also provides an electronic device.
  • FIG. 7 is a structural diagram of an electronic device 70 provided by an embodiment of the present application; as shown in FIG. 7, it may include a processor 71 and a memory 72.
  • The processor 71 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • The processor 71 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • The processor 71 may also include a main processor and a coprocessor. The main processor, also called the CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state.
  • In some embodiments, the processor 71 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen.
  • In some embodiments, the processor 71 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
  • The memory 72 may include one or more computer-readable storage media, which may be non-transitory. The memory 72 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 72 is at least used to store the following computer program 721, which, after being loaded and executed by the processor 71, can implement the relevant steps of the audio processing method executed by the server side disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 72 may also include an operating system 722, data 723, and the like, stored either temporarily or permanently. The operating system 722 may include Windows, Unix, Linux, and the like.
  • In some embodiments, the electronic device 70 may further include a display screen 73, an input/output interface 74, a communication interface 75, a sensor 76, a power supply 77, and a communication bus 78.
  • Of course, the structure of the electronic device shown in FIG. 7 does not limit the electronic device in the embodiments of the present application; in practical applications the electronic device may include more or fewer components than shown in FIG. 7, or combine certain components.
  • In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided; when the program instructions are executed by a processor, the steps of the audio processing method executed by the server in any of the foregoing embodiments are implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

An audio processing method and apparatus, an electronic device, and a computer-readable storage medium. The method includes: obtaining target dry vocal audio and determining the start and end times of each lyric word in the target dry vocal audio (S101); detecting the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and determining the current note name of each lyric word based on the fundamental frequency and the key (S102); raising each lyric word by a corresponding first cent span and by a plurality of different second cent spans to obtain a first harmony and a plurality of different second harmonies respectively, where the first cent span is a positive integer number of cents, each of the plurality of different second cent spans is the sum of the first cent span and one of a plurality of different third cent spans, and the first cent span differs from the third cent span by an order of magnitude (S103); synthesizing the first harmony and the plurality of different second harmonies into a multi-track harmony, and mixing the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio (S104). The provided audio processing method improves the listening quality of dry vocal audio.

Description

Audio processing method and apparatus, electronic device, and computer-readable storage medium
This application claims priority to Chinese Patent Application No. 202011171384.5, filed with the China National Intellectual Property Administration on October 28, 2020 and entitled "Audio processing method and apparatus, electronic device, and computer-readable storage medium", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the technical field of audio processing, and more specifically, to an audio processing method and apparatus, an electronic device, and a computer-readable storage medium.
Background
In a singing scenario, the related art uses an audio collection device to directly capture the dry vocal audio recorded by a user. Since most users have not received professional singing training and have little command of voice, mouth, or even chest resonance while singing, directly recorded dry vocal audio sounds poor. In the course of making the present invention, the inventor thus found that the related art has at least the following problem: the listening quality of dry vocal audio is poor.
How to improve the listening quality of dry vocal audio is therefore a technical problem to be solved by those skilled in the art.
Summary
The purpose of this application is to provide an audio processing method and apparatus, an electronic device, and a computer-readable storage medium that improve the listening quality of dry vocal audio.
To achieve the above purpose, a first aspect of this application provides an audio processing method, including:
obtaining target dry vocal audio, and determining the start and end times of each lyric word in the target dry vocal audio;
detecting the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and determining the current note name of each lyric word based on the fundamental frequency and the key;
raising each lyric word by a corresponding first cent span and by a plurality of different second cent spans to obtain a first harmony and a plurality of different second harmonies respectively, where the first cent span is a positive integer number of cents, each of the plurality of different second cent spans is the sum of the first cent span and one of a plurality of different third cent spans, and the first cent span differs from the third cent span by an order of magnitude;
synthesizing the first harmony and the plurality of different second harmonies into a multi-track harmony, and mixing the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
To achieve the above purpose, a second aspect of this application provides an audio processing apparatus, including:
an obtaining module, configured to obtain target dry vocal audio and determine the start and end times of each lyric word in the target dry vocal audio;
a detection module, configured to detect the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and to determine the note name of each lyric word based on the fundamental frequency and the key;
a raising module, configured to raise each lyric word by a corresponding first cent span and by a plurality of different second cent spans to obtain a first harmony and a plurality of different second harmonies respectively, where the first cent span is a positive integer number of cents, each of the plurality of different second cent spans is the sum of the first cent span and one of a plurality of different third cent spans, and the first cent span differs from the third cent span by an order of magnitude;
a synthesis module, configured to synthesize the first harmony and the plurality of second harmonies into a multi-track harmony;
a mixing module, configured to mix the multi-track harmony and the target dry vocal audio to obtain synthesized dry vocal audio.
To achieve the above purpose, a third aspect of this application provides an electronic device, including:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the above audio processing method when executing the computer program.
To achieve the above purpose, a fourth aspect of this application provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above audio processing method.
It can be seen from the above solutions that the audio processing method provided by this application includes: obtaining target dry vocal audio and determining the start and end times of each lyric word in it; detecting the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and determining the current note name of each lyric word based on the fundamental frequency and the key; raising each lyric word by a corresponding first cent span and by a plurality of different second cent spans to obtain a first harmony and a plurality of different second harmonies respectively, where the first cent span is a positive integer number of cents, each second cent span is the sum of the first cent span and one of a plurality of different third cent spans, and the first cent span differs from the third cent span by an order of magnitude; and synthesizing the first harmony and the plurality of different second harmonies into a multi-track harmony and mixing the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
In the audio processing method provided by this application, the target dry vocal audio input by the user is first raised, based on chord music theory, by a first cent span of an integer number of cents, so that the raised first harmony is more musical and better matches the listening habits of the human ear. Next, a plurality of different second harmonies are generated by a perturbation pitch-shifting method; the multi-track harmony formed by the first harmony and the plurality of different second harmonies simulates a singer recording multiple times in a real scenario, avoiding the thin sound of a single-track harmony. Finally, the multi-track harmony is mixed with the target dry vocal audio to obtain synthesized dry vocal audio that better suits human hearing, improving the layering of the dry vocal audio. The audio processing method provided by this application thus improves the listening quality of dry vocal audio. This application also discloses an audio processing apparatus, an electronic device, and a computer-readable storage medium that achieve the same technical effects.
It should be understood that the above general description and the following detailed description are only exemplary and do not limit this application.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application or the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those of ordinary skill in the art may derive other drawings from them without creative effort. The drawings provide a further understanding of the present disclosure, form part of the specification, and together with the following detailed description serve to explain the present disclosure without limiting it. In the drawings:
FIG. 1 is an architecture diagram of an audio processing system provided by an embodiment of this application;
FIG. 2 is a flowchart of a first audio processing method provided by an embodiment of this application;
FIG. 3 is a flowchart of a second audio processing method provided by an embodiment of this application;
FIG. 4 is a flowchart of a third audio processing method provided by an embodiment of this application;
FIG. 5 is a flowchart of a fourth audio processing method provided by an embodiment of this application;
FIG. 6 is a structural diagram of an audio processing apparatus provided by an embodiment of this application;
FIG. 7 is a structural diagram of an electronic device provided by an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
To facilitate understanding of the audio processing method provided by this application, the system in which it is used is introduced first. FIG. 1 shows an architecture diagram of an audio processing system provided by an embodiment of this application; as shown in FIG. 1, the system includes an audio collection device 10 and a server 20.
The audio collection device 10 collects the target dry vocal audio recorded by the user; the server 20 raises the target dry vocal audio to obtain a multi-track harmony, and mixes the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio better suited to human hearing.
Of course, the audio processing system may further include a client 30, which may be a fixed terminal such as a PC (Personal Computer) or a mobile terminal such as a mobile phone. The client 30 may be provided with a speaker for outputting the synthesized dry vocal audio or a song synthesized from it.
The embodiment of this application discloses an audio processing method that improves the listening quality of dry vocal audio.
FIG. 2 is a flowchart of a first audio processing method provided by an embodiment of this application; as shown in FIG. 2, the method includes:
S101: Obtain target dry vocal audio, and determine the start and end times of each lyric word in the target dry vocal audio.
This embodiment is executed by the server of the audio processing system in the above embodiment, with the purpose of processing the user-recorded target dry vocal audio to obtain synthesized dry vocal audio better suited to human hearing. In this step, the audio collection device collects the target dry vocal audio recorded by the user and sends it to the server. It should be noted that the target dry vocal audio is a dry-vocal waveform file recorded by the user; this embodiment does not limit its audio format, which may include MP3, WAV (Waveform Audio File Format), FLAC (Free Lossless Audio Codec), OGG (OGG Vorbis), and other formats. Preferably, to avoid losing sound information, a lossless format such as FLAC or WAV may be used.
In a specific implementation, the server first obtains the lyric text corresponding to the target dry vocal audio. It may obtain the lyric file corresponding to the audio directly, or extract the lyric text from the target dry vocal audio itself, i.e., recognize the lyrics directly from the dry vocal; this is not specifically limited here. Understandably, since the user-recorded target dry vocal audio may contain noise that makes lyric recognition inaccurate, noise reduction may be applied to the dry vocal audio before recognizing the lyric text.
Next, each lyric word in the target dry vocal audio is obtained from the lyric text. Understandably, lyrics are generally stored as lyric words with start and end times. For example, a piece of lyric text may take the form 太[0,1000] 阳[1000,1500] 当[1500,3000] 空[3000,3300] 照[3300,5000], where the brackets give the start and end times of each lyric word in milliseconds: the start time of "太" is 0 ms and its end time is 1000 ms, the start time of "阳" is 1000 ms and its end time is 1500 ms, and so on; the lyric text extracted accordingly is "太, 阳, 当, 空, 照". Of course, the lyrics may be in another language; taking English as an example, the extracted lyric text might be "the, sun, is, rising". Finally, the phonetic notation of each lyric word is determined according to its script: if a lyric word is a Chinese character, its notation is pinyin (for the lyric text "太, 阳, 当, 空, 照", the pinyin is "tai yang dang kong zhao"); if the lyric word is English, its notation is the English phonetic symbol.
S102: Detect the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and determine the current note name of each lyric word based on the fundamental frequency and the key.
In this step, the key of the input target dry vocal audio is detected and the fundamental frequency within each start-end interval is determined; the current note name of each lyric word is obtained by analyzing the fundamental frequency of the sound within the word's start-end interval in combination with the key. For example, if the lyric word "你" lies in the interval (t1, t2), then, since the key of the dry vocal has already been obtained, the note name of that lyric word can be obtained by extracting the fundamental frequency of the sound within (t1, t2).
S103: Raise each lyric word by a corresponding first cent span and by a plurality of different second cent spans to obtain a first harmony and a plurality of different second harmonies respectively; the first cent span is a positive integer number of cents, each of the plurality of different second cent spans is the sum of the first cent span and one of a plurality of different third cent spans, and the first cent span differs from the third cent span by an order of magnitude.
The purpose of this step is to raise the pitch of the target dry vocal audio to better suit human hearing. In a specific implementation, each lyric word in the target dry vocal audio is raised by the corresponding first cent span and by a plurality of different second cent spans to obtain the first harmony and the plurality of different second harmonies. The first cent span is a positive integer number of cents; a cent here is a key, and a cent span is the difference between the target cent after raising and the current cent. The first harmony amounts to a chord raise of the target dry vocal audio. A second cent span is the sum of the first cent span and one of several different third cent spans, and a third cent span is an order of magnitude smaller than the first cent span; that is, a second harmony amounts to a slight detuning of the first harmony.
Understandably, those skilled in the art may directly set the specific values of the first cent span and the plurality of different third cent spans, or may preset a note-name span and the plurality of different third cent spans and let the program determine the first cent span from the preset note-name span according to the music theory of major and minor triads. That is, the step of raising each lyric word by the corresponding first cent span and the plurality of different second cent spans to obtain the first harmony and the plurality of different second harmonies includes: determining a preset note-name span and raising each lyric word by the preset note-name span to obtain the first harmony, where adjacent note names differ by one or two first cent spans; and raising the first harmony by the plurality of different third cent spans to obtain the plurality of different second harmonies. In a specific implementation, each lyric word in the target dry vocal audio is first raised by the preset note-name span to obtain the first harmony; the first harmony is then raised by the plurality of different third cent spans to obtain the plurality of different second harmonies. Understandably, the preset note-name span is the note-name difference between the target note name after raising and the current note name. The note names (the names given to tones of fixed pitch) may include C, D, E, F, G, A, B; raising by seven note names corresponds to raising by 12 cents. Raising a full 12 keys doubles the frequency, e.g. from 440 Hz to 880 Hz; raising 3 keys multiplies the frequency by 2 to the power 3/12 (about 1.189), e.g. from 440 Hz to 523 Hz. The preset note-name span is not specifically limited here; those skilled in the art may choose it flexibly according to the actual situation, generally no more than 7 and preferably 2. According to the music theory of major and minor triads, the cent span between adjacent note names may be 1 key or 2 keys; see Table 1, where "+key" is the cent span between adjacent note names.
Table 1

Note name:  C    D    E    F    G    A    B    C
Solfège:    do   re   mi   fa   so   la   si   do
Numbered:   1    2    3    4    5    6    7    1
+key:            +2   +2   +1   +2   +2   +2   +1
As a feasible implementation, raising each lyric word by the preset note-name span to obtain the first harmony includes: determining the raised target note name of each lyric word according to its current note name and the preset note-name span; determining the number of first cent spans for each lyric word based on the cent span between its target note name and its current note name; and raising each lyric word by the corresponding number of first cent spans to obtain the first harmony.
In a specific implementation, the number of first cent spans by which each lyric word is raised can be determined from the cent span between its target note name and its current note name, and raising each lyric word by that number of first cent spans yields the first harmony. Taking a preset note-name span of 2 as an example: if the lyric word "你" in the interval (t1, t2) has current note name C, then, per Table 1, its solfège name is do and its numbered notation is 1, so raising it by 2 note names gives the target note name E. The cent difference between the target and current note names, i.e. the number of first cent spans, is 4; the actual cents (keys) rise by 4 keys, namely 2 keys from C to D and 2 keys from D to E. If the current note name of another lyric word is E, the target note name after raising 2 note names is G, and the cent span between them is 3; the actual cents rise by 3 keys, namely 1 key from E to F and 2 keys from F to G. This raising follows the music theory of major and minor triads, which makes the raised sound more musical and better matched to the listening habits of the human ear.
Raising each lyric word in this way yields the raised result of the target dry vocal audio, i.e. the first harmony after the chord raise, which is a single-track harmony. Understandably, the raising in this embodiment is achieved by increasing the fundamental frequency of the sound, producing a sound that is perceptibly higher in pitch.
The single-track harmony is then slightly detuned, i.e. raised by a third cent span, to obtain a multi-track result. The third cent span is not specifically limited here; those skilled in the art may choose it flexibly according to the actual situation, generally no more than 1 key. Each second harmony is raised from the first harmony by a different preset cent span, e.g. 0.05 key, 0.1 key, 0.15 key, 0.2 key. Likewise, the number of second-harmony tracks is not limited; there may be, for example, 3, 5, or 7 tracks, corresponding to 3, 5, or 7 preset cent spans.
Slightly detuning the single-track harmony in fact simulates a singer recording the same song multiple times in a real scenario: when a person records the same song several times, the intonation can hardly be exactly the same every time, so there is a small drift in pitch, and precisely this drift brings a richer blended experience and avoids a thin sound. A multi-track harmony thus adds layering to the dry vocal audio.
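The note-name-to-key mapping described above (per Table 1) can be sketched as follows; the function and variable names are illustrative and not from the application:

```python
# Scale degrees in order, with the key (cent) step to the NEXT degree,
# i.e. the "+key" row of Table 1: C+2 D+2 E+1 F+2 G+2 A+2 B+1 -> C.
SCALE = ["C", "D", "E", "F", "G", "A", "B"]
STEP_TO_NEXT = {"C": 2, "D": 2, "E": 1, "F": 2, "G": 2, "A": 2, "B": 1}

def keys_for_note_span(current_name, name_span=2):
    """Total number of keys (first cent spans) needed to raise a lyric
    word by `name_span` note names, per the major/minor-triad rule."""
    idx = SCALE.index(current_name)
    total = 0
    for i in range(name_span):
        # Walk up the scale, accumulating the 1- or 2-key step each time.
        note = SCALE[(idx + i) % len(SCALE)]
        total += STEP_TO_NEXT[note]
    return total
```

This reproduces the worked examples in the text: from C, a 2-note-name raise costs 4 keys (C to E), while from E it costs 3 keys (E to G).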
S104: Synthesize the first harmony and the plurality of different second harmonies into a multi-track harmony, and mix the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
In this step, the first harmony and the plurality of different second harmonies obtained in the previous step are synthesized into a multi-track harmony, which is mixed with the target dry vocal audio to obtain the synthesized dry vocal audio. As a feasible implementation, synthesizing the first harmony and the plurality of different second harmonies into a multi-track harmony includes: determining the volume and delay corresponding to the first harmony and to each second harmony; and mixing the first harmony and each second harmony according to those volumes and delays to obtain the synthesized dry vocal audio. In a specific implementation, the volume and delay of each track in the mix are determined first. With a denoting volume and delay denoting the time delay, the processed i-th harmony track SH_i can be expressed as y = a × SH_i + delay. Here a is generally 0.2 but may take other values, and delay is generally between 1 and 30 (in milliseconds) but may also take other values. Each harmony track, processed with its volume and delay, is then superimposed to obtain the mixed synthesized dry vocal audio, expressed as:

SH_mix = Σ_{i=1}^{m} a_i × SH_i(t − delay_i)

where a_i is the volume coefficient of the i-th harmony track, SH_i is the harmony of the i-th track, delay_i is the delay coefficient of the i-th harmony track, and m is the total number of harmony tracks.
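The superposition formula above can be sketched in code. The NumPy representation, the 44.1 kHz default sample rate, and the function name are assumptions; the per-track volumes and millisecond delays follow the text (a around 0.2, delays of roughly 1 to 30 ms):

```python
import numpy as np

def mix_harmonies(tracks, volumes, delays_ms, sr=44100):
    """Superimpose harmony tracks: SH_mix = sum_i a_i * SH_i delayed by delay_i.

    `tracks` are 1-D float waveforms, `volumes` the a_i coefficients
    (e.g. 0.2 each), `delays_ms` per-track delays in milliseconds.
    """
    delay_samples = [int(sr * d / 1000.0) for d in delays_ms]
    out_len = max(len(t) + d for t, d in zip(tracks, delay_samples))
    mixed = np.zeros(out_len)
    for track, a, d in zip(tracks, volumes, delay_samples):
        # Volume-scale the track, then add it at its delayed position.
        mixed[d:d + len(track)] += a * track
    return mixed
```

For instance, mixing two unit tracks at volume 0.5 with delays of 0 ms and 1 ms (at a 1 kHz sample rate for easy counting) gives a 5-sample output whose overlapping middle sums to 1.0.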
In the audio processing method provided by the embodiment of this application, the target dry vocal audio input by the user is first raised, based on chord music theory, by a first cent span of an integer number of cents, making the raised first harmony more musical and better matched to the listening habits of the human ear. Next, a plurality of different second harmonies are generated by the perturbation pitch-shifting method; the multi-track harmony formed by the first harmony and the plurality of different second harmonies simulates a singer recording multiple times in a real scenario, avoiding the thin sound of a single-track harmony. Finally, the multi-track harmony and the target dry vocal audio are mixed into synthesized dry vocal audio better suited to human hearing, improving the layering of the dry vocal audio. The audio processing method provided by the embodiment of this application thus improves the listening quality of dry vocal audio.
On the basis of the above embodiment, as a preferred implementation, after mixing the multi-track harmony with the target dry vocal audio to obtain the synthesized dry vocal audio, the method further includes: adding sound effects to the synthesized dry vocal audio with a sound-effect device; and obtaining the accompaniment audio corresponding to the synthesized dry vocal audio and superimposing the accompaniment audio and the effect-enhanced synthesized dry vocal audio in a preset manner to obtain synthesized audio.
Understandably, the synthesized dry vocal audio can be combined with an accompaniment to generate the final song, which can be stored in the background of the server, output to the client, or played through a speaker.
In a specific implementation, the synthesized dry vocal audio may be processed by sound-effect devices such as a reverberator or an equalizer to obtain dry vocal audio with certain effects. Many options exist for the effect device, e.g. effect plug-ins or effect algorithms, which are not specifically limited here. Since the target dry vocal audio is pure vocal audio without instruments, it differs from ordinary songs; for example, it lacks the instrumental intro that has no vocals, and without accompaniment the intro would be silence. The effect-enhanced dry vocal audio therefore needs to be superimposed with the accompaniment audio in a preset manner to obtain the synthesized audio, i.e. the song.
The specific superposition manner is not limited here and may be chosen flexibly by those skilled in the art. As a feasible implementation, superimposing the accompaniment audio and the effect-enhanced dry vocal audio in a preset manner to obtain the synthesized audio includes: power-normalizing the accompaniment audio and the effect-enhanced dry vocal audio to obtain intermediate accompaniment audio and intermediate dry vocal audio; and superimposing the intermediate accompaniment audio and the intermediate dry vocal audio according to a preset energy ratio. In a specific implementation, the accompaniment audio and the effect-enhanced dry vocal audio are separately power-normalized into intermediate accompaniment audio accom and intermediate dry vocal audio vocal, both time-domain waveforms; with a preset energy ratio of 0.6:0.4, the synthesized audio is W = 0.6 × vocal + 0.4 × accom.
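The power normalization and 0.6:0.4 energy-ratio superposition can be sketched as follows. The RMS-based normalization, the length alignment, and the helper names are assumptions for illustration; the application does not specify the normalization formula:

```python
import numpy as np

def power_normalize(x):
    """Scale a waveform to unit average power (RMS = 1)."""
    rms = np.sqrt(np.mean(x ** 2))
    return x / rms if rms > 0 else x

def overlay(vocal, accom, vocal_ratio=0.6):
    """W = 0.6 * vocal + 0.4 * accom after power normalization,
    following the preset energy ratio in the text."""
    n = min(len(vocal), len(accom))  # align lengths before superposition
    v = power_normalize(vocal[:n])
    a = power_normalize(accom[:n])
    return vocal_ratio * v + (1.0 - vocal_ratio) * a
```

After normalization both inputs carry equal power, so the 0.6/0.4 weights alone determine the vocal-to-accompaniment balance of the song.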
In this implementation, the efficiency, robustness, and accuracy of the algorithm are leveraged: the original dry vocal published by the user is processed to obtain the corresponding harmony, and the harmony is mixed with the user's original dry vocal into a processed song that sounds more pleasant. This enhances the musical appeal of the user's published works, helping to improve user satisfaction, and also helps content providers of the singing platform gain greater influence and competitiveness.
The embodiment of this application discloses an audio processing method. Compared with the previous embodiment, this embodiment further describes and optimizes the technical solution. Specifically:
FIG. 3 is a flowchart of a second audio processing method provided by an embodiment of this application; as shown in FIG. 3, the method includes:
S201: Obtain target dry vocal audio, and determine the start and end times of each lyric word in the target dry vocal audio;
S202: Extract the audio features of the target dry vocal audio, the audio features including fundamental-frequency features and spectral information;
The purpose of this step is to extract the audio features of the target dry vocal audio, which are closely related to its vocal characteristics and sound quality. The audio features here may include fundamental-frequency features and spectral information. The fundamental-frequency feature is the lowest vibration frequency of a piece of dry vocal audio and reflects its pitch: the larger the fundamental frequency, the higher the pitch. Spectral information refers to the frequency distribution curve of the target dry vocal audio.
S203: Input the audio features into a key classifier to obtain the key of the target dry vocal audio;
In this step, the audio features are input into a key classifier to obtain the key of the target dry vocal audio. The key classifier here may be a common Hidden Markov Model (HMM), a Support Vector Machine (SVM), a deep learning model, or the like, which is not specifically limited here.
S204: Detect the fundamental frequency within each start-end interval, and determine the current note name of each lyric word based on the fundamental frequency and the key;
S205: Determine a preset note-name span, raise each lyric word by the preset note-name span to obtain a first harmony, and raise the first harmony by a plurality of different third cent spans to obtain a plurality of different second harmonies, where adjacent note names differ by one or two first cent spans;
S206: Synthesize the first harmony and the plurality of different second harmonies into a multi-track harmony, and mix the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
It can be seen that in this embodiment, the key of the target dry vocal audio is obtained by inputting its audio features into the key classifier, improving the accuracy of key detection.
The embodiment of this application discloses an audio processing method. Compared with the first embodiment, this embodiment further describes and optimizes the technical solution. Specifically:
FIG. 4 is a flowchart of a third audio processing method provided by an embodiment of this application; as shown in FIG. 4, the method includes:
S301: Obtain target dry vocal audio, and determine the start and end times of each lyric word in the target dry vocal audio;
S302: Detect the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and determine the current note name of each lyric word based on the fundamental frequency and the key;
S303: Determine a preset note-name span, raise each lyric word by the preset note-name span to obtain a first harmony, raise the first harmony by a plurality of different third cent spans to obtain a plurality of different second harmonies, and raise the target dry vocal audio by the third cent span to obtain a third harmony, where adjacent note names differ by one or two first cent spans;
S304: Synthesize the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony, and mix the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
In this embodiment, to preserve the singing characteristics of different users, the target dry vocal audio itself may be directly raised by a small amount; that is, each lyric word in the target dry vocal audio is raised by a preset cent span to obtain a third harmony, which is added to the multi-track harmony. Obtaining a harmony by raising the dry vocal gives the user's original dry vocal a more pleasant sound and improves the quality of the user's published works.
As a feasible implementation, synthesizing the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony includes: determining the volume and delay corresponding to the third harmony, the first harmony, and each second harmony; and synthesizing the third harmony, the first harmony, and the plurality of second harmonies into the multi-track harmony according to those volumes and delays. This process is similar to the one described in the first embodiment and is not repeated here.
It can be seen that this embodiment processes the user's recorded dry vocal to first obtain a single-track harmony conforming to the chord mode, then a multi-track harmony with more layering and fullness, and then, through organic mixing, a mixed single-track harmony; superimposed with the dry vocal, this yields a processed vocal that sounds more pleasant than the user's original dry vocal, improving the content quality of the user's works and user satisfaction.
The embodiment of this application discloses an audio processing method. Compared with the first embodiment, this embodiment further describes and optimizes the technical solution. Specifically:
FIG. 5 is a flowchart of a fourth audio processing method provided by an embodiment of this application; as shown in FIG. 5, the method includes:
S401: Obtain target dry vocal audio, and determine the start and end times of each lyric word in the target dry vocal audio;
S402: Extract the audio features of the target dry vocal audio, the audio features including fundamental-frequency features and spectral information;
S403: Input the audio features into a key classifier to obtain the key of the target dry vocal audio;
S404: Detect the fundamental frequency within each start-end interval, and determine the current note name of each lyric word based on the fundamental frequency and the key;
S405: Determine a preset note-name span, raise each lyric word by the preset note-name span to obtain a first harmony, raise the first harmony by a plurality of different third cent spans to obtain a plurality of different second harmonies, and raise the target dry vocal audio by the third cent span to obtain a third harmony, where adjacent note names differ by one or two first cent spans;
S406: Synthesize the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony, and mix the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
It can be seen that this embodiment obtains the key of the target dry vocal audio by inputting its audio features into the key classifier, improving the accuracy of key detection. By processing the user's recorded dry vocal, a multi-track harmony with more layering and fullness is obtained, and through organic mixing a mixed single-track harmony is obtained, which improves the layering of the dry vocal audio, makes it more pleasant to the ear, and improves its listening quality. In addition, this embodiment can be processed in the computer background or in the cloud, with high processing efficiency and fast running speed.
For ease of understanding, an application scenario of this application is described with reference to FIG. 1. In a karaoke scenario, the user records dry vocal audio through the audio collection device of the karaoke client, and the server performs audio processing on the dry vocal audio, which may include the following steps:
Step 1: Chord raising
In this step, the key of the input dry vocal audio is first detected. Then, the start and end times of each lyric word are obtained from the lyric timing, and the fundamental frequency of the sound within that interval is analyzed to obtain the pitch of the lyric word. Finally, the sound within the interval is raised according to the music theory of major and minor triads. Raising each lyric word accordingly yields the raised result of the dry vocal, i.e. the harmony after the chord raise. The raising increases the fundamental frequency of the sound to produce a perceptibly higher pitch. Since there is only one track of harmony, it is referred to here as a single-track harmony and denoted harmony B.
Step 2: Perturbation pitch shifting
In this step, the dry vocal is first raised by +0.1 key to obtain harmony A. Harmony B is then raised by +0.1 key, +0.15 key, and +0.2 key respectively to obtain harmonies C, D, and E. Finally, these harmonies are gathered into a 5-track harmony SH = [A, B, C, D, E].
Step 3: Multi-track mixing
In this step, the volume and delay of each track in the mix are determined first; each harmony track, processed with its volume and delay, is then superimposed to obtain the mixed single-track harmony.
Step 4: Add accompaniment and reverb to obtain the finished song
Step 5: Output
In this step, the finished song is output, e.g. to a mobile terminal, to background storage, or played through the terminal's speaker.
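Step 2 above can be sketched end to end. The naive interpolation-based pitch shift below is only a stand-in (it also changes the duration, unlike the pitch shifter the application implies), and all names, including the 44.1 kHz default, are illustrative assumptions:

```python
import numpy as np

def detune(wave, keys, sr=44100):
    """Naive pitch shift by `keys` semitones via resampling.

    Illustrative only: resampling shortens the audio as it raises the
    pitch; a real implementation would preserve duration.
    """
    ratio = 2.0 ** (keys / 12.0)
    n_out = int(len(wave) / ratio)
    # Read the input at a faster rate to raise the perceived pitch.
    return np.interp(np.arange(n_out) * ratio, np.arange(len(wave)), wave)

def build_five_track_harmony(dry, harmony_b):
    """SH = [A, B, C, D, E]: A is the dry vocal raised +0.1 key; C, D, E
    are the chord-raised harmony B raised +0.1, +0.15, +0.2 key."""
    a = detune(dry, 0.1)
    c, d, e = (detune(harmony_b, k) for k in (0.1, 0.15, 0.2))
    return [a, harmony_b, c, d, e]
```

The resulting list can then be fed to the multi-track mixing of Step 3, each track with its own volume and delay.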
An audio processing apparatus provided by an embodiment of this application is introduced below; the audio processing apparatus described below and the audio processing method described above may be referred to each other.
FIG. 6 is a structural diagram of an audio processing apparatus provided by an embodiment of this application; as shown in FIG. 6, the apparatus includes:
an obtaining module 100, configured to obtain target dry vocal audio and determine the start and end times of each lyric word in the target dry vocal audio;
a detection module 200, configured to detect the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and to determine the note name of each lyric word based on the fundamental frequency and the key;
a raising module 300, configured to raise each lyric word by a corresponding first cent span and by a plurality of different second cent spans to obtain a first harmony and a plurality of different second harmonies respectively, where the first cent span is a positive integer number of cents, each of the plurality of different second cent spans is the sum of the first cent span and one of a plurality of different third cent spans, and the first cent span differs from the third cent span by an order of magnitude;
a synthesis module 400, configured to synthesize the first harmony and the plurality of different second harmonies into a multi-track harmony;
a mixing module 500, configured to mix the multi-track harmony and the target dry vocal audio to obtain synthesized dry vocal audio.
In the audio processing apparatus provided by the embodiment of this application, the target dry vocal audio input by the user is first raised, based on chord music theory, by a first cent span of an integer number of cents, making the raised first harmony more musical and better matched to the listening habits of the human ear. Next, a plurality of different second harmonies are generated by the perturbation pitch-shifting method; the multi-track harmony formed by the first harmony and the plurality of different second harmonies simulates a singer recording multiple times in a real scenario, avoiding the thin sound of a single-track harmony. Finally, the multi-track harmony and the target dry vocal audio are mixed into synthesized dry vocal audio better suited to human hearing, improving the layering of the dry vocal audio. The audio processing apparatus provided by the embodiment of this application thus improves the listening quality of dry vocal audio.
On the basis of the above embodiment, as a preferred implementation, the detection module 200 includes:
an extraction unit, configured to extract the audio features of the target dry vocal audio, the audio features including fundamental-frequency features and spectral information;
an input unit, configured to input the audio features into a key classifier to obtain the key of the target dry vocal audio;
a first determining unit, configured to detect the fundamental frequency within each start-end interval and determine the current note name of each lyric word based on the fundamental frequency and the key.
On the basis of the above embodiment, as a preferred implementation, the raising module 300 is specifically a module that raises each lyric word by a preset note-name span to obtain a first harmony, raises the first harmony by a plurality of preset cent spans to obtain a plurality of second harmonies, and raises the target dry vocal audio by the third cent span to obtain a third harmony;
correspondingly, the synthesis module 400 is specifically a module that synthesizes the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony and mixes the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
On the basis of the above embodiment, as a preferred implementation, the synthesis module 400 includes:
a second determining unit, configured to determine the volume and delay corresponding to the third harmony, the first harmony, and each second harmony;
a synthesis unit, configured to synthesize the third harmony, the first harmony, and the plurality of second harmonies into a multi-track harmony according to the volumes and delays corresponding to the third harmony, the first harmony, and each second harmony;
a mixing unit, configured to mix the multi-track harmony and the target dry vocal audio to obtain synthesized dry vocal audio.
On the basis of the above embodiment, as a preferred implementation, the apparatus further includes:
an adding module, configured to add sound effects to the synthesized dry vocal audio using a sound-effect device;
a superimposing module, configured to obtain the accompaniment audio corresponding to the synthesized dry vocal audio and superimpose the accompaniment audio and the effect-enhanced synthesized dry vocal audio in a preset manner to obtain synthesized audio.
On the basis of the above embodiment, as a preferred implementation, the superimposing module includes:
an acquisition unit, configured to acquire the accompaniment audio corresponding to the synthesized dry vocal audio;
a normalization processing unit, configured to power-normalize the accompaniment audio and the effect-enhanced synthesized dry vocal audio to obtain intermediate accompaniment audio and intermediate dry vocal audio;
a superimposing unit, configured to superimpose the intermediate accompaniment audio and the intermediate dry vocal audio according to a preset energy ratio to obtain the synthesized audio.
On the basis of the above embodiment, as a preferred implementation, the raising module 300 includes:
a first raising unit, configured to determine a preset note-name span and raise each lyric word by the preset note-name span to obtain a first harmony, where adjacent note names differ by one or two first cent spans;
a second raising unit, configured to raise the first harmony by a plurality of different third cent spans to obtain a plurality of different second harmonies.
On the basis of the above embodiment, as a preferred implementation, the first raising unit includes:
a first determining subunit, configured to determine the preset note-name span and, according to the current note name of each lyric word and the preset note-name span, determine the raised target note name of each lyric word;
a second determining subunit, configured to determine the number of first cent spans corresponding to each lyric word based on the cent span between the target note name and the current note name of that lyric word;
a raising subunit, configured to raise each lyric word by the corresponding number of first cent spans to obtain the first harmony.
With regard to the apparatus in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the method and will not be elaborated here.
This application further provides an electronic device. FIG. 7 is a structural diagram of an electronic device 70 provided by an embodiment of this application; as shown in FIG. 7, it may include a processor 71 and a memory 72.
The processor 71 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 71 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array). The processor 71 may also include a main processor and a coprocessor: the main processor, also called the CPU (Central Processing Unit), processes data in the awake state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 71 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be shown on the display screen. In some embodiments, the processor 71 may further include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory 72 may include one or more computer-readable storage media, which may be non-transitory. The memory 72 may also include high-speed random access memory and non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 72 is at least used to store the following computer program 721, which, after being loaded and executed by the processor 71, can implement the relevant steps of the audio processing method executed by the server side disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 72 may also include an operating system 722, data 723, and the like, stored either temporarily or permanently. The operating system 722 may include Windows, Unix, Linux, and the like.
In some embodiments, the electronic device 70 may further include a display screen 73, an input/output interface 74, a communication interface 75, a sensor 76, a power supply 77, and a communication bus 78.
Of course, the structure of the electronic device shown in FIG. 7 does not limit the electronic device in the embodiments of this application; in practical applications the electronic device may include more or fewer components than shown in FIG. 7, or combine certain components.
In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided; when executed by a processor, the program instructions implement the steps of the audio processing method executed by the server in any of the above embodiments.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Since the apparatus disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively simple; for relevant details, see the description of the method. It should be pointed out that those of ordinary skill in the art can make several improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.
It should also be noted that, in this specification, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between them. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article, or device that includes it.

Claims (11)

  1. An audio processing method, comprising:
    obtaining target dry vocal audio, and determining the start and end times of each lyric word in the target dry vocal audio;
    detecting the key of the target dry vocal audio and the fundamental frequency within each start-end interval, and determining the current note name of each lyric word based on the fundamental frequency and the key;
    raising each lyric word by a corresponding first cent span and by a plurality of different second cent spans to obtain a first harmony and a plurality of different second harmonies respectively, wherein the first cent span is a positive integer number of cents, each of the plurality of different second cent spans is the sum of the first cent span and one of a plurality of different third cent spans, and the first cent span differs from the third cent span by an order of magnitude;
    synthesizing the first harmony and the plurality of different second harmonies into a multi-track harmony;
    mixing the multi-track harmony with the target dry vocal audio to obtain synthesized dry vocal audio.
  2. The audio processing method according to claim 1, wherein detecting the key of the target dry vocal audio comprises:
    extracting audio features of the target dry vocal audio, wherein the audio features include fundamental-frequency features and spectral information;
    inputting the audio features into a key classifier to obtain the key of the target dry vocal audio.
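The claim leaves the key classifier unspecified. As one illustrative possibility (not the patented method, which uses a trained classifier over fundamental-frequency and spectral features), a classic template-correlation estimator over a 12-bin chroma vector can stand in for it; the Krumhansl-Kessler major-key profile used here is an assumption.

```python
# Krumhansl-Kessler major-key profile: perceived weight of each pitch class
# relative to the tonic (index 0 = tonic).
MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]

def correlation(a, b):
    """Pearson correlation between two equal-length vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = (sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b)) ** 0.5
    return num / den if den else 0.0

def estimate_key(chroma):
    """Return the pitch class (0 = C .. 11 = B) whose rotated major profile
    best correlates with the averaged chroma vector of the dry vocal."""
    best_k, best_score = 0, float("-inf")
    for k in range(12):
        # Profile of the major key whose tonic is pitch class k.
        rotated = [MAJOR_PROFILE[(i - k) % 12] for i in range(12)]
        score = correlation(chroma, rotated)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

A learned classifier, as in the claim, replaces the fixed template with weights fitted to labelled recordings, but the interface is the same: features in, key out.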
  3. The audio processing method according to claim 1, wherein after determining the current note name of each lyric character based on the fundamental frequency and the key, the method further comprises:
    performing pitch-up processing by the third cent span on the target dry vocal audio to obtain a third harmony;
    correspondingly, synthesizing the first harmony and the plurality of different second harmonies into a multi-track harmony comprises:
    synthesizing the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony.
  4. The audio processing method according to claim 3, wherein synthesizing the third harmony, the first harmony, and the plurality of different second harmonies into a multi-track harmony comprises:
    determining the volume and delay corresponding to the third harmony, the first harmony, and each second harmony;
    synthesizing the third harmony, the first harmony, and the plurality of second harmonies into a multi-track harmony according to the volume and delay corresponding to each of them.
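The per-track volume and delay synthesis amounts to a gain-and-offset overlay of the harmony tracks. A minimal sample-domain sketch, with delays expressed in samples (the function and parameter names are illustrative, not from the application):

```python
def synthesize_multitrack(tracks, volumes, delays):
    """Overlay harmony tracks into one multi-track harmony: each track is
    scaled by its volume (linear gain) and shifted by its delay (in samples)
    before summing."""
    length = max(len(t) + d for t, d in zip(tracks, delays))
    out = [0.0] * length
    for track, vol, delay in zip(tracks, volumes, delays):
        for i, sample in enumerate(track):
            out[delay + i] += vol * sample
    return out
```

Giving each harmony voice a slightly different gain and onset is what keeps the stacked voices from collapsing into a single louder copy of the lead.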
  5. The audio processing method according to claim 1, wherein after mixing the multi-track harmony and the target dry vocal audio to obtain the synthesized dry vocal audio, the method further comprises:
    adding sound effects to the synthesized dry vocal audio using an effects device;
    obtaining the accompaniment audio corresponding to the synthesized dry vocal audio, and superimposing the accompaniment audio and the synthesized dry vocal audio with added sound effects in a preset manner to obtain a synthesized audio.
  6. The audio processing method according to claim 5, wherein superimposing the accompaniment audio and the synthesized dry vocal audio with added sound effects in a preset manner to obtain the synthesized audio comprises:
    performing power normalization on the accompaniment audio and the synthesized dry vocal audio with added sound effects to obtain an intermediate accompaniment audio and an intermediate dry vocal audio;
    superimposing the intermediate accompaniment audio and the intermediate dry vocal audio according to a preset energy ratio to obtain the synthesized audio.
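Power normalization followed by an energy-ratio overlay can be sketched as follows. This is an assumed realization, not the application's own code: normalization here means scaling each signal to unit RMS power, and the preset energy ratio is modelled as a single `vocal_ratio` weight.

```python
def power_normalize(x):
    """Scale a signal to unit RMS power (returned unchanged if silent)."""
    rms = (sum(s * s for s in x) / len(x)) ** 0.5
    return [s / rms for s in x] if rms > 0 else list(x)

def superimpose(accompaniment, dry_vocal, vocal_ratio=0.6):
    """Power-normalize both signals, then overlay them at a preset energy
    ratio: vocal_ratio for the vocal, (1 - vocal_ratio) for accompaniment."""
    a = power_normalize(accompaniment)
    v = power_normalize(dry_vocal)
    n = min(len(a), len(v))
    return [(1 - vocal_ratio) * a[i] + vocal_ratio * v[i] for i in range(n)]
```

Normalizing before mixing is what makes the preset ratio meaningful: without it, a loud accompaniment and a quiet vocal would end up at an arbitrary balance regardless of the chosen weights.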
  7. The audio processing method according to any one of claims 1 to 6, wherein performing, on each lyric character, pitch-up processing by the corresponding first cent span and by the plurality of different second cent spans to obtain the first harmony and the plurality of different second harmonies, respectively, comprises:
    determining a preset note-name span, and performing pitch-up processing by the preset note-name span on each lyric character to obtain the first harmony, wherein adjacent note names differ by one or two first cent spans;
    performing pitch-up processing by the plurality of different third cent spans on the first harmony to obtain the plurality of different second harmonies.
  8. The audio processing method according to claim 7, wherein performing pitch-up processing by the preset note-name span on each lyric character to obtain the first harmony comprises:
    determining a target note name of each lyric character after pitch-up processing according to the current note name of that lyric character and the preset note-name span;
    determining the number of first cent spans corresponding to each lyric character based on the cent span between the target note name and the current note name of that lyric character;
    performing pitch-up processing on each lyric character by the corresponding number of first cent spans to obtain the first harmony.
  9. An audio processing apparatus, comprising:
    an obtaining module, configured to obtain a target dry vocal audio and determine the start and end times of each lyric character in the target dry vocal audio;
    a detection module, configured to detect the key of the target dry vocal audio and the fundamental frequency within each period of start and end times, and determine the note name of each lyric character based on the fundamental frequency and the key;
    a pitch-up module, configured to perform, on each lyric character, pitch-up processing by a corresponding first cent span and by a plurality of different second cent spans to obtain a first harmony and a plurality of different second harmonies, respectively, wherein the first cent span is a positive integer number of cents, the plurality of different second cent spans are the sums of the first cent span and a plurality of different third cent spans, and the first cent span and the third cent spans differ by one order of magnitude;
    a synthesis module, configured to synthesize the first harmony and the plurality of different second harmonies into a multi-track harmony;
    a mixing module, configured to mix the multi-track harmony and the target dry vocal audio to obtain a synthesized dry vocal audio.
  10. An electronic device, comprising:
    a memory, configured to store a computer program; and
    a processor, configured to implement, when executing the computer program, the steps of the audio processing method according to any one of claims 1 to 8.
  11. A computer-readable storage medium, having a computer program stored thereon, wherein when the computer program is executed by a processor, the steps of the audio processing method according to any one of claims 1 to 8 are implemented.
PCT/CN2021/119539 2020-10-28 2021-09-22 Audio processing method and apparatus, electronic device, and computer-readable storage medium WO2022089097A1

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/034,207 US20230402047A1 (en) 2020-10-28 2021-09-22 Audio processing method and apparatus, electronic device, and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011171384.5 2020-10-28
CN202011171384.5A CN112289300B 2020-10-28 Audio processing method and apparatus, electronic device, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2022089097A1 true WO2022089097A1 (zh) 2022-05-05

Family

ID=74372616

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119539 WO2022089097A1 Audio processing method and apparatus, electronic device, and computer-readable storage medium

Country Status (3)

Country Link
US (1) US20230402047A1 (zh)
CN (1) CN112289300B (zh)
WO (1) WO2022089097A1 (zh)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112289300B 2020-10-28 2024-01-09 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio processing method and apparatus, electronic device, and computer-readable storage medium
CN113035164A 2021-02-24 2021-06-25 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Singing voice generation method and apparatus, electronic device, and storage medium
CN115774539A 2021-09-06 2023-03-10 Beijing Zitiao Network Technology Co., Ltd. Harmony processing method, apparatus, device, and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080262836A1 (en) * 2006-09-04 2008-10-23 National Institute Of Advanced Industrial Science And Technology Pitch estimation apparatus, pitch estimation method, and program
CN108257609A 2017-12-05 2018-07-06 Beijing Xiaochang Technology Co., Ltd. Method for audio content correction and intelligent apparatus therefor
CN109785820A 2019-03-01 2019-05-21 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Processing method, apparatus, and device
CN109920446A 2019-03-12 2019-06-21 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio data processing method, apparatus, and computer storage medium
CN109949783A 2019-01-18 2019-06-28 Suzhou AISpeech Information Technology Co., Ltd. Song synthesis method and system
CN112289300A 2020-10-28 2021-01-29 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Audio processing method and apparatus, electronic device, and computer-readable storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2929213C (en) * 2013-10-30 2019-07-09 Music Mastermind, Inc. System and method for enhancing audio, conforming an audio input to a musical key, and creating harmonizing tracks for an audio input
CN108831437B * 2018-06-15 2020-09-01 Baidu Online Network Technology (Beijing) Co., Ltd. Singing voice generation method, apparatus, terminal, and storage medium
CN110010162A * 2019-02-28 2019-07-12 Huawei Technologies Co., Ltd. Song recording method, pitch correction method, and electronic device
CN111681637B * 2020-04-28 2024-03-22 Ping An Technology (Shenzhen) Co., Ltd. Song synthesis method, apparatus, device, and storage medium


Also Published As

Publication number Publication date
US20230402047A1 (en) 2023-12-14
CN112289300A (zh) 2021-01-29
CN112289300B (zh) 2024-01-09

Similar Documents

Publication Publication Date Title
US10789290B2 (en) Audio data processing method and apparatus, and computer storage medium
WO2022089097A1 Audio processing method and apparatus, electronic device, and computer-readable storage medium
CN106898340B Song synthesis method and terminal
Tachibana et al. Singing voice enhancement in monaural music signals based on two-stage harmonic/percussive sound separation on multiple resolution spectrograms
CN108831437B Singing voice generation method, apparatus, terminal, and storage medium
CN104272382B Method and system for template-based personalized singing synthesis
WO2020177190A1 Processing method, apparatus, and device
CN111445892B Song generation method and apparatus, readable medium, and electronic device
CN112382257B Audio processing method, apparatus, device, and medium
US20070289432A1 (en) Creating music via concatenative synthesis
Tachibana et al. Harmonic/percussive sound separation based on anisotropic smoothness of spectrograms
CN105957515A Voice synthesis method, voice synthesis apparatus, and medium storing a voice synthesis program
JP5598516B2 Speech synthesis system for karaoke and parameter extraction apparatus
JP7497523B2 Method and apparatus for synthesizing a custom-timbre singing voice, electronic device, and storage medium
US20230186782A1 (en) Electronic device, method and computer program
CN112669811B Song processing method and apparatus, electronic device, and readable storage medium
JP2013164609A Database generation apparatus for singing synthesis, and pitch curve generation apparatus
CN112992110B Audio processing method, apparatus, computing device, and medium
CN112164387A Audio synthesis method and apparatus, electronic device, and computer-readable storage medium
CN114743526A Audio adjustment method, computer device, and computer program product
JP2013210501A Segment registration apparatus, speech synthesis apparatus, and program
CN113421544B Singing voice synthesis method and apparatus, computer device, and storage medium
CN112071299A Neural network model training method, audio generation method and apparatus, and electronic device
Schwabe et al. Dual task monophonic singing transcription
Bous A neural voice transformation framework for modification of pitch and intensity

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21884827

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 16.08.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21884827

Country of ref document: EP

Kind code of ref document: A1