CN109147757B - Singing voice synthesis method and device

Info

Publication number: CN109147757B
Authority: CN (China)
Prior art keywords: word, user, song, voice, frequency
Legal status: Active
Application number: CN201811056146.2A
Other languages: Chinese (zh)
Other versions: CN109147757A
Inventor: 劳振锋
Current Assignee: Guangzhou Kugou Computer Technology Co Ltd
Original Assignee: Guangzhou Kugou Computer Technology Co Ltd
Application filed 2018-09-11 by Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201811056146.2A (priority date 2018-09-11)
Publication of CN109147757A: 2019-01-04
Application granted; publication of CN109147757B: 2021-07-02

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033 Voice editing, e.g. manipulating the voice of the synthesiser

Abstract

The invention discloses a singing voice synthesis method and device, belonging to the technical field of speech synthesis. The method comprises the following steps: when user voice is acquired, extracting the fundamental frequency, envelope and consonant information of each word in the user voice; adjusting the fundamental frequency of each word in the user voice according to the pitch frequency of each word in a song, wherein the pitch frequency of each word in the song is the frequency corresponding to the pitch of that word in the song; synthesizing the adjusted fundamental frequency with the envelope and consonant information of each word in the user voice to obtain a synthesized audio; and adjusting the duration of each word in the synthesized audio according to the duration of each word in the song to obtain the synthesized singing voice of the user. Because the invention synthesizes the user's singing voice from the user's original envelope and consonant information, the user's original timbre can be preserved, and the synthesized singing voice is closer to the user's own voice.

Description

Singing voice synthesis method and device
Technical Field
The invention relates to the technical field of speech synthesis, and in particular to a singing voice synthesis method and a singing voice synthesis device.
Background
With the development of speech synthesis technology, it is gradually being applied in people's daily lives. For example, some users who cannot sing in tune may wish to simply read out a song's lyrics and have their own singing voice generated from that reading, which can be realized with speech synthesis technology.
At present, the related art generally recognizes the user's spoken voice, finds the corresponding built-in singing voice in a speech synthesis database, extracts the timbre of that singing voice, and then uses a pre-established conversion model to change the timbre of the singing voice into the user's timbre, obtaining the synthesized singing voice of the user. The conversion model is used to convert the timbre of the built-in singing voice in the speech synthesis database into the timbre of the user.
Because the above technology uses the built-in timbre from the speech synthesis database to synthesize the user's singing voice, it cannot preserve the user's original timbre, and the synthesized singing voice differs noticeably from the user's own voice.
Disclosure of Invention
The embodiments of the invention provide a singing voice synthesis method and a singing voice synthesis device, which can solve the problem in the related art that the synthesized singing voice of a user differs greatly from the user's own voice. The technical scheme is as follows:
in a first aspect, there is provided a singing voice synthesis method, comprising:
when user voice is acquired, extracting fundamental frequency, envelope and consonant information of each word in the user voice;
adjusting the fundamental frequency of each word in the user voice according to the pitch frequency of each word in the song, wherein the pitch frequency of each word in the song is the frequency corresponding to the pitch of each word in the song;
synthesizing the adjusted fundamental frequency, the envelope of each word in the user voice and the consonant information to obtain a synthesized audio;
and adjusting the time length of each word in the synthesized audio according to the time length of each word in the song to obtain the synthesized singing voice of the user.
In one possible implementation, the adjusting the fundamental frequency of each word in the user speech according to the pitch frequency of each word in the song includes:
and adjusting the fundamental frequency of each word in the user voice to the pitch frequency of the corresponding word in the song according to the pitch frequency of each word in the song.
In one possible implementation, the adjusting the fundamental frequency of each word in the user's speech to the pitch frequency of the corresponding word in the song according to the pitch frequency of each word in the song includes:
and for each word in the song, when the word has a plurality of pitch frequencies, adjusting the fundamental frequency of the word in the user voice according to the sequencing and proportion of the plurality of pitch frequencies.
In one possible implementation, the extracting fundamental frequency, envelope and consonant information of each word in the user speech includes:
and extracting fundamental frequency, envelope and consonant information of each word in the user voice through a feature extraction algorithm, wherein a preset number of fundamental frequencies is extracted for each word, and the preset number is determined according to the extraction frequency.
In one possible implementation, the adjusting the fundamental frequency of each word in the user speech according to the pitch frequency of each word in the song includes:
for each word in the user voice, adjusting the preset number of fundamental frequencies of the word to the pitch frequency of that word in the song.
In one possible implementation manner, the adjusting the duration of each word in the synthesized audio according to the duration of each word in the song to obtain the synthesized singing voice of the user includes:
and adjusting the time length of each word in the synthesized audio to the time length of the corresponding word in the song according to the time length of each word in the song, to obtain the synthesized singing voice of the user.
In a second aspect, there is provided a singing voice synthesizing apparatus comprising:
the extraction module is used for extracting the fundamental frequency, envelope and consonant information of each word in the user voice when the user voice is acquired;
the adjusting module is used for adjusting the fundamental frequency of each word in the user voice according to the pitch frequency of each word in the song, wherein the pitch frequency of each word in the song is the frequency corresponding to the pitch of each word in the song;
the synthesis module is used for carrying out synthesis processing on the adjusted fundamental frequency, the envelope of each word in the user voice and the consonant information to obtain a synthetic audio;
the adjusting module is further used for adjusting the duration of each word in the synthesized audio according to the duration of each word in the song to obtain the synthesized singing voice of the user.
In one possible implementation, the adjusting module is configured to adjust the fundamental frequency of each word in the user speech to the pitch frequency of the corresponding word in the song according to the pitch frequency of each word in the song.
In one possible implementation, the adjusting module is configured to, for each word in the song, adjust a fundamental frequency of the word in the user speech according to a ranking and a proportion of the plurality of pitch frequencies when the word has a plurality of pitch frequencies.
In a possible implementation manner, the extraction module is configured to extract, through a feature extraction algorithm, fundamental frequency, envelope and consonant information of each word in the user speech, and a preset number of fundamental frequencies are extracted for each word, where the preset number is determined according to an extraction frequency.
In one possible implementation, the adjusting module is configured to, for each word in the user speech, adjust the preset number of fundamental frequencies of the word to the pitch frequency of that word in the song.
In a possible implementation manner, the adjusting module is configured to adjust the duration of each word in the synthesized audio to the duration of the corresponding word in the song according to the duration of each word in the song, so as to obtain the synthesized singing voice of the user.
In a third aspect, a computer device is provided, comprising a processor and a memory; the memory is used for storing a computer program; the processor is configured to execute the computer program stored in the memory to implement the method steps of any one of the implementation manners of the first aspect.
In a fourth aspect, a computer-readable storage medium is provided, in which a computer program is stored which, when being executed by a processor, carries out the method steps of any of the implementations of the first aspect.
The technical scheme provided by the embodiments of the invention has at least the following beneficial effects:
the fundamental frequency of each word spoken by a user is adjusted according to the pitch frequency of each word in a song; the adjusted fundamental frequency, the user's original envelope and the consonant information are synthesized into audio; and the duration of each word spoken by the user is adjusted according to the duration of each word in the song, so that the singing voice of the user is synthesized. Because this scheme uses the user's original envelope and consonant information to synthesize the singing voice, the user's original timbre is preserved, and the synthesized singing voice is closer to the user's own voice.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings in the following description are obviously only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a singing voice synthesizing method according to an embodiment of the present invention;
fig. 2 is a flowchart of a singing voice synthesizing method according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating a correspondence between pitch and frequency provided by an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a singing voice synthesizing apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device 500 according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Fig. 1 is a flowchart of a singing voice synthesizing method according to an embodiment of the present invention. Referring to fig. 1, the method includes:
101. When the user voice is acquired, extract the fundamental frequency, envelope and consonant information of each word in the user voice.
102. Adjust the fundamental frequency of each word in the user voice according to the pitch frequency of each word in the song, where the pitch frequency of each word in the song is the frequency corresponding to the pitch of that word in the song.
103. Synthesize the adjusted fundamental frequency with the envelope and consonant information of each word in the user voice to obtain a synthesized audio.
104. Adjust the duration of each word in the synthesized audio according to the duration of each word in the song to obtain the synthesized singing voice of the user.
According to the method provided by the embodiment of the invention, after the fundamental frequency of each word spoken by the user is adjusted according to the pitch frequency of each word in the song, the adjusted fundamental frequency is synthesized with the user's original envelope and consonant information into audio, and the duration of each word spoken by the user is then adjusted according to the duration of each word in the song, so that the singing voice of the user is synthesized. Because this scheme uses the user's original envelope and consonant information to synthesize the singing voice, the user's original timbre is preserved, and the synthesized singing voice is closer to the user's own voice.
In one possible implementation, the adjusting the fundamental frequency of each word in the user's speech according to the pitch frequency of each word in the song includes:
and adjusting the fundamental frequency of each word in the voice of the user to the pitch frequency of the corresponding word in the song according to the pitch frequency of each word in the song.
In one possible implementation, the adjusting the fundamental frequency of each word in the user's speech to the pitch frequency of the corresponding word in the song according to the pitch frequency of each word in the song includes:
and for each word in the song, when the word has a plurality of pitch frequencies, adjusting the fundamental frequency of the word in the voice of the user according to the sequence and the proportion of the plurality of pitch frequencies.
In one possible implementation, the extracting fundamental frequency, envelope and consonant information of each word in the user speech includes:
and extracting fundamental frequency, envelope and consonant information of each word in the user voice through a feature extraction algorithm, wherein a preset number of fundamental frequencies is extracted for each word, and the preset number is determined according to the extraction frequency.
In one possible implementation, the adjusting the fundamental frequency of each word in the user's speech according to the pitch frequency of each word in the song includes:
for each word in the user's voice, the preset number of fundamental frequencies of the word is adjusted to the pitch frequency of the word in the song.
In one possible implementation, the adjusting the duration of each word in the synthesized audio according to the duration of each word in the song to obtain the synthesized user singing voice includes:
and adjusting the time length of each word in the synthesized audio to the time length of the corresponding word in the song according to the time length of each word in the song to obtain the synthesized singing voice of the user.
All the above-mentioned optional technical solutions can be combined arbitrarily to form the optional embodiments of the present invention, and are not described herein again.
Fig. 2 is a flowchart of a singing voice synthesizing method according to an embodiment of the present invention. The method is performed by an electronic device, see fig. 2, the method comprising:
201. When the user voice is acquired, extract the fundamental frequency, envelope and consonant information of each word in the user voice.
In the embodiment of the invention, the user can input speech to the electronic device. For example, the electronic device may be installed with a designated application that has a singing voice synthesis function. When the user wants to synthesize his or her own singing voice, a corresponding operation can trigger the electronic device to display the voice input interface of the designated application; while the interface is displayed, the user can speak to the device, for example reading out the lyrics of a certain song, so that the electronic device acquires the user voice.
Further, the electronic device may extract features of the user voice, including the fundamental frequency, envelope and consonant information of each word in the user voice. In one possible implementation, the electronic device may extract the fundamental frequency, envelope and consonant information of each word through a feature extraction algorithm, with a preset number of fundamental frequencies extracted for each word, where the preset number is determined according to the extraction frequency.
For example, the electronic device may extract the features of the user voice through the feature extraction algorithms included in the WORLD tool. The feature extraction algorithms may include a fundamental frequency extraction algorithm, an envelope extraction algorithm and a consonant extraction algorithm, each producing the corresponding feature: the fundamental frequency extraction algorithm extracts the fundamental frequency information of the user voice, the envelope extraction algorithm extracts the envelope information, and the consonant extraction algorithm extracts the consonant information.
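To make the extraction step concrete, the following is a minimal sketch using the open-source pyworld bindings for the WORLD tool. The function name extract_features is illustrative, per-word segmentation (for example by forced alignment) is assumed to happen elsewhere, and the patent's "consonant information" is taken here to correspond to WORLD's aperiodicity stream, which is an assumption rather than something the patent states.

```python
# A minimal sketch of step 201 with the pyworld bindings for the WORLD vocoder.
# Per-word segmentation of the recording is assumed to happen elsewhere.
import numpy as np
import pyworld as pw

def extract_features(x: np.ndarray, fs: int, frame_period: float = 5.0):
    """Return F0 track, spectral envelope, and aperiodicity for mono float64 audio.

    With one frame every `frame_period` ms, a word lasting T ms yields roughly
    T / frame_period F0 values: the "preset number" per word, fixed by the
    extraction frequency.
    """
    f0, t = pw.dio(x, fs, frame_period=frame_period)  # coarse F0 estimation
    f0 = pw.stonemask(x, f0, t, fs)                   # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)                  # spectral envelope (carries timbre)
    ap = pw.d4c(x, f0, t, fs)                         # aperiodicity (consonant/noise part)
    return f0, sp, ap
```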
202. Adjust the fundamental frequency of each word in the user voice to the pitch frequency of the corresponding word in the song, according to the pitch frequency of each word in the song.
The pitch frequency of each word in the song is the frequency corresponding to the pitch of that word. Referring to fig. 3, which is a schematic diagram of the correspondence between pitch and frequency provided by an embodiment of the present invention, the pitch of each word in a song can be converted into the corresponding frequency according to twelve-tone equal temperament. For example, the pitches in the first column of fig. 3 correspond to the frequencies in the fourth column, and the electronic device can convert the pitch of each word in the song into the corresponding frequency according to this correspondence, obtaining the pitch frequency of each word.
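As a worked example of this conversion (the exact reference pitch of fig. 3 is not reproduced in the text, so the standard convention of A4 = 440 Hz is assumed), each semitone in twelve-tone equal temperament scales the frequency by 2^(1/12):

```python
# Pitch-to-frequency conversion under twelve-tone equal temperament, assuming
# the standard MIDI convention with A4 (note 69) tuned to 440 Hz.
def midi_to_hz(note: int) -> float:
    return 440.0 * 2.0 ** ((note - 69) / 12.0)

# midi_to_hz(69) -> 440.0 (A4); midi_to_hz(60) -> ~261.63 (middle C)
```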
In an embodiment of the present invention, for each word in the user voice, the electronic device may adjust the fundamental frequency of the word to the pitch frequency of the corresponding word in the song. For each word in the user voice, the corresponding word in the song is the same word; of course, the corresponding word may merely share the same pronunciation, which is not limited by the embodiment of the present invention.
For each word in the user voice, the electronic device may adjust the word's preset number of fundamental frequencies to the pitch frequency of the corresponding word in the song.
In one possible implementation, for each word in the song, when the word has a plurality of pitch frequencies, the fundamental frequency of the word in the user voice is adjusted according to the ordering and proportion of the plurality of pitch frequencies. Taking word A in a song as an example, suppose word A has three pitch frequencies, frequency 1, frequency 2 and frequency 3, ordered frequency 1 -> frequency 2 -> frequency 3, where frequency 1 comes first and accounts for 50% of the word, frequency 2 is in the middle and accounts for 30%, and frequency 3 comes last and accounts for 20%. The electronic device may then adjust the first 50% of the fundamental frequencies of word A in the user voice to frequency 1, the middle 30% to frequency 2, and the last 20% to frequency 3.
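A minimal sketch of this ordering-and-proportion rule follows; retune_word and its arguments are illustrative names, the frequencies in the usage comment are placeholders for the example's frequency 1/2/3, and only voiced frames (F0 > 0) are retargeted, on the assumption that unvoiced consonant frames should keep a zero F0.

```python
import numpy as np

def retune_word(f0_word: np.ndarray, pitches_hz: list, proportions: list) -> np.ndarray:
    """Set successive shares of a word's F0 frames to its ordered pitch frequencies."""
    out = f0_word.copy()
    n = len(out)
    edges = [0] + list(np.cumsum([int(round(p * n)) for p in proportions]))
    edges[-1] = n  # absorb rounding error in the last segment
    for (a, b), hz in zip(zip(edges[:-1], edges[1:]), pitches_hz):
        segment = out[a:b]
        segment[segment > 0] = hz  # retarget voiced frames only
    return out

# Word A from the example: three pitches with shares 50% / 30% / 20%
# f0_a = retune_word(f0_a, [262.0, 294.0, 330.0], [0.5, 0.3, 0.2])
```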
It should be noted that this step 202 is one possible implementation of adjusting the fundamental frequency of each word in the user voice according to the pitch frequency of each word in the song. Since the fundamental frequency determines the pitch, adjusting the fundamental frequency of each word in the user voice to the pitch frequency of the corresponding word in the song makes the pitch of each word in the user voice the same as that of the corresponding word in the song.
203. Synthesize the adjusted fundamental frequency with the envelope and consonant information of each word in the user voice to obtain a synthesized audio.
In the embodiment of the invention, after the electronic device adjusts the fundamental frequency of each word in the user voice, the adjusted fundamental frequency can be synthesized with the user's original envelope and consonant information into audio.
In one possible implementation, the electronic device may perform the synthesis step using the WORLD tool. For example, the WORLD tool may contain a speech synthesis algorithm, and accordingly the electronic device may synthesize the adjusted fundamental frequency, the envelope of the words in the user voice and the consonant information into audio through the speech synthesis algorithm. Because the audio is synthesized from the user's original envelope and consonant information, and the envelope determines the user's timbre, this way of synthesizing audio preserves the user's original timbre.
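Continuing the pyworld sketch from step 201, with the same caveats, the synthesis step might look as follows; because the envelope is passed through unchanged, the timbre of the result stays the user's own:

```python
import pyworld as pw

def synthesize_audio(f0_retuned, sp, ap, fs: int, frame_period: float = 5.0):
    """Recombine the retuned F0 with the user's original envelope and aperiodicity.

    sp (the spectral envelope) is not modified, so the synthesized audio keeps
    the user's original timbre; only the pitch contour has changed.
    """
    return pw.synthesize(f0_retuned, sp, ap, fs, frame_period)
```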
204. Adjust the duration of each word in the synthesized audio to the duration of the corresponding word in the song, according to the duration of each word in the song, to obtain the synthesized singing voice of the user.
In this embodiment of the present invention, in the synthesized audio obtained in step 203, only the pitch of each word matches the pitch of the corresponding word in the song. To obtain the singing voice of the user, for each word in the synthesized audio, the electronic device may adjust the duration of the word to the duration of the corresponding word in the song through a speed-change algorithm, such as a time-stretching algorithm that changes speed without changing pitch. After the electronic device adjusts the durations of the words in the audio, the synthesized singing voice of the user is obtained; that is, the user's spoken voice has been turned into singing.
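The following is a minimal sketch of this step, under the assumption that the per-word sample spans in the synthesized audio are known; librosa's phase-vocoder time_stretch stands in for the unnamed speed-change algorithm, which is a substitution rather than the patent's stated method:

```python
import numpy as np
import librosa

def fit_word_durations(y: np.ndarray, fs: int, word_spans: list, target_secs: list) -> np.ndarray:
    """Stretch each word span to its target duration without changing pitch.

    word_spans: list of (start_sample, end_sample) per word in the synthesized audio.
    target_secs: duration of the corresponding word in the song, in seconds.
    """
    pieces = []
    for (a, b), target in zip(word_spans, target_secs):
        segment = y[a:b].astype(np.float32)
        rate = ((b - a) / fs) / target  # rate > 1 shortens the word, rate < 1 lengthens it
        pieces.append(librosa.effects.time_stretch(segment, rate=rate))
    return np.concatenate(pieces)
```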
It should be noted that this step 204 is one possible implementation of adjusting the duration of each word in the synthesized audio according to the duration of each word in the song to obtain the synthesized singing voice of the user. The synthesized audio becomes the user's singing voice once the duration of each word in it is adjusted to the duration of the corresponding word in the song. Because the user's original envelope and consonant information are used to synthesize the singing voice, the user's original timbre is preserved, and the synthesized singing voice is closer to the user's own voice.
According to the technical scheme provided by the embodiment of the invention, the fundamental frequency, envelope and consonant information of each word spoken by the user are extracted; the fundamental frequency of each spoken word is changed to the pitch frequency of the corresponding word in the song according to the pitch frequency of each word of the song; the adjusted fundamental frequency is then synthesized with the user's original envelope and consonant information; and the speed (duration) of each spoken word is changed according to the duration of each word of the song. Because the envelope and consonant information used by the scheme are the user's own, the timbre is closer to the user's voice, and the synthesized singing voice is more natural.
According to the method provided by the embodiment of the invention, after the fundamental frequency of each word spoken by the user is adjusted according to the pitch frequency of each word in the song, the adjusted fundamental frequency is synthesized with the user's original envelope and consonant information into audio, and the duration of each word spoken by the user is then adjusted according to the duration of each word in the song, so that the singing voice of the user is synthesized. Because this scheme uses the user's original envelope and consonant information to synthesize the singing voice, the user's original timbre is preserved, and the synthesized singing voice is closer to the user's own voice.
Fig. 4 is a schematic structural diagram of a singing voice synthesizing apparatus according to an embodiment of the present invention. Referring to fig. 4, the apparatus includes:
the extracting module 401 is configured to, when a user voice is obtained, extract fundamental frequency, envelope and consonant information of each word in the user voice;
an adjusting module 402, configured to adjust a fundamental frequency of each word in the user speech according to a pitch frequency of each word in a song, where the pitch frequency of each word in the song is a frequency corresponding to the pitch of each word in the song;
a synthesis module 403, configured to perform synthesis processing on the adjusted fundamental frequency, the envelope of each word in the user speech, and the consonant information to obtain a synthesized audio;
the adjusting module 402 is further configured to adjust the duration of each word in the synthesized audio according to the duration of each word in the song, so as to obtain the synthesized user singing voice.
In one possible implementation, the adjusting module 402 is configured to adjust the fundamental frequency of each word in the user's voice to the pitch frequency of the corresponding word in the song according to the pitch frequency of each word in the song.
In one possible implementation, the adjusting module 402 is configured to, for each word in the song, adjust a fundamental frequency of the word in the user's speech according to a ranking and a proportion of a plurality of pitch frequencies when the word has the plurality of pitch frequencies.
In a possible implementation manner, the extraction module 401 is configured to extract the fundamental frequency, envelope and consonant information of each word in the user voice through a feature extraction algorithm, where a preset number of fundamental frequencies is extracted for each word, and the preset number is determined according to the extraction frequency.
In one possible implementation, the adjusting module 402 is configured to, for each word in the user voice, adjust the word's preset number of fundamental frequencies to the pitch frequency of that word in the song.
In one possible implementation manner, the adjusting module 402 is configured to adjust the duration of each word in the synthesized audio to the duration of the corresponding word in the song according to the duration of each word in the song, so as to obtain the synthesized singing voice of the user.
In the embodiment of the invention, the fundamental frequency of each word spoken by the user is adjusted according to the pitch frequency of each word in the song; the adjusted fundamental frequency is synthesized with the user's original envelope and consonant information into audio; and the duration of each word spoken by the user is adjusted according to the duration of each word in the song, so that the singing voice of the user is synthesized. Because this scheme uses the user's original envelope and consonant information to synthesize the singing voice, the user's original timbre is preserved, and the synthesized singing voice is closer to the user's own voice.
It should be noted that: in the singing voice synthesizing apparatus provided in the above embodiment, only the division of the above functional modules is used for illustration when the singing voice is synthesized, and in practical applications, the above functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the above described functions. In addition, the singing voice synthesizing device provided by the above embodiment and the singing voice synthesizing method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment and is not described again.
Fig. 5 is a schematic structural diagram of an electronic device 500 according to an embodiment of the present invention. The electronic device 500 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III), an MP4 player (Moving Picture Experts Group Audio Layer IV), a notebook computer, or a desktop computer. The electronic device 500 may also be referred to by other names such as user equipment, portable electronic device, laptop electronic device, desktop electronic device, and so on.
In general, the electronic device 500 includes: a processor 501 and a memory 502.
The processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on. The processor 501 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 501 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 501 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 501 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
Memory 502 may include one or more computer-readable storage media, which may be non-transitory. Memory 502 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 502 is used to store at least one instruction for execution by processor 501 to implement the singing voice synthesis methods provided by method embodiments herein.
In some embodiments, the electronic device 500 may further optionally include: a peripheral interface 503 and at least one peripheral. The processor 501, memory 502 and peripheral interface 503 may be connected by a bus or signal lines. Each peripheral may be connected to the peripheral interface 503 by a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 504, display screen 505, camera 506, audio circuitry 507, positioning components 508, and power supply 509.
The peripheral interface 503 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 501 and the memory 502. In some embodiments, the processor 501, memory 502, and peripheral interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 501, the memory 502, and the peripheral interface 503 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 504 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 504 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 504 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 504 may communicate with other electronic devices via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 504 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 505 is used to display a UI (user interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 505 is a touch display screen, the display screen 505 also has the ability to capture touch signals on or over the surface of the display screen 505. The touch signal may be input to the processor 501 as a control signal for processing. At this point, the display screen 505 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display screen 505 may be one, providing the front panel of the electronic device 500; in other embodiments, the display screens 505 may be at least two, respectively disposed on different surfaces of the electronic device 500 or in a folded design; in still other embodiments, the display 505 may be a flexible display disposed on a curved surface or on a folded surface of the electronic device 500. Even more, the display screen 505 can be arranged in a non-rectangular irregular figure, i.e. a shaped screen. The Display screen 505 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and other materials.
The camera assembly 506 is used to capture images or video. Optionally, camera assembly 506 includes a front camera and a rear camera. Generally, a front camera is disposed on a front panel of an electronic apparatus, and a rear camera is disposed on a rear surface of the electronic apparatus. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 506 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
Audio circuitry 507 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 501 for processing, or inputting the electric signals to the radio frequency circuit 504 to realize voice communication. For stereo capture or noise reduction purposes, the microphones may be multiple and disposed at different locations of the electronic device 500. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves. The loudspeaker can be a traditional diaphragm loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 507 may also include a headphone jack.
The positioning component 508 is used to locate the current geographic location of the electronic device 500 for navigation or LBS (Location Based Service). The positioning component 508 may be based on the United States' GPS (Global Positioning System), China's BeiDou system, Russia's GLONASS system, or the European Union's Galileo system.
The power supply 509 is used to power the various components in the electronic device 500. The power source 509 may be alternating current, direct current, disposable or rechargeable. When power supply 509 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the electronic device 500 also includes one or more sensors 510. The one or more sensors 510 include, but are not limited to: acceleration sensor 511, gyro sensor 512, pressure sensor 513, fingerprint sensor 514, optical sensor 515, and proximity sensor 516.
The acceleration sensor 511 may detect the magnitude of acceleration on three coordinate axes of a coordinate system established with the electronic device 500. For example, the acceleration sensor 511 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 501 may control the touch screen 505 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 511. The acceleration sensor 511 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 512 may detect a body direction and a rotation angle of the electronic device 500, and the gyro sensor 512 may cooperate with the acceleration sensor 511 to acquire a 3D motion of the user on the electronic device 500. The processor 501 may implement the following functions according to the data collected by the gyro sensor 512: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensors 513 may be disposed on a side bezel of the electronic device 500 and/or on an underlying layer of the touch display screen 505. When the pressure sensor 513 is disposed on the side frame of the electronic device 500, the holding signal of the user to the electronic device 500 can be detected, and the processor 501 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 513. When the pressure sensor 513 is disposed at the lower layer of the touch display screen 505, the processor 501 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 505. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 514 is used for collecting a fingerprint of the user, and the processor 501 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 514, or the fingerprint sensor 514 identifies the identity of the user according to the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 501 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 514 may be disposed on the front, back, or side of the electronic device 500. When a physical button or vendor Logo is provided on the electronic device 500, the fingerprint sensor 514 may be integrated with the physical button or vendor Logo.
The optical sensor 515 is used to collect the ambient light intensity. In one embodiment, the processor 501 may control the display brightness of the touch display screen 505 based on the ambient light intensity collected by the optical sensor 515. Specifically, when the ambient light intensity is high, the display brightness of the touch display screen 505 is increased; when the ambient light intensity is low, the display brightness of the touch display screen 505 is turned down. In another embodiment, processor 501 may also dynamically adjust the shooting parameters of camera head assembly 506 based on the ambient light intensity collected by optical sensor 515.
A proximity sensor 516, also known as a distance sensor, is typically disposed on the front panel of the electronic device 500. The proximity sensor 516 is used to capture the distance between the user and the front of the electronic device 500. In one embodiment, when the proximity sensor 516 detects that the distance between the user and the front surface of the electronic device 500 gradually decreases, the processor 501 controls the touch display screen 505 to switch from the bright-screen state to the dark-screen state; when the proximity sensor 516 detects that the distance between the user and the front surface of the electronic device 500 gradually increases, the processor 501 controls the touch display screen 505 to switch from the dark-screen state to the bright-screen state.
Those skilled in the art will appreciate that the configuration shown in fig. 5 is not intended to be limiting of the electronic device 500 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
In an exemplary embodiment, there is also provided a computer-readable storage medium, such as a memory, storing a computer program which, when executed by a processor, implements the singing voice synthesizing method in the above-described embodiments. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (8)

1. A method of synthesizing singing voice, the method comprising:
when user voice is obtained, extracting fundamental frequency, envelope and consonant information of each word in the user voice through a feature extraction algorithm, wherein a preset number of fundamental frequencies is extracted for each word, the preset number is determined according to an extraction frequency, and the user voice is the voice of the user reading the lyrics of a song;
for each word in the user voice, adjusting the preset number of fundamental frequencies of the word to the pitch frequency of that word in the song, wherein the pitch frequency of each word in the song is the frequency corresponding to the pitch of that word in the song;
synthesizing the adjusted fundamental frequency, the envelope of each word in the user voice and the consonant information to obtain a synthesized audio;
and adjusting the time length of each word in the synthesized audio according to the time length of each word in the song to obtain the synthesized singing voice of the user.
2. The method of claim 1, further comprising:
and for each word in the song, when the word has a plurality of pitch frequencies, adjusting the fundamental frequency of the word in the user voice according to the sequencing and proportion of the plurality of pitch frequencies.
3. The method of claim 1, wherein said adjusting the duration of each word in the synthesized audio based on the duration of each word in the song to obtain the synthesized singing voice of the user comprises:
and adjusting the time length of each word in the synthesized audio to the time length of the corresponding word in the song according to the time length of each word in the song, to obtain the synthesized singing voice of the user.
4. A singing voice synthesizing apparatus, characterized in that the apparatus comprises:
the extraction module is used for extracting the fundamental frequency, envelope and consonant information of each word in the user voice through a feature extraction algorithm when the user voice is obtained, wherein a preset number of fundamental frequencies is extracted for each word, the preset number is determined according to an extraction frequency, and the user voice is the voice of the user reading the lyrics of a song;
an adjusting module, configured to adjust, for each word in the user voice, the preset number of fundamental frequencies of the word to the pitch frequency of that word in the song, where the pitch frequency of each word in the song is the frequency corresponding to the pitch of that word in the song;
the synthesis module is used for carrying out synthesis processing on the adjusted fundamental frequency, the envelope of each word in the user voice and the consonant information to obtain a synthetic audio;
the adjusting module is further used for adjusting the duration of each word in the synthesized audio according to the duration of each word in the song to obtain the synthesized singing voice of the user.
5. The apparatus of claim 4, wherein the adjustment module is configured to, for each word in the song, adjust a fundamental frequency of the word in the user speech according to a ranking and a proportion of a plurality of pitch frequencies when the word has the plurality of pitch frequencies.
6. The apparatus of claim 4, wherein the adjusting module is configured to adjust the duration of each word in the synthesized audio to the duration of the corresponding word in the song according to the duration of each word in the song, so as to obtain the synthesized singing voice of the user.
7. An electronic device comprising a processor and a memory; the memory is used for storing a computer program; the processor is configured to execute the computer program stored in the memory to implement the method steps of any one of claims 1-3.
8. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of claims 1-3.
CN201811056146.2A (priority 2018-09-11, filed 2018-09-11) Singing voice synthesis method and device, Active, CN109147757B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201811056146.2A (CN109147757B) | 2018-09-11 | 2018-09-11 | Singing voice synthesis method and device

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201811056146.2A (CN109147757B) | 2018-09-11 | 2018-09-11 | Singing voice synthesis method and device

Publications (2)

Publication Number | Publication Date
CN109147757A (en) | 2019-01-04
CN109147757B (en) | 2021-07-02

Family

ID=64824403

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201811056146.2A (Active, CN109147757B (en)) | Singing voice synthesis method and device | 2018-09-11 | 2018-09-11

Country Status (1)

Country Link
CN (1) CN109147757B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110148394B (en) * 2019-04-26 2024-03-01 平安科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing device, computer equipment and storage medium
CN112417201A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Audio information pushing method and system, electronic equipment and computer readable medium
CN110600034B (en) * 2019-09-12 2021-12-03 广州酷狗计算机科技有限公司 Singing voice generation method, singing voice generation device, singing voice generation equipment and storage medium
CN112951198A (en) * 2019-11-22 2021-06-11 微软技术许可有限责任公司 Singing voice synthesis
CN111091807B (en) * 2019-12-26 2023-05-26 广州酷狗计算机科技有限公司 Speech synthesis method, device, computer equipment and storage medium
CN111402842B (en) * 2020-03-20 2021-11-19 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating audio
CN111681637B (en) * 2020-04-28 2024-03-22 平安科技(深圳)有限公司 Song synthesis method, device, equipment and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3815347B2 (en) * 2002-02-27 2006-08-30 ヤマハ株式会社 Singing synthesis method and apparatus, and recording medium
CN1703735A (en) * 2002-07-29 2005-11-30 埃森图斯有限责任公司 System and method for musical sonification of data
FR2852778B1 (en) * 2003-03-21 2005-07-22 Cit Alcatel TERMINAL OF TELECOMMUNICATION
CN100524456C (en) * 2003-08-06 2009-08-05 雅马哈株式会社 Singing voice synthesizing method
CN101727902B (en) * 2008-10-29 2011-08-10 中国科学院自动化研究所 Method for estimating tone
CN104464725B (en) * 2014-12-30 2017-09-05 福建凯米网络科技有限公司 A kind of method and apparatus imitated of singing
CN106971703A (en) * 2017-03-17 2017-07-21 西北师范大学 A kind of song synthetic method and device based on HMM
CN106898340B (en) * 2017-03-30 2021-05-28 腾讯音乐娱乐(深圳)有限公司 Song synthesis method and terminal

Also Published As

Publication number Publication date
CN109147757A (en) 2019-01-04

Similar Documents

Publication Publication Date Title
CN109147757B (en) Singing voice synthesis method and device
CN108538302B (en) Method and apparatus for synthesizing audio
CN110933330A (en) Video dubbing method and device, computer equipment and computer-readable storage medium
CN108965757B (en) Video recording method, device, terminal and storage medium
CN108965922B (en) Video cover generation method and device and storage medium
CN110688082B (en) Method, device, equipment and storage medium for determining adjustment proportion information of volume
CN109192218B (en) Method and apparatus for audio processing
CN109346111B (en) Data processing method, device, terminal and storage medium
CN110956971B (en) Audio processing method, device, terminal and storage medium
CN109635133B (en) Visual audio playing method and device, electronic equipment and storage medium
CN108831425B (en) Sound mixing method, device and storage medium
CN111061405B (en) Method, device and equipment for recording song audio and storage medium
CN109192223B (en) Audio alignment method and device
CN109003621B (en) Audio processing method and device and storage medium
CN109547843B (en) Method and device for processing audio and video
CN109065068B (en) Audio processing method, device and storage medium
CN111415650A (en) Text-to-speech method, device, equipment and storage medium
CN109102811B (en) Audio fingerprint generation method and device and storage medium
CN110600034B (en) Singing voice generation method, singing voice generation device, singing voice generation equipment and storage medium
CN111081277B (en) Audio evaluation method, device, equipment and storage medium
CN111276122A (en) Audio generation method and device and storage medium
CN110867194B (en) Audio scoring method, device, equipment and storage medium
CN113420177A (en) Audio data processing method and device, computer equipment and storage medium
CN109036463B (en) Method, device and storage medium for acquiring difficulty information of songs
CN109448676B (en) Audio processing method, device and storage medium

Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant