CN111028823B - Audio generation method, device, computer readable storage medium and computing equipment - Google Patents

Audio generation method, device, computer readable storage medium and computing equipment Download PDF

Info

Publication number
CN111028823B
CN111028823B CN201911267158.4A CN201911267158A CN111028823B CN 111028823 B CN111028823 B CN 111028823B CN 201911267158 A CN201911267158 A CN 201911267158A CN 111028823 B CN111028823 B CN 111028823B
Authority
CN
China
Prior art keywords
audio
phoneme
pronunciation information
target
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911267158.4A
Other languages
Chinese (zh)
Other versions
CN111028823A (en
Inventor
肖纯智
劳振锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Kugou Computer Technology Co Ltd
Original Assignee
Guangzhou Kugou Computer Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Kugou Computer Technology Co Ltd filed Critical Guangzhou Kugou Computer Technology Co Ltd
Priority to CN201911267158.4A priority Critical patent/CN111028823B/en
Publication of CN111028823A publication Critical patent/CN111028823A/en
Application granted granted Critical
Publication of CN111028823B publication Critical patent/CN111028823B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/033Voice editing, e.g. manipulating the voice of the synthesiser
    • G10L13/0335Pitch control

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The application relates to an audio generation method, an audio generation device, a computer readable storage medium and computing equipment, and belongs to the field of electronic technology application. The method comprises the following steps: acquiring a plurality of pronunciation information, wherein the plurality of pronunciation information comprises at least one first pronunciation information, and each first pronunciation information comprises: the method comprises the steps of corresponding pitch of an audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme and a continuous tone indicator, wherein the adjacent phonemes of any target phoneme comprise a previous phoneme and a next phoneme of any target phoneme, and the continuous tone indicator is used for indicating whether continuous tones exist between the target phoneme and the adjacent phonemes in the pronunciation information; and inputting the plurality of pronunciation information into the audio synthesis model to obtain target audio output by the audio synthesis model, wherein an audio frame corresponding to each of the plurality of pronunciation information is one audio frame in the target audio. The application can improve the quality of the output audio.

Description

Audio generation method, device, computer readable storage medium and computing equipment
Technical Field
The present application relates to the field of electronic technology application, and in particular, to an audio generating method, an audio generating device, a computer readable storage medium and a computing device.
Background
The audio synthesis model is a model for performing audio synthesis. Audio such as songs can be synthesized through the audio synthesis model.
The current process of generating audio using an audio synthesis model includes: and obtaining an audio synthesis model through a model training process, inputting a plurality of pronunciation information (conditions) into the audio synthesis model, and outputting target audio by the audio synthesis model. The plurality of pronunciation information is in one-to-one correspondence with a plurality of audio frames included in the output target audio, and each pronunciation information is used for describing audio characteristics of the corresponding audio frame. In general, each pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the previous phoneme of the target phoneme and the content of the next phoneme.
However, the songs sung by the real person are actually formed by the change of the human acoustic cavity, and the songs generated by the audio synthesis model cannot effectively reflect the change process of the human acoustic cavity, so that the quality of the output audio is poor.
Disclosure of Invention
The embodiment of the application provides an audio generation method, an audio generation device, a computer readable storage medium and computing equipment, which can improve the quality of generated audio. The technical scheme is as follows:
according to a first aspect of an embodiment of the present application, there is provided an audio generating method including:
Acquiring a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information comprise at least one piece of first pronunciation information, and each piece of first pronunciation information comprises: the method comprises the steps of enabling a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme and a continuous tone indicator, wherein the adjacent phonemes of any target phoneme comprise a previous phoneme and a next phoneme of any target phoneme, the continuous tone indicator is used for indicating whether continuous tones exist between the target phoneme in sound information and the adjacent phonemes of the target phoneme, and the audio frame corresponding to each sound information in a plurality of sound information is one audio frame in the target audio;
And inputting the plurality of pronunciation information into an audio synthesis model to obtain target audio output by the audio synthesis model.
Optionally, before the acquiring the plurality of pronunciation information, the method further includes:
Analyzing the sample audio to obtain a plurality of sample pronunciation information, wherein the plurality of sample pronunciation information comprises at least one second pronunciation information, and each second pronunciation information comprises: the method comprises the steps of enabling a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme and a continuous tone indicator, wherein the audio frame corresponding to each sample pronunciation information in the plurality of sample pronunciation information is one audio frame in the sample audio;
and performing model training based on the plurality of sample pronunciation information to obtain the audio synthesis model.
Optionally, the analyzing the sample audio to obtain a plurality of sample pronunciation information includes:
Acquiring the pitch of each audio frame in the sample audio;
Detecting whether continuous tones exist between each phoneme and adjacent phonemes in the sample audio to obtain continuous tone detection results;
And generating the plurality of sample pronunciation information based on the pitch of each audio frame and the continuous sound detection result.
Optionally, the detecting whether the continuous sound exists between each phoneme and the adjacent phonemes in the sample audio, to obtain a continuous sound detection result includes:
when M adjacent audio frames before and N adjacent audio frames after the starting point of a sample audio frame set corresponding to any phoneme in the sample audio are all pitch frames, determining that the front continuous sound exists in any phoneme, wherein the pitch frames are audio frames with the pitch being more than 0, N and M are positive integers, and the sample audio frame set corresponding to any phoneme is a set of audio frames formed by any phoneme in the sample audio in the pronunciation process;
and when M adjacent audio frames before and N adjacent audio frames after the end point of the sample audio frame set corresponding to any phoneme in the sample audio are all pitch frames, determining that the post-sound exists in any phoneme.
Optionally, the continuous sound indicator includes a front continuous sound indicator and a rear continuous sound indicator, the front continuous sound indicator is used for indicating whether continuous sound exists between a target phoneme in the pronunciation information and a previous phoneme adjacent to the target phoneme, and the rear continuous sound indicator is used for indicating whether continuous sound exists between the target phoneme in the pronunciation information and a next phoneme adjacent to the target phoneme;
or the continuous tone indicator comprises an indicator for indicating whether continuous tones exist between the target phoneme and the adjacent previous phoneme in the pronunciation information and whether continuous tones exist between the target phoneme and the adjacent next phoneme in the pronunciation information.
According to a second aspect of an embodiment of the present application, there is provided an audio generating apparatus including:
An acquisition module, configured to acquire a plurality of pronunciation information, where the plurality of pronunciation information includes at least one first pronunciation information, and each of the first pronunciation information includes: the method comprises the steps of enabling a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme and a continuous tone indicator, wherein the adjacent phonemes of any target phoneme comprise a previous phoneme and a next phoneme of any target phoneme, the continuous tone indicator is used for indicating whether continuous tones exist between the target phoneme in sound information and the adjacent phonemes of the target phoneme, and the audio frame corresponding to each sound information in a plurality of sound information is one audio frame in the target audio;
And the processing module is used for inputting the plurality of pronunciation information into the audio synthesis model to obtain target audio output by the audio synthesis model.
Optionally, the apparatus further comprises:
The analysis module is configured to analyze the sample audio before the plurality of pronunciation information is acquired to obtain a plurality of sample pronunciation information, where the plurality of sample pronunciation information includes at least one second pronunciation information, and each second pronunciation information includes: the method comprises the steps of enabling a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme and a continuous tone indicator, wherein the audio frame corresponding to each sample pronunciation information in the plurality of sample pronunciation information is one audio frame in the sample audio;
And the training module is used for carrying out model training based on the plurality of sample pronunciation information to obtain the audio synthesis model.
Optionally, the analysis module includes:
an acquisition sub-module for acquiring a pitch of each audio frame in the sample audio;
The detection sub-module is used for detecting whether continuous tones exist between each phoneme and adjacent phonemes in the sample audio to obtain continuous tone detection results;
and the generation sub-module is used for generating the plurality of sample pronunciation information based on the pitch of each audio frame and the continuous sound detection result.
Optionally, the detection submodule is configured to:
when M adjacent audio frames before and N adjacent audio frames after the starting point of a sample audio frame set corresponding to any phoneme in the sample audio are all pitch frames, determining that the front continuous sound exists in any phoneme, wherein the pitch frames are audio frames with the pitch being more than 0, N and M are positive integers, and the sample audio frame set corresponding to any phoneme is a set of audio frames formed by any phoneme in the sample audio in the pronunciation process;
and when M adjacent audio frames before and N adjacent audio frames after the end point of the sample audio frame set corresponding to any phoneme in the sample audio are all pitch frames, determining that the post-sound exists in any phoneme.
Optionally, the continuous sound indicator includes a front continuous sound indicator and a rear continuous sound indicator, the front continuous sound indicator is used for indicating whether continuous sound exists between a target phoneme in the pronunciation information and a previous phoneme adjacent to the target phoneme, and the rear continuous sound indicator is used for indicating whether continuous sound exists between the target phoneme in the pronunciation information and a next phoneme adjacent to the target phoneme;
or the continuous tone indicator comprises an indicator for indicating whether continuous tones exist between the target phoneme and the adjacent previous phoneme in the pronunciation information and whether continuous tones exist between the target phoneme and the adjacent next phoneme in the pronunciation information.
According to a third aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, causes the processor to implement the audio generation method according to any one of the preceding first aspects.
According to a fourth aspect of embodiments of the present application, there is provided a computing device comprising a processor and a memory;
The memory stores computer instructions; the processor executes the computer instructions stored in the memory to cause the computing device to perform the audio generation method of any of the first aspects.
The technical scheme provided by the embodiment of the application can comprise the following beneficial effects:
According to the audio generation method and device, the pronunciation information in the input audio synthesis model comprises the continuous tone indicator, the continuous tone indicator is used for indicating whether continuous tones exist between the target phonemes and the adjacent phonemes in the pronunciation information, and the continuous tone condition of each phoneme is involved in the audio generation process, so that the audio synthesized by the audio synthesis model can effectively reflect the occurring continuous tone condition, and the sound smoothness at the continuous tone position is improved. Therefore, the change process of the human acoustic cavity can be effectively reflected, and the quality of output audio is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
In order to more clearly illustrate the embodiments of the present application, the drawings that are required for the description of the embodiments will be briefly described below, it being apparent that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a flowchart illustrating an audio generation method according to an exemplary embodiment.
Fig. 2 is a flowchart illustrating another audio generation method according to an exemplary embodiment.
Fig. 3 is a block diagram of an audio generating apparatus according to an exemplary embodiment.
Fig. 4 is a block diagram of another audio generating apparatus according to an exemplary embodiment.
FIG. 5 is a block diagram illustrating an analysis module according to an exemplary embodiment.
Fig. 6 is a schematic diagram illustrating a structure of a terminal according to an exemplary embodiment.
Fig. 7 is a schematic diagram illustrating a structure of a server according to an exemplary embodiment.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
A phoneme (phone) is a minimum unit of speech divided according to the natural attribute of speech, and is analyzed according to the pronunciation actions in syllables, one action constituting one phoneme. In different pronunciation rules, the type of phonemes is different. For example, for English pronunciation rules, the phonemes include vowel phonemes and consonant phonemes, each of which is subdivided into a plurality of specific phonemes, and phonetic symbols of an international phonetic symbol (formulated by the international phonetic society to uniformly mark letters of voices of each country; also referred to as "international phonetic letters" or "ten thousand phonetic letters") are in one-to-one correspondence with the phonemes; for the Chinese pronunciation rules, the pronunciation of each Chinese character can be decomposed into initials and finals, each phoneme comprises two types of initials and finals, each type is subdivided into a plurality of specific phonemes, and symbols in the initial and finals table of Chinese are in one-to-one correspondence with the phonemes.
The pronunciation of different phonemes requires changing the acoustic cavity to a different shape, whereas the change of the acoustic cavity requires a process, which can be divided simply into three phases, for example, open, stationary, closed. The opening and the closing are both processes of opening and closing changes of the acoustic cavity, if the sounds of the first phoneme and the second phoneme are similar in two adjacent phonemes, the changes of the acoustic cavity are not obvious when the two phonemes are continuously pronounced, and the situation can be called continuous sound when the stable phase of the first phoneme is directly converted into the stable phase of the second phoneme. For example, in the occurrence of the continuous sound, the closed state of the first phoneme and the open state of the second phoneme disappear in the two consecutive phonemes.
Taking Chinese pronunciation rules as examples, continuous tones can be generated by continuous reading of "one" and "sample" in "same" mode. However, when the song is singed, a little pause exists between the 'one' and the 'sample', and the situation of non-continuous sound occurs. Thus, in actual pronunciation, the same adjacent phonemes may have different pronunciation effects under different conditions.
In the conventional audio synthesis model, when generating audio, each of a plurality of pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the previous phoneme of the target phoneme and the content of the next phoneme. The audio synthesized by the audio synthesis model cannot reflect the continuous sound condition which should appear, so that the sound smoothness at the continuous sound position is poor. Therefore, the human vocal cavity change process cannot be effectively reflected, resulting in poor quality of output audio.
The embodiment of the application provides an audio generation method which can be applied to generation of various types of audio, such as Chinese songs, english songs or other audio including human voice, such as comment or folk art forms (including ballad singing, story telling, comic dialogues, clapper talks, cross talks, etc.) audio. By the audio generation method, the simulation of human voice can be realized, so that artificial intelligent singing functions such as virtual song and the like are provided for users.
As shown in fig. 1, fig. 1 is a flowchart of the audio generation method, including:
step 101, acquiring a plurality of pieces of pronunciation information, wherein the plurality of pieces of pronunciation information comprise at least one piece of first pronunciation information, and each piece of first pronunciation information comprises: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the adjacent phonemes of the target phoneme, and the continuous tone indicator.
The adjacent phonemes of any target phoneme comprise a previous phoneme and a next phoneme of any target phoneme, the continuous sound indicator is used for indicating whether continuous sound exists between the target phoneme and the adjacent phoneme in the pronunciation information, and the audio frame corresponding to each pronunciation information in the plurality of pronunciation information is one audio frame in the target audio.
Step 102, inputting a plurality of pronunciation information into the audio synthesis model to obtain target audio output by the audio synthesis model.
In summary, according to the audio generation method provided by the embodiment of the present application, since the pronunciation information in the input audio synthesis model includes the continuous tone indicator, the continuous tone indicator is used to indicate whether the continuous tone exists between the target phoneme and the adjacent phoneme in the pronunciation information, and since the continuous tone condition of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the occurring continuous tone condition, and the sound smoothness at the continuous tone is improved. Therefore, the change process of the human acoustic cavity can be effectively reflected, and the quality of output audio is improved.
The embodiment of the application provides another audio generation method, which can be executed by an audio generation device, wherein the audio generation device can be a terminal or a server, and the terminal can be a display, a computer, a smart phone, a tablet computer, a laptop portable computer and the like. The server may be one server or a server cluster composed of several servers. The method involves a model training process and a model use process, as shown in fig. 2, fig. 2 is a flow chart of the audio generation method, the method comprising:
Step 201, analyzing the sample audio to obtain a plurality of sample pronunciation information.
The sample audio may be pre-recorded one or more specified audio, which may be song audio or other audio including human voice, such as comment or folk art forms (including ballad singing, story telling, comic dialogues, clapper talks, cross talks, etc.) audio.
The sample audio may include a plurality of audio frames, where the plurality of audio frames respectively correspond to a plurality of sample pronunciation information, typically one-to-one, and each of the sample pronunciation information is used to represent an audio feature of the corresponding audio frame. The plurality of sample pronunciation information includes at least one second pronunciation information, each second pronunciation information including: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the adjacent phonemes of the target phoneme, and the continuous tone indicator. The adjacent phonemes of any one target phoneme include a preceding phoneme and a following phoneme of the any one target phoneme. The previous phoneme and the next phoneme are generally different from the either target phoneme, respectively. Taking the Chinese pronunciation rules as an example, phonemes included in "hello" are "n, i, h, ao" in turn. For phonemes: vowels "i", the former phoneme is an initial consonant "n", and the latter phoneme is an initial consonant "h". The audio frame corresponding to each sample pronunciation information in the plurality of sample pronunciation information is one audio frame in the sample audio. The voice content of the corresponding audio frame contains the content of the corresponding phonemes.
Optionally, the process of analyzing the sample audio to obtain the plurality of sample pronunciation information may include:
and A1, acquiring the pitch of each audio frame in the sample audio.
For example, designated software may be employed to identify the pitch of each audio frame in the sample audio. Wherein, in the silence section, unvoiced section and transient phoneme transition zone of non-continuous sound of the sample audio, etc., the audio has no periodicity because the vocal cords of the person do not vibrate, and the pitch cannot be extracted; while the vocal cords are continuously vibrating in the voiced sound segment and the phoneme transition region of the continuous sound (i.e. the region between one phoneme and the other phoneme in the two phonemes of the continuous sound), the audio has periodicity, and the pitch can be extracted. The pitch may be recorded in the form of a sequence of pitch values or in the form of a pitch chart.
And A2, detecting whether continuous tones exist between each phoneme and adjacent phonemes in the sample audio to obtain continuous tone detection results.
The method for detecting whether continuous sound exists between each phoneme and adjacent phonemes in the sample audio is various. The embodiments of the present application are described by taking the following two alternative modes as examples:
In a first alternative, it is determined whether each phoneme is polyphonic with an adjacent phoneme by detecting whether there is a pitch frame in each of the pitch frame-adjacent audio frames in the sample audio. Wherein, there are audio frames with a pitch frame that is greater than 0.
In the embodiment of the application, the set of audio frames formed by any phoneme in the pronunciation process is the set of audio frames corresponding to the any phoneme. For the convenience of readers to understand, in the subsequent embodiments, a set of audio frames formed by any phoneme in the sample audio in the pronunciation process is referred to as a sample audio frame set corresponding to the any phoneme; and in the target audio, the set of audio frames formed by any phoneme in the pronunciation process is called a target audio frame set corresponding to any phoneme.
For each phoneme in the sample audio, it is detected whether M audio frames adjacent before and N audio frames adjacent after (i.e., m+n consecutive audio frames) the start point of the sample audio frame set corresponding to the phoneme are each a pitch frame, and N and M are each positive integers. When M adjacent audio frames before and N adjacent audio frames after the starting point of the sample audio frame set corresponding to any phoneme in the sample audio are all pitch frames, determining that the front continuous sound exists in any phoneme; and when any phoneme exists in the sample audio, determining that any non-pitch frame exists in M adjacent audio frames before and N adjacent audio frames after the starting point of the sample audio frame set corresponding to any phoneme, and determining that the pre-connection sound does not exist in any phoneme, wherein the non-pitch frame is an audio frame with the pitch equal to 0. The sample audio frame set corresponding to any one phoneme is a set of audio frames formed by pronunciation of the any one phoneme. That is, the set of audio frames corresponding to any phoneme is a set of one or more consecutive audio frames of any phoneme in the pronunciation process of the any phoneme. For example, assume the phonemes: when the pronunciation of the initial consonant "n" is short, the duration of the sound is only 70ms (millisecond), and the duration of one audio frame is 10ms, the sample audio frame set corresponding to the initial consonant "n" is provided with 7 audio frame frames, and the voice content of each audio frame contains the phoneme "n"; for another example, assume the phonemes: the vowel "i" pronounces longer and lasts 300ms, and the audio frames corresponding to the vowel "i" are assembled with 30 audio frame frames, and the voice content of each audio frame contains the phoneme "i".
For each phoneme in the sample audio, it is detected whether M audio frames adjacent before and N audio frames adjacent after (i.e., consecutive m+n audio frames) the end point of the sample audio frame set corresponding to the phoneme are each a pitch frame. When M adjacent audio frames before and N adjacent audio frames after the end point of the sample audio frame set corresponding to any phoneme in the sample audio are all pitch frames, determining that the post-sound exists in any phoneme; and when any phoneme exists in the sample audio, determining that any non-pitch frame exists in M adjacent audio frames before and N adjacent audio frames after the end point of the sample audio frame set corresponding to any phoneme, and determining that the non-pitch frame is an audio frame with the pitch equal to 0. Wherein M and N may be the same or different, for example, M and N may have values ranging from 1 to 5.
In an alternative example, the start point and the end point of the sample audio frame set corresponding to each phoneme may be respectively represented by a start time and an end time of the sample audio frame set in the audio where the sample audio frame set is located, where the start time is 9:00, the end point is 9:02; in another alternative example, each audio frame in the sample audio is assigned a sequence number that identifies the position of the corresponding audio frame in the sample audio, and the start point and the end point of the sample audio frame set corresponding to each phoneme may also be represented by the sequence number of the first audio frame and the sequence number of the last audio frame of the sample audio frame set, respectively. The representation of the sample audio frame set according to the embodiments of the present application is not limited.
For each phoneme, the starting point of the first audio frame of the corresponding sample audio frame set is the front demarcation point of the phoneme, the ending point of the last audio frame of the corresponding sample audio frame set is the rear demarcation point of the phoneme, and the step A2 is essentially to query whether the M audio frames adjacent before and the N audio frames adjacent after the front demarcation point of each phoneme are all pitch frames and whether the M audio frames adjacent before and the N audio frames adjacent after the rear demarcation point of each phoneme are all pitch frames, so as to determine whether continuous tones exist between each phoneme and the adjacent phoneme. That is, for the demarcation point of each phoneme, it is queried whether the M adjacent audio frames before and the N adjacent audio frames after the front demarcation point are each a pitch frame, so as to determine whether a continuous sound exists between each phoneme and the adjacent phoneme. By adopting the continuous tone detection mode, the detection mode of the boundary point of each phoneme is consistent. And the influence of errors in the determined sample audio frame set on the continuous tone detection result is effectively avoided, so that the detected continuous tone state can be ensured to be more accurate.
It should be noted that, when performing continuous tone detection on each phoneme, the foregoing continuous tone detection process may be performed by sequentially traversing all sample audio frame sets in the sample audio, skipping other irrelevant audio frames, or directly traversing all audio frames in the sample audio, where the foregoing continuous tone detection process is performed at the sample audio frame set corresponding to each phoneme.
For example, assuming that the division of phonemes is performed according to a chinese pronunciation rule, m=n=3, the text content of the sample audio is "all the same", the phonemes included are "w, o, M, en, d, ou, y, i, y, ang" in turn, for each of the phonemes in the sample audio, it is detected whether each of 3 adjacent audio frames before the start point and 3 adjacent audio frames after the start point (i.e., 6 adjacent audio frames) of the sample audio frame set corresponding to the phoneme is a pitch frame, and whether each of 3 adjacent audio frames before the end point and 3 adjacent audio frames after the end point of the sample audio frame set is a pitch frame, then for the phonemes: and if the first 3 audio frames and the last 3 audio frames adjacent to the starting point of the sample audio frame set are detected to be provided with pitch frames, and the first 3 audio frames and the last 3 audio frames adjacent to the ending point of the sample audio frame set are detected to be provided with pitch frames, the phoneme 'i' is provided with a front continuous sound and a rear continuous sound.
It should be noted that, in the sample audio, the sample audio frame set corresponding to each phoneme is known, and in an alternative manner, the sample audio frame set corresponding to each phoneme may be manually calibrated in advance; in another alternative, the sample audio frame set corresponding to each phoneme may be identified by the audio recognition software; in yet another alternative, the sample audio is pre-generated audio with known content for each phoneme, such as a song with lyrics downloaded in a network, and the set of sample audio frames for each phoneme is calibrated at the time of the sample audio acquisition. The embodiment of the application does not limit the acquisition mode of the sample audio frame set corresponding to each phoneme.
In a second alternative, it is determined by means of manual calibration whether there is a continuous sound between each phoneme and the adjacent phonemes.
As in step A1, the pitch of the audio frame may be recorded in the form of a sequence of pitch values or in the form of a pitch chart. The audio generating means may present the pitch of the sample audio and the corresponding sequence number (or icon) of the respective audio frame in the manner of the aforementioned recording. The staff can mark the audio frame where the phonemes of the front continuous sound and/or the phonemes of the rear continuous sound exist in a manual marking mode. Accordingly, the audio generating apparatus receives the labeling instruction, and determines whether or not a continuous sound exists between a phoneme in each audio frame and an adjacent phoneme based on the labeling instruction.
It should be noted that, the continuous tone indicator is used for indicating whether continuous tones exist between the target phoneme and the adjacent phonemes in the pronunciation information. The even indicator may be implemented in a variety of ways. The embodiments of the present application will be described by taking the following several implementation manners as examples.
In a first alternative implementation, the even-sound indicator includes a front even-sound indicator and a rear even-sound indicator. The pre-ligature indicator is used for indicating whether ligature exists between a target phoneme and a previous phoneme adjacent to the target phoneme in the pronunciation information. The post-ligature indicator is used for indicating whether ligature exists between a target phoneme and a next phoneme adjacent to the target phoneme in the pronunciation information. Wherein, the front sound indicator and the rear sound indicator can be composed of one or more characters. The character may be a binary character, such as 0 or 1. For example, a 0 may be used to indicate the presence of a continuous tone and a 1 may be used to indicate the absence of a continuous tone. The character may be other types of characters, such as letters, which are not limited in the embodiments of the present application. The front sound connection indicator and the rear sound connection indicator can occupy one field in the sound information respectively, and the front sound connection indicator and the rear sound connection indicator occupy two fields in the sound information.
In a second alternative implementation, the continuous tone indicator includes an indicator for indicating whether a continuous tone exists between a target phoneme in the present pronunciation information and a preceding phoneme adjacent thereto, and whether a continuous tone exists between a target phoneme in the present pronunciation information and a following phoneme adjacent thereto. Wherein the even indicator may be composed of one or more characters. The character may be a binary character, for example, the hyphenated indicator may include: 00. 01, 10 and 11. For example, 00 may be used to indicate that there is no continuous sound for both the preceding and following phones adjacent to the target phone, 01 may be used to indicate that there is continuous sound for both the preceding and following phones adjacent to the target phone, 10 may be used to indicate that there is no continuous sound for both the preceding and following phones adjacent to the target phone, and 11 may be used to indicate that there is continuous sound for both the preceding and following phones adjacent to the target phone. The character may be other types of characters, such as letters, which are not limited in the embodiments of the present application. The one indicator may occupy a field in the pronunciation information.
In the second alternative implementation manner, one indicator can simultaneously indicate the conditions indicated by the front continuous sound indicator and the rear continuous sound indicator, so that occupation of fields is reduced, and the operation efficiency of the subsequent model is improved.
When the continuous sound indicator is set by adopting the indication modes provided by the first and second alternative implementation modes, each piece of sound information in sound information corresponding to all audio frames in the sample audio is the second sound information, that is, each piece of sound information comprises the continuous sound indicator, so that the continuous sound condition can be indicated effectively.
In practical implementation, when the target phonemes in the pronunciation information and all phonemes adjacent to the target phonemes do not have continuous tones, continuous tone indicators can not be carried; when the target phonemes in the pronunciation information and all phonemes adjacent to the target phonemes have continuous tones, continuous tone indicators can be carried. That is, the plurality of sample pronunciation information corresponding to the plurality of audio frames of the sample audio includes two types of pronunciation information, which are respectively second pronunciation information and other pronunciation information, and the continuous tone indicator carried by the second pronunciation information may refer to the case in the first alternative implementation manner and the second alternative implementation manner. The content of the other pronunciation information may be referred to conventional pronunciation information or simply deformed on the basis of the content in the second pronunciation information. When the plurality of sample pronunciation information comprises two types of pronunciation information, compared with the case that all sample pronunciation information is second pronunciation information, the number of pronunciation information carrying continuous sound indicators in all sample pronunciation information can be reduced, occupation of fields is reduced, and operation efficiency of a subsequent model is improved.
And A3, generating a plurality of sample pronunciation information based on the pitch of each audio frame and the continuous sound detection result.
The audio generating means may generate a corresponding plurality of sample pronunciation information for all the audio frames based on the pitch of each audio frame and the continuous tone detection result.
It should be noted that, according to the actual situation, other information describing the corresponding audio frame may be added to the sample pronunciation information. Exemplary, the sample pronunciation information further includes: and the position information of the corresponding audio frame is used for describing the position of the corresponding audio frame in a sample audio frame set, wherein the sample audio frame set is a set of audio frames corresponding to a target phoneme corresponding to the corresponding audio frame. The explanation thereof may be referred to the explanation in the aforementioned step A2.
For example, the location information of a corresponding audio frame may be represented using the segment locations of the audio frame in a sample set of audio frames. Optionally, the sample audio frame set may be divided into w segments according to a preset segmentation rule (for example, the segmentation rule is an average segmentation rule), w is a positive integer, and the segmentation position is one of the w segments. Optionally, w is a fixed value, and w >1. For example, w=3, i.e., the sample audio frame set is divided into 3 segments, which are divided into open, stationary and closed segments of equal (or similar) duration according to an average segmentation rule. Then it is assumed that the audio frame corresponding to the sample voicing information is in an open segment, and the position information of the corresponding audio frame is used to indicate the open segment.
For example, the aforementioned location information may identify the segment location using one or more characters. The character may be a binary character, for example, the location information includes: 00. 01 and 10. By way of example, an open segment may be denoted by 00, a plateau by 01, and a closed segment by 10. The character may be other types of characters, such as letters, which are not limited in the embodiments of the present application. The aforementioned location information may occupy a field in the pronunciation information.
And 202, performing model training based on the pronunciation information of a plurality of samples to obtain an audio synthesis model.
Because the sample audio is known, the sample audio can be used as a label, a plurality of sample pronunciation information are used as input information, and model training is carried out until a loss value corresponding to a preset loss function is converged to a target range, so that an audio synthesis model is obtained.
By adopting the plurality of sample pronunciation information to carry out model training, the model learning device can effectively help the audio synthesis model learn different pronunciation states formed by phonemes in continuous sound and non-continuous sound states, and effectively improve the pronunciation smoothness of the audio generated by the audio synthesis model obtained by training at the continuous sound.
Step 203, obtaining a plurality of pronunciation information, wherein the plurality of pronunciation information includes at least one first pronunciation information, and each first pronunciation information includes: the pitch of the corresponding audio frame, the content of the target phoneme corresponding to the corresponding audio frame, the content of the adjacent phonemes of the target phoneme, and the continuous tone indicator.
Wherein the interpretation of adjacent phones and even indicators may refer to the interpretation in step 201 described previously. The target audio to be synthesized subsequently may include a plurality of audio frames, where the plurality of audio frames respectively correspond to the plurality of pronunciation information, typically one-to-one, and each pronunciation information is used to represent an audio feature of the corresponding audio frame. An audio frame can be correspondingly generated based on the pronunciation information. The audio frame corresponding to each pronunciation information is one frame of the audio frames formed in the pronunciation process of the corresponding phonemes, and the voice content of the corresponding audio frame contains the content of the corresponding phonemes.
In the embodiment of the present application, the process of obtaining multiple pronunciation information may have multiple implementation manners:
in a first implementation, the audio generating means may receive pronunciation information of a plurality of phonemes. Alternatively, the initial audio may be audio recorded by the user himself or may be audio acquired through other means, such as audio downloaded from a network. The user can acquire different types of initial audio based on own requirements, so that the generated subsequent target audio can effectively meet the user requirements, the customization and individuation of audio synthesis are realized, and the user experience is improved.
For example, the audio generating device is a mobile phone, a notebook computer, a desktop computer, or the like, and the user (or programmer) can Input the pronunciation information of the plurality of phonemes through an I/O (Input/Output), such as a keyboard or a touch screen, and correspondingly, the audio generating device receives the pronunciation information of the plurality of phonemes. Alternatively, the process of receiving pronunciation information of a plurality of phonemes by the audio generating apparatus may have the following two alternative examples: in a first alternative example, the audio generating apparatus receives first information to be edited, the first information to be edited including: the pitch of each target audio frame to be generated, the content of a target phoneme corresponding to each target audio frame, the content of adjacent phonemes of each target phoneme and a continuous tone indicator corresponding to each target phoneme; the audio generating device encodes the received first information to be edited to obtain pronunciation information of a plurality of phonemes. For example, the audio generating apparatus may perform the foregoing encoding of the first information to be edited by using onehot encoding method or emmbebing encoding method. In the second alternative example, the audio generating apparatus may directly receive the pronunciation information of a plurality of phonemes, each of which is information encoded by onehot encoding means, emmbebing encoding means, or the like.
In a second implementation, the audio generating device may receive at least one initial audio and analyze the at least one initial audio to obtain pronunciation information for a plurality of phonemes. The analysis process for each initial audio may refer to the process of analyzing the sample audio in step 201 described above. Alternatively, the process of obtaining pronunciation information of the plurality of phonemes may include: analyzing at least one initial audio frequency to obtain second information to be edited, wherein the second information to be edited comprises: the pitch of each target audio frame to be generated, the content of a target phoneme corresponding to each target audio frame, the content of adjacent phonemes of each target phoneme and a continuous tone indicator corresponding to each target phoneme; the audio generating device encodes the received second information to be edited to obtain pronunciation information of a plurality of phonemes. For example, the audio generating apparatus may perform encoding of the aforementioned second information to be edited by using onehot encoding method or emmbebing encoding method.
In practical implementation, the audio generating device may receive a plurality of initial audio frequencies, analyze the plurality of initial audio frequencies, and obtain pronunciation information of a plurality of phonemes, where in a subsequent process, the synthesized target audio frequency is equivalent to an audio frequency obtained by combining the plurality of initial audio frequencies.
Referring to step 201, the sample pronunciation information may further include other information describing the corresponding audio frame according to the actual situation. Accordingly, the pronunciation information obtained in step 203 is consistent with the information content in the sample pronunciation information, and other information describing the corresponding audio frame may be added. Illustratively, the pronunciation information further includes: position information of the corresponding audio frame describing the position of the corresponding audio frame (i.e. the audio frame to be generated) in the set of audio frames corresponding to the corresponding phoneme. The phoneme corresponding to the corresponding audio frame is assumed to be a first phoneme, and the audio frame set corresponding to the first phoneme is a target audio frame set, namely, a set of audio frames formed by the first phoneme in the pronunciation process in the target audio. The explanation of the location information may refer to the aforementioned step 201, which is not limited in the embodiment of the present application.
For the convenience of the reader, table 1 schematically shows the contents of a plurality of pronunciation information, which are pronunciation information of "same" as chinese text contents, and table 1 performs phoneme division in chinese pronunciation rules, as shown in table 1, assuming that the position information includes three types of 00, 01 and 10, 00 representing an open segment, 01 representing a stationary segment, and 10 representing a closed segment. The even indicators include a front even indicator and a rear even indicator, where 0 indicates the presence of even and 1 indicates the absence of even. "null" indicates no presence. Taking pronunciation information with the serial number of 4 corresponding to the audio frame as an example, the content of the pronunciation information is: the pitch is 150Hz, and the target phonemes are: vowel i (representing that phoneme i is contained in the voice content of the audio frame with the sequence number of 4), the former phoneme is the initial consonant y, the latter phoneme is the initial consonant y, the front ligature indicator is 0 (representing that the front ligature exists), the rear ligature indicator is 0 (representing that the rear ligature exists), and the position information is 00 (representing that the opening section exists). The explanation of other pronunciation information may refer to the explanation of the pronunciation information, and the embodiments of the present application will not be repeated.
TABLE 1
And 204, inputting the plurality of pronunciation information into the audio synthesis model to obtain target audio output by the audio synthesis model.
The audio generating device inputs the plurality of pronunciation information into an audio synthesis model, and the audio output by the audio synthesis model is the target audio. In the embodiment of the application, the audio synthesis model is a model for audio synthesis, and audio such as songs can be synthesized through the audio synthesis model. The audio synthesis model is typically a deep learning (DEEP LEARNING) model. The audio synthesis model may be wavenet model, or NPSS model, for example.
Steps 201 to 202 belong to a model training process, and steps 203 to 204 belong to a model using process. According to the audio generation method provided by the embodiment of the application, because the pronunciation information in the input audio synthesis model comprises the continuous tone indicator, the continuous tone indicator is used for indicating whether continuous tones exist between the target phonemes and the adjacent phonemes in the pronunciation information, and because the continuous tone condition of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the occurring continuous tone condition, and the sound smoothness at the continuous tone position is improved. Therefore, in the embodiment of the application, the pronunciation information is expanded, and whether the continuous sound information exists before and after the target phonemes in the pronunciation information is increased, so that the audio synthesis model is effectively helped to learn the composition of each pronunciation state under continuous sound and non-continuous sound, the pronunciation smoothness of continuous sound is effectively improved, the effective reflection of the human vocal cavity change process can be realized, and the quality of output audio is improved.
It should be noted that, the foregoing audio synthesis method may be executed by a terminal, may be executed by a server, or may be executed by a combination of the terminal and the server. In the first case, when the aforementioned audio synthesis method is performed by a terminal, the aforementioned audio synthesis apparatus is the terminal, and steps 201 to 204 are performed by the terminal. In the second case, when the audio synthesis method is executed by the server, the audio synthesis device is the server, and steps 201 to 204 are executed by the server, where the sample audio in step 201 may be sent by the terminal to the server or may be acquired by the server by itself; in the first implementation manner in step 203, the plurality of pronunciation information may be sent by the terminal to the server, or may be acquired by the server by itself; in the second implementation manner in step 203, at least one initial audio may be sent by the terminal to the server, or may be acquired by the server. After step 204, the server may send the generated target audio to the terminal. In a third case, when the foregoing audio synthesis method is cooperatively performed by a terminal and a server, the foregoing audio synthesis apparatus is regarded as a system composed of the terminal and the server, steps 201 to 202 are performed by the server, steps 203 to 204 are performed by the terminal, and after step 202, the server transmits the trained audio synthesis model to the terminal.
The sequence of the steps of the audio generation method provided by the embodiment of the application can be properly adjusted, the steps can be correspondingly increased or decreased according to the situation, and any method which is easily conceivable to be changed by a person skilled in the art within the technical scope of the disclosure of the application is covered in the protection scope of the application, so that the description is omitted.
An embodiment of the present application provides an audio generating apparatus 30, as shown in fig. 3, including:
The obtaining module 301 is configured to obtain a plurality of pronunciation information, where the plurality of pronunciation information includes at least one first pronunciation information, and each first pronunciation information includes: the method comprises the steps of corresponding audio frame pitch, corresponding target phoneme content of the corresponding audio frame, adjacent phoneme content of the target phonemes and continuous tone indicators, wherein each adjacent phoneme of any target phoneme comprises a previous phoneme and a next phoneme of any target phoneme, and the continuous tone indicators are used for indicating whether continuous tones exist between the target phonemes and the adjacent phonemes in the pronunciation information. The audio frame corresponding to each of the plurality of pronunciation information is one audio frame in the target audio.
The processing module 302 is configured to input a plurality of pronunciation information into the audio synthesis model, and obtain a target audio output by the audio synthesis model.
According to the audio generation device provided by the embodiment of the application, because the pronunciation information in the input audio synthesis model comprises the continuous tone indicator, the continuous tone indicator is used for indicating whether continuous tones exist between the target phonemes and the adjacent phonemes in the pronunciation information, and because the continuous tone condition of each phoneme is involved in the audio generation process, the audio synthesized by the audio synthesis model can effectively reflect the occurring continuous tone condition, and the sound smoothness at the continuous tone position is improved. Therefore, the effective reflection of the human acoustic cavity change process can be realized, and the quality of the output audio is improved.
Optionally, as shown in fig. 4, the apparatus 30 further includes:
The analyzing module 303 is configured to analyze the sample audio before acquiring the plurality of pronunciation information, to obtain a plurality of sample pronunciation information, where the plurality of sample pronunciation information includes at least one second pronunciation information, and each second pronunciation information includes: the method comprises the steps of enabling a pitch of a corresponding audio frame, a content of a target phoneme corresponding to the corresponding audio frame, a content of an adjacent phoneme of the target phoneme and a continuous tone indicator, wherein the audio frame corresponding to each sample pronunciation information in the plurality of sample pronunciation information is one audio frame in the sample audio;
the training module 304 is configured to perform model training based on the plurality of sample pronunciation information, and obtain an audio synthesis model.
Optionally, as shown in fig. 5, the analysis module 303 includes:
an acquisition submodule 3031, configured to acquire a pitch of each audio frame in the sample audio;
a detection sub-module 3032, configured to detect whether a continuous sound exists between each phoneme and an adjacent phoneme in the sample audio, so as to obtain a continuous sound detection result;
A generating sub-module 3033 is used for generating a plurality of sample pronunciation information based on the pitch of each audio frame and the continuous sound detection result.
Optionally, a detection submodule 3032 is configured to:
When in the sample audio, M adjacent audio frames before and N adjacent audio frames after the starting point of the sample audio frame set corresponding to any phoneme are all pitch frames, the existence of a front continuous tone of any phoneme is determined, the pitch frames are audio frames with the pitch being more than 0, N and M are both positive integers, and the sample audio frame set corresponding to any phoneme is a set of audio frames formed in the pronunciation process of any phoneme;
and when M adjacent audio frames before and N adjacent audio frames after the end point of the sample audio frame set corresponding to any phoneme in the sample audio are all pitch frames, determining that the post-sound exists in any phoneme.
Optionally, the continuous sound indicator includes a front continuous sound indicator and a rear continuous sound indicator, the front continuous sound indicator is used for indicating whether continuous sound exists between a target phoneme in the pronunciation information and a previous phoneme adjacent to the target phoneme, and the rear continuous sound indicator is used for indicating whether continuous sound exists between the target phoneme in the pronunciation information and a next phoneme adjacent to the target phoneme;
Or the continuous sound indicator comprises an indicator which is used for indicating whether continuous sound exists between the target phoneme in the pronunciation information and the adjacent previous phoneme and whether continuous sound exists between the target phoneme in the pronunciation information and the adjacent next phoneme.
In an exemplary embodiment, a non-transitory computer readable storage medium is also provided, such as a memory, that includes instructions executable by a processor of a computing device to perform the audio generation method shown in various embodiments of the application. For example, the non-transitory computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
An embodiment of the application provides a computing device, which comprises a processor and a memory;
the memory stores computer instructions; the processor executes the computer instructions stored in the memory, causing the computing device to perform any one of the audio generation methods provided by the embodiments of the present application.
In an embodiment of the present application, the foregoing computing device may be a terminal, and fig. 6 shows a block diagram of a structure of a terminal 600 according to an exemplary embodiment of the present application. The terminal 600 may be: a smart phone, a tablet computer, an MP3 player (Moving Picture Experts Group Audio Layer III, motion picture expert compression standard audio plane 3), an MP4 (Moving Picture Experts Group Audio Layer IV, motion picture expert compression standard audio plane 4) player, a notebook computer, or a desktop computer. Terminal 600 may also be referred to by other names of user devices, portable terminals, laptop terminals, desktop terminals, etc.
In general, the terminal 600 includes: a processor 601 and a memory 602.
Processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like. The processor 601 may be implemented in at least one hardware form of DSP (DIGITAL SIGNAL Processing), FPGA (Field-Programmable gate array), PLA (Programmable Logic Array ). Processor 601 may also include a main processor, which is a processor for processing data in an awake state, also referred to as a CPU (Central Processing Unit ), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 601 may integrate a GPU (Graphics Processing Unit, image processor) for rendering and drawing of content required to be displayed by the display screen. In some embodiments, the processor 601 may also include an AI (ARTIFICIAL INTELLIGENCE ) processor for processing computing operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the audio generation method provided by the method embodiments of the present application.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603, and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by a bus or signal line. The individual peripheral devices may be connected to the peripheral device interface 603 via buses, signal lines or a circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 604, a touch display 605, a camera 606, audio circuitry 607, a positioning component 608, and a power supply 609.
Peripheral interface 603 may be used to connect at least one peripheral device associated with an I/O to processor 601 and memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, either or both of the processor 601, memory 602, and peripheral interface 603 may be implemented on separate chips or circuit boards, which is not limited in this embodiment.
The Radio Frequency circuit 604 is configured to receive and transmit RF (Radio Frequency) signals, also known as electromagnetic signals. The radio frequency circuit 604 communicates with a communication network and other communication devices via electromagnetic signals. The radio frequency circuit 604 converts an electrical signal into an electromagnetic signal for transmission, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 includes: antenna systems, RF transceivers, one or more amplifiers, tuners, oscillators, digital signal processors, codec chipsets, subscriber identity module cards, and so forth. The radio frequency circuit 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocol includes, but is not limited to: metropolitan area networks, various generations of mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (WIRELESS FIDELITY ) networks. In some embodiments, the radio frequency circuit 604 may further include NFC (NEAR FIELD Communication) related circuits, which is not limited by the present application.
The display screen 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display 605 is a touch display, the display 605 also has the ability to collect touch signals at or above the surface of the display 605. The touch signal may be input as a control signal to the processor 601 for processing. At this point, the display 605 may also be used to provide virtual buttons and/or virtual keyboards, also referred to as soft buttons and/or soft keyboards. In some embodiments, the display 605 may be one, providing a front panel of the terminal 600; in other embodiments, the display 605 may be at least two, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display 605 may be a flexible display, disposed on a curved surface or a folded surface of the terminal 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The display 605 may be made of LCD (Liquid CRYSTAL DISPLAY), OLED (Organic Light-Emitting Diode), or other materials.
The camera assembly 606 is used to capture images or video. Optionally, the camera assembly 606 includes a front camera and a rear camera. Typically, the front camera is disposed on the front panel of the terminal and the rear camera is disposed on the rear surface of the terminal. In some embodiments, the at least two rear cameras are any one of a main camera, a depth camera, a wide-angle camera and a tele camera, so as to realize that the main camera and the depth camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize a panoramic shooting and Virtual Reality (VR) shooting function or other fusion shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a single-color temperature flash lamp or a double-color temperature flash lamp. The dual-color temperature flash lamp refers to a combination of a warm light flash lamp and a cold light flash lamp, and can be used for light compensation under different color temperatures.
The audio circuit 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of users and environments, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing, or inputting the electric signals to the radio frequency circuit 604 for voice communication. For the purpose of stereo acquisition or noise reduction, a plurality of microphones may be respectively disposed at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pickup microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The speaker may be a conventional thin film speaker or a piezoelectric ceramic speaker. When the speaker is a piezoelectric ceramic speaker, not only the electric signal can be converted into a sound wave audible to humans, but also the electric signal can be converted into a sound wave inaudible to humans for ranging and other purposes. In some embodiments, the audio circuit 607 may also include a headphone jack.
The location component 608 is utilized to locate the current geographic location of the terminal 600 to enable navigation or LBS (Location Based Service, location-based services). The positioning component 608 may be a positioning component based on the United states GPS (Global Positioning System ), the Beidou system of China, the Granati system of Russia, or the Galileo system of the European Union.
A power supply 609 is used to power the various components in the terminal 600. The power source 609 may be alternating current, direct current, disposable battery or rechargeable battery. When the power source 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 further includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyroscope sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 can detect the magnitudes of accelerations on three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of gravitational acceleration in three coordinate axes. The processor 601 may control the touch display screen 605 to display a user interface in a landscape view or a portrait view according to the gravitational acceleration signal acquired by the acceleration sensor 611. The acceleration sensor 611 may also be used for the acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 may collect a 3D motion of the user on the terminal 600 in cooperation with the acceleration sensor 611. The processor 601 may implement the following functions based on the data collected by the gyro sensor 612: motion sensing (e.g., changing UI according to a tilting operation by a user), image stabilization at shooting, game control, and inertial navigation.
The pressure sensor 613 may be disposed at a side frame of the terminal 600 and/or at a lower layer of the touch screen 605. When the pressure sensor 613 is disposed at a side frame of the terminal 600, a grip signal of the terminal 600 by a user may be detected, and a left-right hand recognition or a shortcut operation may be performed by the processor 601 according to the grip signal collected by the pressure sensor 613. When the pressure sensor 613 is disposed at the lower layer of the touch display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 605. The operability controls include at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 614 is used to collect a fingerprint of a user, and the processor 601 identifies the identity of the user based on the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user based on the collected fingerprint. Upon recognizing that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying for and changing settings, etc. The fingerprint sensor 614 may be provided on the front, back, or side of the terminal 600. When a physical key or vendor Logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical key or vendor Logo.
The optical sensor 615 is used to collect ambient light intensity. In one embodiment, processor 601 may control the display brightness of touch display 605 based on the intensity of ambient light collected by optical sensor 615. Specifically, when the intensity of the ambient light is high, the display brightness of the touch display screen 605 is turned up; when the ambient light intensity is low, the display brightness of the touch display screen 605 is turned down. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 based on the ambient light intensity collected by the optical sensor 615.
A proximity sensor 616, also referred to as a distance sensor, is typically provided on the front panel of the terminal 600. The proximity sensor 616 is used to collect the distance between the user and the front of the terminal 600. In one embodiment, when the proximity sensor 616 detects a gradual decrease in the distance between the user and the front face of the terminal 600, the processor 601 controls the touch display 605 to switch from the bright screen state to the off screen state; when the proximity sensor 616 detects that the distance between the user and the front surface of the terminal 600 gradually increases, the processor 601 controls the touch display screen 605 to switch from the off-screen state to the on-screen state.
Those skilled in the art will appreciate that the structure shown in fig. 6 is not limiting of the terminal 600 and may include more or fewer components than shown, or may combine certain components, or may employ a different arrangement of components.
In an embodiment of the present application, the foregoing computing device may be a server, and fig. 7 is a schematic structural diagram of a server according to an exemplary embodiment. The server 700 includes a Central Processing Unit (CPU) 701, a system memory 704 including a Random Access Memory (RAM) 702 and a Read Only Memory (ROM) 703, and a system bus 705 connecting the system memory 704 and the central processing unit 701. The server 700 also includes a basic input/output system (I/O system) 706, for aiding in the transfer of information between the various devices within the computer, and a mass storage device 707 for storing an operating system 713, application programs 714, and other program modules 715.
The basic input/output system 706 includes a display 708 for displaying information and an input device 709, such as a mouse, keyboard, or the like, for a user to input information. Wherein the display 708 and the input device 709 are coupled to the central processing unit 701 through an input output controller 710 coupled to a system bus 705. The basic input/output system 706 may also include an input/output controller 710 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 710 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 707 is connected to the central processing unit 701 through a mass storage controller (not shown) connected to the system bus 705. The mass storage device 707 and its associated computer readable media provide non-volatile storage for the server 700. That is, the mass storage device 707 may include a computer readable medium (not shown) such as a hard disk or CD-ROM drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory, or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the one described above. The system memory 704 and mass storage device 707 described above may be collectively referred to as memory.
The server 700 may also operate via a network, such as the internet, connected to a remote computer on the network, in accordance with various embodiments of the present application. I.e. the server 700 may be connected to the network 712 via a network interface unit 711 connected to the system bus 705, or alternatively, the network interface unit 711 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs stored in the memory, and the central processor 701 implements the audio generation method provided by the embodiment of the present application by executing the one or more programs.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
In the present disclosure, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The term "plurality" refers to two or more, unless explicitly defined otherwise. "A refers to B" means that A is the same as B, or that A is simply deformed based on B. The term "and/or" in the present application is merely an association relation describing the association object, and indicates that three kinds of relations may exist, for example, a and/or B may indicate: a exists alone, A and B exist together, and B exists alone. In addition, the character "/" herein generally indicates that the front and rear associated objects are an "or" relationship.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. An audio generation method, comprising:
When in sample audio, M adjacent audio frames before and N adjacent audio frames after the starting point of a sample audio frame set corresponding to any phoneme are each a pitch frame, the existence of a front continuous sound of any phoneme is determined, the pitch frames are audio frames with the pitch being more than 0, N and M are both positive integers, the sample audio frame set corresponding to any phoneme is a set formed by continuous one or more audio frames of any phoneme in the pronunciation process of any phoneme, and the speech content comprises the speech frames of any phoneme; when M adjacent audio frames before and N adjacent audio frames after the end point of the sample audio frame set corresponding to any phoneme in the sample audio are all pitch frames, determining that the post-sound exists in any phoneme;
detecting whether continuous tones exist between each phoneme and adjacent phonemes in the sample audio to obtain continuous tone detection results; based on the continuous sound detection result, obtaining a plurality of sample pronunciation information, wherein the plurality of sample pronunciation information comprises at least one second pronunciation information; model training is carried out based on the plurality of sample pronunciation information to obtain an audio synthesis model;
When the target phonemes corresponding to the second pronunciation information and all phonemes adjacent to the target phonemes exist continuous sounds, the second pronunciation information carries continuous sound indicators; when the target phonemes corresponding to the second pronunciation information and all phonemes adjacent to the target phonemes do not have continuous tones, the second pronunciation information does not carry the continuous tone indicator, and the continuous tone indicator is used for indicating whether continuous tones exist between the target phonemes and adjacent phonemes in the pronunciation information, and the adjacent phonemes of any target phonemes comprise a previous phoneme and a next phoneme of any target phonemes;
Acquiring a plurality of pronunciation information corresponding to at least one initial audio; inputting the plurality of pronunciation information into the audio synthesis model to obtain target audio output by the audio synthesis model, wherein the target audio is obtained by combining the at least one initial audio;
Wherein the plurality of pronunciation information includes at least one first pronunciation information, each of the first pronunciation information includes: the method comprises the steps of enabling a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme and a continuous tone indicator, wherein the audio frame corresponding to each of a plurality of pronunciation information is one audio frame in the target audio.
2. The method of claim 1, wherein each of the second pronunciation information further comprises: the method comprises the steps of determining the pitch of a corresponding audio frame, the content of a target phoneme corresponding to the corresponding audio frame and the content of adjacent phonemes of the target phoneme, wherein the audio frame corresponding to each sample pronunciation information in the plurality of sample pronunciation information is one audio frame in the sample audio.
3. The method of claim 2, wherein obtaining a plurality of sample pronunciation information based on the continuous tone detection result comprises:
Acquiring the pitch of each audio frame in the sample audio;
And generating the plurality of sample pronunciation information based on the pitch of each audio frame and the continuous sound detection result.
4. A method according to any one of claims 1 to 3, wherein the concatenative indicator comprises a front concatenative indicator for indicating whether or not there is a concatenation of a target phoneme in the pronunciation information and a preceding phoneme adjacent thereto, and a rear concatenative indicator for indicating whether or not there is a concatenation of a target phoneme in the pronunciation information and a following phoneme adjacent thereto;
or the continuous tone indicator comprises an indicator for indicating whether continuous tones exist between the target phoneme and the adjacent previous phoneme in the pronunciation information and whether continuous tones exist between the target phoneme and the adjacent next phoneme in the pronunciation information.
5. An audio generating apparatus, comprising:
The analysis module comprises a detection sub-module and a generation sub-module;
The detection sub-module is used for determining that a pre-continuous sound exists in any phoneme when M adjacent audio frames before and N adjacent audio frames after the starting point of a sample audio frame set corresponding to any phoneme in sample audio are all pitch frames, the pitch frames are audio frames with the pitch being more than 0, N and M are positive integers, the sample audio frame set corresponding to any phoneme is a set formed by continuous one or more audio frames of any phoneme in the pronunciation process of the any phoneme, and the speech content comprises the voice frames of the any phoneme; when M adjacent audio frames before and N adjacent audio frames after the end point of the sample audio frame set corresponding to any phoneme in the sample audio are all pitch frames, determining that the post-sound exists in any phoneme; detecting whether continuous tones exist between each phoneme and adjacent phonemes in the sample audio to obtain continuous tone detection results;
The generating sub-module is used for obtaining a plurality of sample pronunciation information based on the continuous sound detection result, wherein the plurality of sample pronunciation information comprises at least one second pronunciation information;
the training module is used for carrying out model training based on the plurality of sample pronunciation information to obtain an audio synthesis model;
When the target phonemes corresponding to the second pronunciation information and all phonemes adjacent to the target phonemes exist continuous sounds, the second pronunciation information carries continuous sound indicators; when the target phonemes corresponding to the second pronunciation information and all phonemes adjacent to the target phonemes do not have continuous tones, the second pronunciation information does not carry the continuous tone indicator, and the continuous tone indicator is used for indicating whether continuous tones exist between the target phonemes and adjacent phonemes in the pronunciation information, and the adjacent phonemes of any target phonemes comprise a previous phoneme and a next phoneme of any target phonemes;
the acquisition module is used for acquiring a plurality of pronunciation information corresponding to at least one initial audio;
The processing module is used for inputting the plurality of pronunciation information into the audio synthesis model to obtain target audio output by the audio synthesis model, wherein the target audio is obtained by combining the at least one initial audio;
Wherein the plurality of pronunciation information includes at least one first pronunciation information, each of the first pronunciation information includes: the method comprises the steps of enabling a pitch of a corresponding audio frame, content of a target phoneme corresponding to the corresponding audio frame, content of adjacent phonemes of the target phoneme and a continuous tone indicator, wherein the audio frame corresponding to each of a plurality of pronunciation information is one audio frame in the target audio.
6. The apparatus of claim 5, wherein each of the second pronunciation information further comprises: the method comprises the steps of determining the pitch of a corresponding audio frame, the content of a target phoneme corresponding to the corresponding audio frame and the content of adjacent phonemes of the target phoneme, wherein the audio frame corresponding to each sample pronunciation information in the plurality of sample pronunciation information is one audio frame in the sample audio.
7. The apparatus of claim 6, wherein the analysis module further comprises:
an acquisition sub-module for acquiring a pitch of each audio frame in the sample audio;
The generating sub-module is further configured to generate the plurality of sample pronunciation information based on the pitch of each audio frame and the continuous tone detection result.
8. The apparatus according to any one of claims 5 to 7, wherein the concatenative indicator includes a front concatenative indicator for indicating whether or not a concatenation sound exists between a target phoneme in the present pronunciation information and a preceding phoneme adjacent thereto, and a rear concatenative indicator for indicating whether or not a concatenation sound exists between a target phoneme in the present pronunciation information and a following phoneme adjacent thereto;
or the continuous tone indicator comprises an indicator for indicating whether continuous tones exist between the target phoneme and the adjacent previous phoneme in the pronunciation information and whether continuous tones exist between the target phoneme and the adjacent next phoneme in the pronunciation information.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when executed by a processor, causes the processor to implement the audio generation method according to any one of claims 1 to 4.
10. A computing device, the computing device comprising a processor and a memory;
The memory stores computer instructions; the processor executing computer instructions stored in the memory, causes the computing device to perform the audio generation method of any one of claims 1 to 4.
CN201911267158.4A 2019-12-11 2019-12-11 Audio generation method, device, computer readable storage medium and computing equipment Active CN111028823B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911267158.4A CN111028823B (en) 2019-12-11 2019-12-11 Audio generation method, device, computer readable storage medium and computing equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911267158.4A CN111028823B (en) 2019-12-11 2019-12-11 Audio generation method, device, computer readable storage medium and computing equipment

Publications (2)

Publication Number Publication Date
CN111028823A CN111028823A (en) 2020-04-17
CN111028823B true CN111028823B (en) 2024-06-07

Family

ID=70208741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911267158.4A Active CN111028823B (en) 2019-12-11 2019-12-11 Audio generation method, device, computer readable storage medium and computing equipment

Country Status (1)

Country Link
CN (1) CN111028823B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113035228B (en) * 2021-03-23 2024-08-23 广州酷狗计算机科技有限公司 Acoustic feature extraction method, acoustic feature extraction device, acoustic feature extraction equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4692941A (en) * 1984-04-10 1987-09-08 First Byte Real-time text-to-speech conversion system
CN1257271A (en) * 1998-12-02 2000-06-21 松下电器产业株式会社 Continuous sound processor for Chinese phonetic systhesis
CN1267863A (en) * 1999-03-22 2000-09-27 Lg电子株式会社 Image apparatus with education function and its controlling method
JP2001343987A (en) * 2000-05-31 2001-12-14 Sanyo Electric Co Ltd Method and device for voice synthesis
JP2002333896A (en) * 2001-05-10 2002-11-22 Matsushita Electric Ind Co Ltd Device, system and method for synthesizing voice
CN1455386A (en) * 2002-11-01 2003-11-12 中国科学院声学研究所 Imbedded voice synthesis method and system
CN1938756A (en) * 2004-03-05 2007-03-28 莱塞克技术公司 Prosodic speech text codes and their use in computerized speech systems
CN104464751A (en) * 2014-11-21 2015-03-25 科大讯飞股份有限公司 Method and device for detecting pronunciation rhythm problem
CN104934028A (en) * 2015-06-17 2015-09-23 百度在线网络技术(北京)有限公司 Depth neural network model training method and device used for speech synthesis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104021784B (en) * 2014-06-19 2017-06-06 百度在线网络技术(北京)有限公司 Phoneme synthesizing method and device based on Big-corpus

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4692941A (en) * 1984-04-10 1987-09-08 First Byte Real-time text-to-speech conversion system
CN1257271A (en) * 1998-12-02 2000-06-21 松下电器产业株式会社 Continuous sound processor for Chinese phonetic systhesis
CN1267863A (en) * 1999-03-22 2000-09-27 Lg电子株式会社 Image apparatus with education function and its controlling method
JP2001343987A (en) * 2000-05-31 2001-12-14 Sanyo Electric Co Ltd Method and device for voice synthesis
JP2002333896A (en) * 2001-05-10 2002-11-22 Matsushita Electric Ind Co Ltd Device, system and method for synthesizing voice
CN1455386A (en) * 2002-11-01 2003-11-12 中国科学院声学研究所 Imbedded voice synthesis method and system
CN1938756A (en) * 2004-03-05 2007-03-28 莱塞克技术公司 Prosodic speech text codes and their use in computerized speech systems
CN104464751A (en) * 2014-11-21 2015-03-25 科大讯飞股份有限公司 Method and device for detecting pronunciation rhythm problem
CN104934028A (en) * 2015-06-17 2015-09-23 百度在线网络技术(北京)有限公司 Depth neural network model training method and device used for speech synthesis

Also Published As

Publication number Publication date
CN111028823A (en) 2020-04-17

Similar Documents

Publication Publication Date Title
CN111899706B (en) Audio production method, device, equipment and storage medium
CN110992927B (en) Audio generation method, device, computer readable storage medium and computing equipment
CN110933330A (en) Video dubbing method and device, computer equipment and computer-readable storage medium
CN110556127B (en) Method, device, equipment and medium for detecting voice recognition result
CN110322760B (en) Voice data generation method, device, terminal and storage medium
CN111524501B (en) Voice playing method, device, computer equipment and computer readable storage medium
CN111564152B (en) Voice conversion method and device, electronic equipment and storage medium
CN108829751B (en) Method and device for generating lyrics and displaying lyrics, electronic equipment and storage medium
CN110931048B (en) Voice endpoint detection method, device, computer equipment and storage medium
CN112735429B (en) Method for determining lyric timestamp information and training method of acoustic model
CN112116904B (en) Voice conversion method, device, equipment and storage medium
CN111105788B (en) Sensitive word score detection method and device, electronic equipment and storage medium
CN111428079B (en) Text content processing method, device, computer equipment and storage medium
CN111081277B (en) Audio evaluation method, device, equipment and storage medium
CN111048109A (en) Acoustic feature determination method and apparatus, computer device, and storage medium
CN111223475B (en) Voice data generation method and device, electronic equipment and storage medium
CN110867194B (en) Audio scoring method, device, equipment and storage medium
CN113362836B (en) Vocoder training method, terminal and storage medium
CN111028823B (en) Audio generation method, device, computer readable storage medium and computing equipment
CN113920979B (en) Voice data acquisition method, device, equipment and computer readable storage medium
CN112786025B (en) Method for determining lyric timestamp information and training method of acoustic model
CN111091807B (en) Speech synthesis method, device, computer equipment and storage medium
CN115394285A (en) Voice cloning method, device, equipment and storage medium
CN111212323A (en) Audio and video synthesis method and device, electronic equipment and medium
CN114760493B (en) Method, device and storage medium for adding lyric progress image

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant