CN117672182A - Sound cloning method and system based on artificial intelligence - Google Patents


Info

Publication number: CN117672182A (application CN202410145879.2A); granted as CN117672182B
Authority: CN (China)
Prior art keywords: word, target, sound, pinyin, converted
Legal status: Granted; Active (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: 刘仁勤, 万礼强
Assignees: Tuoshe Technology Group Co ltd; Jiangxi Tuoshi Intelligent Technology Co ltd
Application filed by Tuoshe Technology Group Co ltd and Jiangxi Tuoshi Intelligent Technology Co ltd

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention provides an artificial-intelligence-based sound cloning method and system. An original text is regularized and split, in turn, into a number of sentences to be converted and a number of words to be converted. The pinyin of each word is obtained and labeled with its tone to produce a first label; the initial and final in each word's pinyin are then split apart, the first label of the word's pinyin is assigned to the final, and the initial is labeled separately. Phoneme information, which includes a target label among the first labels, is determined according to preset rules. The phrases are then recombined, and the pause time between the recombined phrases is determined from the user's speaking speed. Finally, the words and their corresponding phoneme information are converted into acoustic features, the acoustic features are converted into a target waveform, and sound cloning is completed from the target waveform.

Description

Sound cloning method and system based on artificial intelligence
Technical Field
The invention belongs to the technical field of sound cloning, and particularly relates to a sound cloning method and system based on artificial intelligence.
Background
Voice cloning can be understood as customized speech synthesis: given an input text, it produces speech that matches the voice of a particular user. Current speech synthesis mainly comprises three modules: a text front end, an acoustic model, and a vocoder. The text front end converts the original text into characters/phonemes, where a character is the smallest meaningful unit of a writing system and a phoneme is the smallest speech unit that distinguishes words. To improve the naturalness of the cloned voice, text preprocessing such as sentence splitting and word segmentation is usually performed manually. However, different people have different habits and possibly different prosody, and manual design relies solely on professional semantic knowledge and experience, which is time-consuming, labor-intensive, and yields poor results.
Disclosure of Invention
In view of the above, embodiments of the invention provide an artificial-intelligence-based sound cloning method and system to solve the prior-art problem that manual design, which relies solely on professional semantic knowledge and experience, is time-consuming, labor-intensive, and yields poor results.
A first aspect of an embodiment of the present invention provides an artificial-intelligence-based sound cloning method, applied to a Chinese scenario, the method including:
Acquiring an original text, and regularizing the original text to obtain a first text;
splitting the first text into a plurality of sentences to be converted according to a preset identifier, and respectively performing word segmentation on the sentences to be converted to obtain a plurality of words to be converted;
the pinyin of the word to be converted is obtained, the pinyin of each word in the word to be converted is marked according to four tones of the pinyin, and a first mark of the pinyin of each word is obtained, wherein at least one first mark exists in the pinyin of the word;
splitting initials and finals in the pinyin of the characters, assigning a first mark of the pinyin of the characters to the finals, and marking the initials in the pinyin of the characters;
determining phoneme information according to a preset rule, wherein the phoneme information comprises target marks in the first marks;
recombining the phrases, and determining the pause time between the recombined phrases according to the speaking speed of the user;
and converting the words and the corresponding phoneme information into acoustic features, converting the acoustic features into target waveforms, and completing sound cloning according to the target waveforms.
Further, the step of obtaining the pinyin of the word to be converted and labeling the pinyin of each word in the word to be converted according to the four tones of pinyin to obtain a first label of the pinyin of each word, wherein at least one first label exists in the pinyin of the word, includes:
Establishing a first mapping model of the tone symbol and each first annotation, wherein the first mapping model is used for inputting the tone symbol and outputting the corresponding first annotation;
and recognizing the tone symbols of the pinyin of each character in the word to be converted, inputting the tone symbols of the pinyin of each character in the word to be converted into the first mapping model, and outputting the corresponding first label.
Further, the step of determining phoneme information according to a preset rule, where the phoneme information includes a target label in the first label includes:
identifying all words and judging whether a target word exists, wherein the target words at least include 'one' (一) and 'not' (不);
if the target word is judged to exist, judging whether the target word is in the word tail or not when the target word is 'one';
if the target word 'one' is judged not to be in the word tail, judging whether the words representing the number exist before and/or after the target word 'one';
if it is judged that no word representing a number exists before and/or after the target word 'one', judging whether the target word 'one' and the following word form a measure-word phrase;
if the target word 'one' and the following word form a measure-word phrase, judging whether the tone of the following word is the fourth tone;
if the tone of the following word is judged to be the fourth tone, defining the target label of the target word 'one' as the label corresponding to the second tone;
if the tone of the following word is judged not to be the fourth tone, defining the target label of the target word 'one' as the label corresponding to the fourth tone;
if the target word is judged to exist and the target word is 'not', judging whether the tone of the word following the target word 'not' is the fourth tone;
if the tone of the word following the target word 'not' is judged to be the fourth tone, defining the target label of the target word 'not' as the label corresponding to the second tone.
Further, if the target word 'one' is judged not to be at the end of the word, the step of judging whether a word representing a number exists before and/or after the target word 'one' further includes:
if the word representing the number exists before and/or after the target word 'one', judging whether the phrase comprising the target word 'one' is a preset phrase or not;
if the phrase comprising the target word 'one' is judged to be the preset phrase, the target mark defining the target word 'one' is the mark corresponding to the preset phrase.
Further, the step of determining phoneme information according to a preset rule, where the phoneme information includes a target label in the first label further includes:
judging whether two or more consecutive third-tone labels exist in the word to be converted, the labels of the initials not being considered;
if two or more consecutive third-tone labels are judged to exist in the word to be converted, then, when two consecutive third-tone labels exist, modifying the label of the first third tone into a second-tone label;
if two or more consecutive third-tone labels are judged to exist in the word to be converted, then, when three consecutive third-tone labels exist, judging whether the corresponding three words are digits;
if the corresponding three words are judged to be digits, modifying the labels of the first and second third tones into second-tone labels;
if the corresponding three words are judged not to be digits, combining the second word with each adjacent word and judging whether the second word combines with the first word or with the third word;
if the second word is judged to combine with the first word, modifying the labels of the first and second third tones into second-tone labels;
and if the second word is judged to combine with the third word, modifying the label of the second third tone into a second-tone label.
Further, the step of recombining the phrases and determining the pause time between the recombined phrases according to the speaking speed of the user comprises the following steps:
acquiring total time for reading out a preset text by a user and time of each mark in the preset text, wherein the marks in the preset text are used for spacing adjacent phrases;
calculating the average pronunciation time of a single word according to the total time, the time of each mark and the word number of the preset text;
clustering the time of each mark in the preset text according to a clustering algorithm, and determining a preset number of target marks according to a clustering result, wherein different target marks correspond to different pause times.
Further, the step of recombining the phrases and determining the pause time between the recombined phrases according to the speaking speed of the user comprises the following steps:
acquiring a first number of recombined phrases between commas, and judging whether the first number is greater than or equal to a second number, wherein the second number is the number of target marks excluding punctuation marks;
if yes, randomly and averagely inserting target marks representing different pause times among the phrases;
If not, sorting all the target marks from small to large according to the pause time, and sequentially selecting all the target marks with the same number as the first number;
randomly and averagely inserting each selected target mark among each phrase.
A second aspect of an embodiment of the present invention provides an artificial-intelligence-based sound cloning system, applied to a Chinese scenario, the system comprising:
the regularization processing module is used for acquiring an original text, and regularizing the original text to obtain a first text;
the first splitting module is used for splitting the first text into a plurality of sentences to be converted according to a preset identifier, and respectively carrying out word segmentation on the sentences to be converted to obtain a plurality of words to be converted;
the labeling module is used for acquiring the pinyin of the word to be converted, labeling the pinyin of each word in the word to be converted according to four tones of the pinyin, and obtaining a first label of the pinyin of each word, wherein at least one first label exists in the pinyin of the word;
the second splitting module is used for splitting initials and finals in the pinyin of the characters, assigning a first mark of the pinyin of the characters to the finals, and marking the initials in the pinyin of the characters;
The first determining module is used for determining phoneme information according to a preset rule, wherein the phoneme information comprises target marks in the first marks;
the second determining module is used for recombining the phrases and determining the pause time among the recombined phrases according to the speaking speed of the user;
and the conversion module is used for converting the words and the corresponding phoneme information into acoustic features, converting the acoustic features into target waveforms, and completing sound cloning according to the target waveforms.
A third aspect of an embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the artificial intelligence based sound cloning method provided in the first aspect.
A fourth aspect of an embodiment of the present invention provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the artificial intelligence based sound cloning method provided in the first aspect when executing the program.
According to the artificial-intelligence-based sound cloning method and system, the original text is regularized and split, in turn, into a number of sentences to be converted and a number of words to be converted; the pinyin of each word is obtained and labeled with its tone to produce a first label; the initial and final in each word's pinyin are split apart, the first label of the word's pinyin is assigned to the final, and the initial is labeled separately; phoneme information, which includes a target label among the first labels, is determined according to preset rules; the phrases are then recombined, and the pause time between the recombined phrases is determined from the user's speaking speed; finally, the words and their corresponding phoneme information are converted into acoustic features, the acoustic features are converted into a target waveform, and sound cloning is completed from the target waveform.
Drawings
FIG. 1 is a flowchart of an artificial intelligence-based sound cloning method according to an embodiment of the present invention;
FIG. 2 is a block diagram of an artificial intelligence based sound cloning system according to a second embodiment of the present invention;
fig. 3 is a block diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order that the invention may be readily understood, a more complete description of the invention will be rendered by reference to the appended drawings. Several embodiments of the invention are presented in the figures. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete.
It will be understood that when an element is referred to as being "mounted" on another element, it can be directly on the other element or intervening elements may also be present. When an element is referred to as being "connected" to another element, it can be directly connected to the other element or intervening elements may also be present. The terms "vertical," "horizontal," "left," "right," and the like are used herein for illustrative purposes only.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Example 1
Referring to fig. 1, fig. 1 shows an artificial-intelligence-based sound cloning method applied to a Chinese scenario, for example a scenario with little colloquial speech such as broadcasting; the sound cloning method specifically includes steps S01 to S07.
Step S01, an original text is obtained, and regularization processing is carried out on the original text to obtain a first text.
Specifically, the original text may be a paragraph of text, a sentence, etc. To facilitate subsequent processing, the original text is first regularized, i.e., converted into standard text; for example, "1996", which contains digits, becomes "one nine nine six" after regularization, and "0.1%" becomes "one thousandth". It should be noted that, depending on language habits, some regularized phrases may not be what the user intends: "211 colleges" may be regularized to "two one one colleges" or to "two hundred and eleven colleges". Such numbers may be displayed in highlighted form for the user to choose from, or a special phrase library may be constructed in advance, so that when a phrase matching the library appears in the text, it is regularized to the corresponding phrase.
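The special-phrase-library idea in step S01 can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name, the digit table, and the library contents are hypothetical, and only digit-by-digit reading (as for years) is handled, not general number words or percentages.

```python
import re

# Hypothetical digit table for digit-by-digit reading, as in years like "1996".
DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

# A special phrase library consulted before the generic rule, so that
# patterns like "211" keep their conventional reading.
SPECIAL_PHRASES = {"211": "二一一"}

def regularize(text: str) -> str:
    """Replace each digit run by its special-phrase or digit-by-digit reading."""
    def replace_number(m: re.Match) -> str:
        num = m.group(0)
        if num in SPECIAL_PHRASES:
            return SPECIAL_PHRASES[num]
        return "".join(DIGITS[d] for d in num)
    return re.sub(r"\d+", replace_number, text)

print(regularize("1996年"))   # -> 一九九六年
print(regularize("211高校"))  # -> 二一一高校
```

A full front end would add rules for cardinals, ordinals, and percentages, and could fall back to highlighting ambiguous numbers for the user, as the description suggests.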
Step S02, splitting the first text into a plurality of sentences to be converted according to a preset identifier, and performing word segmentation on the sentences to be converted respectively to obtain a plurality of words to be converted.
In this embodiment, the preset identifiers may be commas, periods, etc.; the first text is split into several sentences to be converted according to the commas and periods, and the sentences to be converted are then each segmented into words. The sentences may be segmented according to a constructed phrase library; specifically, the phrase library may record a dictionary, and the sentences are segmented according to the phrases in the dictionary. For example, the regularized sentence "the whole country has in total one hundred and twelve two-one-one universities" (全国一共有一百一十二所二幺幺高校) is segmented as "whole country / in total / has / one hundred and twelve / (measure word) / two-one-one / universities".
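Dictionary-based segmentation as described in step S02 is commonly done by forward maximum matching. The sketch below is one such illustration under that assumption (the patent does not name an algorithm); the lexicon here is a toy stand-in for the phrase library.

```python
def segment(sentence, lexicon, max_len=4):
    """Forward maximum matching: greedily take the longest dictionary phrase
    starting at the current position; single characters always match."""
    words, i = [], 0
    while i < len(sentence):
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + n]
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words

lexicon = {"全国", "一共", "一百一十二", "二幺幺", "高校"}
print("/".join(segment("全国一共有一百一十二所二幺幺高校", lexicon, max_len=5)))
# -> 全国/一共/有/一百一十二/所/二幺幺/高校
```

Real systems (e.g. statistical or neural segmenters) handle ambiguity better, but greedy dictionary matching reproduces the example segmentation in the text.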
Step S03, the pinyin of the word to be converted is obtained, the pinyin of each word in the word to be converted is marked according to four tones of the pinyin, and a first mark of the pinyin of each word is obtained, wherein at least one first mark exists in the pinyin of the word.
First, a first mapping model from tone symbols to first labels is established; the model takes a tone symbol as input and outputs the corresponding first label. The tone symbol of the pinyin of each word in the word to be converted is recognized and input into the first mapping model, which outputs the corresponding first label. Specifically, pinyin has four tones, namely the first, second, third and fourth tones, also called yin ping, yang ping, shang sheng and qu sheng. In this embodiment, the first tone is labeled 1, the second tone 2, the third tone 3, and the fourth tone 4. It can be understood that the pinyin labeling result of the example sentence is "quan2 guo2 yi2 gong4 you3 yi4 bai3 yi1 shi2 er4 suo3 er4 yao1 yao1 gao1 xiao4"; note that "two one one" (二幺幺) is treated as a special phrase, so its pinyin is not "er4 yi1 yi1".
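The "first mapping model" from tone symbols to numeric labels can be illustrated by decomposing a tone-marked pinyin syllable with Unicode normalization. This is a hedged sketch (the function name and the use of `unicodedata` are my choices, not the patent's): the combining diacritic is mapped to its tone number, and 5 is used here for the neutral tone.

```python
import unicodedata

# Combining diacritics for the four pinyin tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}

def number_tone(syllable: str):
    """Return (bare syllable, tone number); tone 5 means no tone mark."""
    base, tone = [], 5
    for ch in unicodedata.normalize("NFD", syllable):  # split letter + diacritic
        if ch in TONE_MARKS:
            tone = TONE_MARKS[ch]
        else:
            base.append(ch)
    # Recompose so letters like ü survive intact.
    return unicodedata.normalize("NFC", "".join(base)), tone

print(number_tone("guó"))  # -> ('guo', 2)
print(number_tone("hǎo"))  # -> ('hao', 3)
```

Libraries such as pypinyin provide this numbered-tone style directly; the point here is only the symbol-to-label mapping the step describes.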
Step S04, splitting initials and finals in the pinyin of the characters, assigning a first label of the pinyin of the characters to the finals, and labeling the initials in the pinyin of the characters.
Continuing the above example, the initials and finals in the pinyin of each word are split, giving "q uan2 g uo2 y i2 g ong4 y ou3 y i4 b ai3 y i1 sh i2 er4 s uo3 er4 y ao1 y ao1 g ao1 x iao4"; the first label of each word's pinyin is assigned to its final, after which the initials are labeled, giving "q0 uan2 g0 uo2 y0 i2 g0 ong4 y0 ou3 y0 i4 b0 ai3 y0 i1 sh0 i2 er4 s0 uo3 er4 y0 ao1 y0 ao1 g0 ao1 x0 iao4". In this embodiment the initials are labeled 0; an initial is still a phoneme and a necessary part of the subsequent analysis. In addition, because the initial labels are all 0, the initials are not considered in the subsequent tone-change processing, i.e., they are skipped.
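The initial/final split of step S04 can be sketched as a longest-prefix match against the pinyin initials. A minimal illustration under stated assumptions: the initial table and function name are mine, zero-initial syllables such as "er4" are emitted as a bare final, and the 0 label on initials follows the scheme above.

```python
# Pinyin initials; the two-letter ones (zh/ch/sh) must be tried first.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def split_syllable(pinyin_num: str):
    """Split a tone-numbered syllable like 'guo2' into [initial+'0', final+tone].

    Zero-initial syllables (e.g. 'er4') are returned as a single bare final."""
    body, tone = pinyin_num[:-1], pinyin_num[-1]
    for ini in INITIALS:
        if body.startswith(ini) and len(body) > len(ini):
            return [f"{ini}0", f"{body[len(ini):]}{tone}"]
    return [f"{body}{tone}"]

tokens = []
for syl in "quan2 guo2 yi2 gong4 er4".split():
    tokens += split_syllable(syl)
print(" ".join(tokens))  # -> q0 uan2 g0 uo2 y0 i2 g0 ong4 er4
```

Because every initial carries the constant label 0, later tone-sandhi passes can simply skip any token ending in 0, exactly as the description says.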
And step S05, determining phoneme information according to a preset rule, wherein the phoneme information comprises target marks in the first marks.
Specifically, all words are recognized and whether a target word exists is judged, wherein the target words at least include 'one' (一) and 'not' (不); recognition of the target word may be realized by an image processing method, an encoding method, or the like;
if the target word is judged to exist and the target word is 'one', whether 'one' is at the end of the word is judged; for a special phrase, 'one' keeps its corresponding pinyin and its pronunciation is not changed;
if the target word 'one' is judged not to be at the end of the word, whether a word representing a number exists before and/or after it is judged;
if no word representing a number exists before and/or after the target word 'one', whether 'one' and the following word form a measure-word phrase is judged; measure words that combine with 'one' include, for example, 根, 条, 颗, etc.; the words that combine with 'one' may be summarized in advance into a database, against which this step compares;
if 'one' and the following word form a measure-word phrase, whether the tone of the following word is the fourth tone is judged, i.e., whether its label is 4;
if the tone of the following word is judged to be the fourth tone, the target label of 'one' is defined as the label corresponding to the second tone, i.e., 2; for example, in the measure-word phrase 'one cun' (一寸), the target label of 'one' is 2;
if the tone of the following word is judged not to be the fourth tone, the target label of 'one' is defined as the label corresponding to the fourth tone;
if the target word is judged to exist and the target word is 'not', whether the tone of the word following 'not' is the fourth tone is judged;
if the tone of the word following 'not' is judged to be the fourth tone, the target label of 'not' is defined as the label corresponding to the second tone, as in 'not be' (不是).
If the target word 'one' is judged not to be at the end of the word, the step of judging whether a word representing a number exists before and/or after the target word 'one' further includes:
if a word representing a number is judged to exist before and/or after the target word 'one', judging whether the phrase including 'one' is a preset phrase; the preset phrase may be, for example, '211' and the like;
if the phrase including the target word 'one' is judged to be a preset phrase, the target label of 'one' is defined as the label corresponding to the preset phrase.
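The core '一'/'不' tone-change rules above reduce to a small function. This is a minimal sketch, not the patent's implementation: the function and argument names are hypothetical, and the word-final, ordinal/number, and preset-phrase checks are assumed to have been applied beforehand so that only the tone of the following syllable matters here.

```python
def sandhi_yi_bu(chars_with_tones):
    """Apply the '一'/'不' rules to a list of (character, tone) pairs.

    一 before a 4th tone -> 2nd tone, otherwise -> 4th tone;
    不 before a 4th tone -> 2nd tone, otherwise unchanged."""
    out = list(chars_with_tones)
    for i, (ch, _tone) in enumerate(out[:-1]):   # last char has no follower
        next_tone = out[i + 1][1]
        if ch == "一":
            out[i] = (ch, 2 if next_tone == 4 else 4)
        elif ch == "不" and next_tone == 4:
            out[i] = (ch, 2)
    return out

print(sandhi_yi_bu([("一", 1), ("寸", 4)]))  # 一寸 -> yi2 cun4
print(sandhi_yi_bu([("不", 4), ("是", 4)]))  # 不是 -> bu2 shi4
```

The branch structure mirrors the claim text: the fourth-tone test on the following word decides between the second-tone and fourth-tone labels for 一, and triggers the second-tone label for 不.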
In addition, the step of determining phoneme information according to a preset rule, where the phoneme information includes a target label in the first label further includes:
judging whether two or more consecutive third-tone labels exist in the word to be converted, i.e., whether at least two consecutive labels are 3, the labels of the initials not being considered;
if two or more consecutive third-tone labels are judged to exist in the word to be converted, then, when two consecutive third-tone labels exist, the label of the first third tone is modified into a second-tone label, for example 'hello' (你好);
if two or more consecutive third-tone labels are judged to exist in the word to be converted, then, when three consecutive third-tone labels exist, whether the corresponding three words are digits is judged;
if the corresponding three words are judged to be digits, the labels of the first and second third tones are modified into second-tone labels, for example '999';
if the corresponding three words are judged not to be digits, the second word is combined with each adjacent word, and whether the second word combines with the first word or with the third word is judged;
if the second word is judged to combine with the first word, the labels of the first and second third tones are modified into second-tone labels, for example 'wristwatch factory' (手表厂);
if the second word is judged to combine with the third word, the label of the second third tone is modified into a second-tone label, for example 'paper tiger' (纸老虎).
When judging how adjacent words combine, the candidate pairs can be matched against the phrases in the phrase library. Taking 'paper tiger' (纸老虎) as an example, both candidate pairs are matched against the library; since 'tiger' (老虎) is found there, the combination of adjacent words is evidently 'tiger', i.e., the second word combines with the third.
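The consecutive-third-tone rules above can likewise be sketched as a small function. A hedged illustration: names are mine, the run is assumed to be exactly two or three syllables (initials already skipped), and the digit test and phrase-library lookup are collapsed into the two boolean cases the claims distinguish.

```python
def third_tone_sandhi(tones, pair_with_next=None):
    """Resolve runs of third tones.

    Two third tones: the first becomes second tone (你好 -> ni2 hao3).
    Three third tones: if the middle character groups with the *following*
    one (纸|老虎), only the middle changes; if it groups with the *preceding*
    one (手表|厂) or the run is digits (999), the first two change."""
    t = list(tones)
    if len(t) == 2 and t == [3, 3]:
        t[0] = 2
    elif len(t) == 3 and t == [3, 3, 3]:
        if pair_with_next:
            t[1] = 2
        else:
            t[0], t[1] = 2, 2
    return t

print(third_tone_sandhi([3, 3]))                           # -> [2, 3]
print(third_tone_sandhi([3, 3, 3], pair_with_next=False))  # -> [2, 2, 3]  手表厂
print(third_tone_sandhi([3, 3, 3], pair_with_next=True))   # -> [3, 2, 3]  纸老虎
```

In a full system `pair_with_next` would be decided by the phrase-library match described in the text (老虎 is in the library, 纸老 is not).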
Step S06, recombining the phrases, and determining the pause time between the recombined phrases according to the speaking speed of the user.
Before this step, the total time the user takes to read a preset text aloud and the time at each mark position in the preset text are obtained, where the marks in the preset text separate adjacent phrases. It can be understood that the marks in the preset text follow the user's speaking habits and may be placed by the user himself; the user then reads the preset text aloud, and the pause information at the mark positions is deliberately collected in order to learn the user's habits;
the average pronunciation time of a single word is calculated from the total time, the time at each mark, and the number of words in the preset text, i.e., the times at the marks are subtracted from the total time and the result is divided by the number of words in the preset text;
clustering the time of each mark in the preset text according to a clustering algorithm, and determining a preset number of target marks according to a clustering result, wherein different target marks correspond to different pause times, and in addition, the pause time of each punctuation mark needs to be counted.
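The per-word timing arithmetic just described is simple enough to state directly. A minimal sketch (function name mine): subtract all measured pause times from the total reading time, then divide by the character count of the preset text.

```python
def average_char_time(total_time, pause_times, char_count):
    """Average per-word pronunciation time for step S06:
    (total reading time - sum of pauses at the marks) / number of words."""
    return (total_time - sum(pause_times)) / char_count

# e.g. a 20 s recording with three marked pauses over a 30-word preset text
print(average_char_time(20.0, [0.8, 0.5, 0.7], 30))  # -> 0.6
```

The subsequent grouping of mark times into a preset number of target marks is a one-dimensional clustering problem; any standard algorithm (e.g. k-means over the pause durations) would serve, with each cluster center becoming one target mark's pause time.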
Specifically, the recombined phrases may be determined from the collected pause information: a mapping relationship between the number of words and the number of pauses is established from the pause information, and the number of pauses for the sentence whose phrases are to be recombined is then determined from this mapping. If that number is smaller than in the previous word-segmentation result, then, taking "whole country / in total / has / one hundred and twelve / (measure word) / two-one-one / universities" as an example, where the number of segmentation marks "/" is 6, when the corresponding number of pauses is 4, two segmentation marks "/" are deleted, e.g., giving "whole country / in total has / one hundred and twelve (measure word) / two-one-one / universities", so that the whole is more natural and better matches the user's speaking habits.
Then, a first number of the word groups with punctuation marks being recombined among commas is obtained, and whether the first number is larger than or equal to a second number is judged, wherein the second number is the number of the target marks without the punctuation marks;
if yes, randomly and averagely inserting target marks representing different pause times among the phrases;
if not, sorting all the target marks from small to large according to the pause time, and sequentially selecting all the target marks with the same number as the first number;
randomly and averagely inserting each selected target mark among each phrase.
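The insertion rule in the three steps above can be sketched as follows (a hedged illustration: names are mine, target marks are represented by their pause times in seconds, and "randomly and averagely" is read as an even round-robin allocation in shuffled order):

```python
import random

def insert_pause_marks(phrases, target_marks):
    """Insert pause marks between recombined phrases.

    If the number of gaps (the 'first number') is at least the number of
    target-mark kinds (the 'second number'), all kinds are used; otherwise
    only the shortest-pause kinds, taken in ascending order, are used."""
    gaps = len(phrases) - 1
    if gaps >= len(target_marks):
        pool = list(target_marks)
    else:
        pool = sorted(target_marks)[:gaps]     # shortest pauses first
    marks = [pool[i % len(pool)] for i in range(gaps)]  # even allocation
    random.shuffle(marks)                      # random placement
    out = [phrases[0]]
    for phrase, mark in zip(phrases[1:], marks):
        out += [mark, phrase]
    return out

random.seed(0)
print(insert_pause_marks(["全国", "一共有", "一百一十二所", "二幺幺", "高校"],
                         [0.3, 0.6]))
```

At synthesis time each inserted value would become silence of that duration between the two phrases.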
And S07, converting the words and the corresponding phoneme information into acoustic features, converting the acoustic features into target waveforms, and completing sound cloning according to the target waveforms.
Specifically, the characters/phonemes are converted into acoustic features, such as a linear spectrogram, a mel spectrogram, or LPC features, by an acoustic model. Acoustic models are mainly divided into autoregressive and non-autoregressive models: autoregressive models include Tacotron, Tacotron 2, Transformer TTS, etc., while non-autoregressive models include FastSpeech, SpeedySpeech, FastPitch, FastSpeech 2, etc. The acoustic features are then converted into waveforms by a vocoder, which is likewise divided into autoregressive and non-autoregressive models: autoregressive vocoders include WaveNet, WaveRNN, LPCNet, etc., and non-autoregressive vocoders include Parallel WaveGAN, Multi-Band MelGAN, Style MelGAN, HiFi-GAN, etc.
In summary, according to the artificial-intelligence-based sound cloning method of the above embodiment of the invention, the original text is regularized and split, in turn, into a number of sentences to be converted and a number of words to be converted; the pinyin of each word is obtained and labeled with its tone to produce a first label; the initial and final in each word's pinyin are split apart, the first label of the word's pinyin is assigned to the final, and the initial is labeled separately; phoneme information, which includes a target label among the first labels, is determined according to preset rules; the phrases are then recombined, and the pause time between the recombined phrases is determined from the user's speaking speed; finally, the words and their corresponding phoneme information are converted into acoustic features, the acoustic features are converted into a target waveform, and sound cloning is completed from the target waveform.
Example two
Referring to fig. 2, fig. 2 is a block diagram of an artificial-intelligence-based sound cloning system 200 according to a second embodiment of the present invention; the artificial-intelligence-based sound cloning system 200 is applied in a Chinese scenario and includes: the regularization processing module 21, the first splitting module 22, the labeling module 23, the second splitting module 24, the first determining module 25, the second determining module 26 and the conversion module 27, wherein:
The regularization processing module 21 is configured to obtain an original text, and perform regularization processing on the original text to obtain a first text;
the first splitting module 22 is configured to split the first text into a plurality of sentences to be converted according to a preset identifier, and perform word segmentation processing on the sentences to be converted to obtain a plurality of words to be converted;
the labeling module 23 is configured to obtain pinyin of the word to be converted, and label the pinyin of each word in the word to be converted according to four tones of the pinyin to obtain a first label of the pinyin of each word, where at least one first label exists in the pinyin of the word;
a second splitting module 24, configured to split an initial consonant and a final in the pinyin of the word, assign a first label of the pinyin of the word to the final, and label the initial consonant in the pinyin of the word;
a first determining module 25, configured to determine phoneme information according to a preset rule, where the phoneme information includes a target label in the first label;
a second determining module 26, configured to recombine the phrases and determine a pause time between the recombined phrases according to the speaking speed of the user;
the conversion module 27 is configured to convert the acoustic features into target waveforms according to the words and the corresponding phoneme information, and complete sound cloning according to the target waveforms.
Further, in some alternative embodiments of the present invention, the labeling module 23 includes:
the first mapping model building unit is used for building a first mapping model of the tone symbol and each first annotation, wherein the first mapping model is used for inputting the tone symbol and outputting the corresponding first annotation;
the recognition unit is used for recognizing the tone symbols of the pinyin of each word in the word to be converted, inputting the tone symbols of the pinyin of each word in the word to be converted into the first mapping model, and outputting the corresponding first label.
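The first mapping model described by the two units above is in essence a lookup from pinyin tone diacritics to tone labels. A minimal sketch follows; the `tone_of` helper, the diacritic table, and the use of 5 for the neutral tone are illustrative assumptions, not part of the patent:

```python
# Lookup table from pinyin tone diacritics to tone numbers 1-4.
# The table contents and the helper name are illustrative assumptions.
TONE_MARKS = {
    1: "āēīōūǖ",
    2: "áéíóúǘ",
    3: "ǎěǐǒǔǚ",
    4: "àèìòùǜ",
}

def tone_of(syllable: str) -> int:
    """Return the tone label of a pinyin syllable; 5 denotes the neutral tone."""
    for tone, marks in TONE_MARKS.items():
        if any(ch in marks for ch in syllable):
            return tone
    return 5  # no diacritic found: neutral tone

print(tone_of("mā"), tone_of("hǎo"), tone_of("ma"))  # 1 3 5
```

In practice the recognition unit would first normalize the pinyin string (e.g. decomposed vs. composed Unicode forms) before consulting such a table.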
Further, in some alternative embodiments of the present invention, the first determining module 25 includes:
a first judging unit, configured to identify all words and judge whether a target word exists, where the target word includes at least "one" (一) and "no" (不);
the second judging unit is used for judging, if it is judged that the target word exists, whether the target word "one" is at the end of the word when the target word is "one";
a third judging unit, configured to judge whether a word representing a number exists before and/or after the target word "one" if it is judged that the target word "one" is not at the end of the word;
a fourth judging unit, configured to judge whether the phrase formed by the target word "one" and the following word is a measure-word phrase if it is judged that no word representing a number exists before and/or after the target word "one";
a fifth judging unit, configured to judge whether the tone of the following word is the fourth tone if it is judged that the phrase formed by the target word "one" and the following word is a measure-word phrase;
the first defining unit is used for defining the target label of the target word "one" as the label corresponding to the second tone if the tone of the following word is judged to be the fourth tone;
the second defining unit is used for defining the target label of the target word "one" as the label corresponding to the fourth tone if the tone of the following word is judged not to be the fourth tone;
a sixth judging unit, configured to judge whether the tone of the word following the target word "no" is the fourth tone when the target word is "no" if it is judged that the target word exists;
and the third defining unit is used for defining the target label of the target word "no" as the label corresponding to the second tone if the tone of the word following the target word "no" is judged to be the fourth tone.
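The decision chain of the units above amounts to the standard tone-sandhi rules for "一" and "不". A hedged Python sketch; the function names and the simplified signatures are assumptions, and the full module would also handle the numeric and preset-phrase contexts described elsewhere:

```python
def sandhi_yi(next_tone: int, at_word_end: bool = False) -> int:
    """Target label for '一': tone 1 at the end of a word; otherwise
    tone 2 before a 4th-tone syllable and tone 4 before tones 1-3."""
    if at_word_end:
        return 1
    return 2 if next_tone == 4 else 4

def sandhi_bu(next_tone: int) -> int:
    """Target label for '不': tone 2 before a 4th tone, else tone 4."""
    return 2 if next_tone == 4 else 4

print(sandhi_yi(4), sandhi_yi(1), sandhi_bu(4), sandhi_bu(3))  # 2 4 2 4
```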
Further, in some optional embodiments of the present invention, the first determining module 25 further includes:
a seventh judging unit, configured to judge whether the phrase including the target word "one" is a preset phrase if it is judged that a word representing a number exists before and/or after the target word "one";
and the fourth defining unit is used for defining the target label of the target word "one" as the label corresponding to the preset phrase if it is judged that the phrase including the target word "one" is a preset phrase.
Further, in some optional embodiments of the present invention, the first determining module 25 further includes:
an eighth judging unit, configured to judge whether at least two consecutive labels in the word to be converted are third tones, where the labeling result of the initial consonants is not considered;
the first modification unit is used for modifying, if it is judged that at least two consecutive labels in the word to be converted are third tones, the label of the first third tone into the label of the second tone when there are exactly two consecutive third-tone labels;
a ninth judging unit, configured to judge, if it is judged that at least two consecutive labels in the word to be converted are third tones, whether the corresponding three characters are numbers when there are three consecutive third-tone labels;
the second modification unit is used for modifying the labels of the first and second third tones into labels of the second tone if it is judged that the corresponding three characters are numbers;
a tenth judging unit, configured to combine the second character with its adjacent characters respectively if the corresponding three characters are not numbers, and judge whether the second character combines with the first character or with the third character;
a third modification unit, configured to modify the labels of both the first and the second third tone into labels of the second tone if it is judged that the second character combines with the first character;
and a fourth modification unit, configured to modify the label of the second third tone into a label of the second tone if it is judged that the second character combines with the third character.
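The eighth through tenth judging units implement the familiar third-tone sandhi over runs of third tones. A simplified sketch; the boolean flags stand in for the number test and the word-grouping test described above, and the function name is illustrative:

```python
def third_tone_sandhi(tones, is_number=False, second_binds_left=True):
    """Rewrite a run of third-tone labels.
    Two 3rds -> [2, 3].  Three 3rds -> [2, 2, 3] when the characters are
    numbers or the middle character groups with the first; otherwise the
    middle character groups with the third and the result is [3, 2, 3]."""
    tones = list(tones)
    if tones == [3, 3]:
        return [2, 3]
    if tones == [3, 3, 3]:
        return [2, 2, 3] if (is_number or second_binds_left) else [3, 2, 3]
    return tones  # no run of third tones: leave labels unchanged
```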
Further, in some alternative embodiments of the present invention, the artificial intelligence based sound cloning system 200 further comprises:
the acquisition module is used for acquiring the total time for the user to read out a preset text and the duration of each mark in the preset text, wherein the marks in the preset text are used to space adjacent phrases;
the calculation module is used for calculating the average pronunciation time of a single character according to the total time, the duration of each mark, and the number of characters in the preset text;
and the third determining module is used for clustering the durations of the marks in the preset text according to a clustering algorithm, and determining a preset number of target marks according to the clustering result, wherein different target marks correspond to different pause times.
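The calculation and clustering steps can be sketched with a stdlib-only 1-D k-means. The function names, the default of three clusters, and the initialisation scheme are assumptions; the patent only requires "a clustering algorithm":

```python
import statistics

def average_char_time(total_time, mark_times, char_count):
    """Average pronunciation time of one character: total reading time
    minus all measured pause time, divided by the character count."""
    return (total_time - sum(mark_times)) / char_count

def cluster_pause_marks(times, k=3, iters=20):
    """Minimal 1-D k-means: group measured pause durations into k levels;
    each returned centre is the pause time of one target mark."""
    data = sorted(times)
    # spread the initial centres across the sorted data
    centres = [data[i * (len(data) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for t in data:
            nearest = min(range(k), key=lambda j: abs(t - centres[j]))
            groups[nearest].append(t)
        centres = [statistics.mean(g) if g else centres[i]
                   for i, g in enumerate(groups)]
    return sorted(centres)
```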
Further, in some alternative embodiments of the present invention, the second determining module 26 includes:
an eleventh judging unit, configured to obtain a first number, namely the number of recombined phrases between commas serving as punctuation marks, and judge whether the first number is greater than or equal to a second number, where the second number is the number of target marks excluding the punctuation marks;
the first inserting unit is used for randomly and evenly inserting target marks representing different pause times between the phrases when it is judged that the first number is greater than or equal to the second number;
the sorting unit is used for sorting all the target marks from small to large by pause time when it is judged that the first number is not greater than or equal to the second number, and sequentially selecting target marks equal in number to the first number;
and the second inserting unit is used for randomly and evenly inserting the selected target marks between the phrases.
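The logic of the eleventh judging unit and the two inserting units can be sketched as follows; `random.shuffle` stands in for "randomly and evenly inserting", and the function name and signature are illustrative assumptions:

```python
import random

def insert_pause_marks(phrases, target_marks, seed=None):
    """Insert one pause mark into each gap between recombined phrases.
    If there are at least as many gaps as target marks, all marks are
    used in roughly equal proportion; otherwise only the marks with the
    shortest pause times are used, one per gap."""
    rng = random.Random(seed)
    gaps = len(phrases) - 1
    if gaps >= len(target_marks):
        chosen = [target_marks[i % len(target_marks)] for i in range(gaps)]
    else:
        chosen = sorted(target_marks)[:gaps]  # shortest pauses first
    rng.shuffle(chosen)
    out = [phrases[0]]
    for phrase, mark in zip(phrases[1:], chosen):
        out += [mark, phrase]
    return out
```

The returned list interleaves phrases with pause-time marks; phrase order is always preserved and only the mark placement is randomised.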
Example III
In another aspect, referring to fig. 3, an electronic device according to a third embodiment of the present invention includes a memory 20, a processor 10, and a computer program 30 stored on the memory and capable of running on the processor, where the processor 10 implements the artificial intelligence-based sound cloning method described above when executing the computer program 30.
In some embodiments, the processor 10 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or another data processing chip, and is configured to run program code stored in the memory 20 or process data, for example to execute an access restriction program.
The memory 20 includes at least one type of readable storage medium including flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 20 may in some embodiments be an internal storage unit of the electronic device, such as a hard disk of the electronic device. The memory 20 may also be an external storage device of the electronic device in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like. Further, the memory 20 may also include both internal storage units and external storage devices of the electronic device. The memory 20 may be used not only for storing application software of an electronic device and various types of data, but also for temporarily storing data that has been output or is to be output.
It should be noted that the structure shown in fig. 3 does not constitute a limitation of the electronic device, and in other embodiments the electronic device may comprise fewer or more components than shown, or may combine certain components, or may have a different arrangement of components.
The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the artificial intelligence based sound cloning method as described above.
Those of skill in the art will appreciate that the logic and/or steps represented in the flow diagrams or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field programmable gate arrays (FPGAs), and the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiments or examples. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing examples illustrate only a few embodiments of the invention, which are described in detail but are not therefore to be construed as limiting the scope of the invention. It should be noted that several variations and modifications can be made by those of ordinary skill in the art without departing from the spirit of the invention, all of which fall within the scope of the invention. Accordingly, the scope of protection of the present invention is determined by the appended claims.

Claims (10)

1. An artificial intelligence-based sound cloning method, which is applied to a Chinese scene, comprises the following steps:
Acquiring an original text, and regularizing the original text to obtain a first text;
splitting the first text into a plurality of sentences to be converted according to a preset identifier, and respectively performing word segmentation on the sentences to be converted to obtain a plurality of words to be converted;
the pinyin of the word to be converted is obtained, the pinyin of each word in the word to be converted is marked according to four tones of the pinyin, and a first mark of the pinyin of each word is obtained, wherein at least one first mark exists in the pinyin of the word;
splitting initials and finals in the pinyin of the characters, assigning a first mark of the pinyin of the characters to the finals, and marking the initials in the pinyin of the characters;
determining phoneme information according to a preset rule, wherein the phoneme information comprises target marks in the first marks;
recombining the phrases, and determining the pause time between the recombined phrases according to the speaking speed of the user;
and converting the acoustic features into target waveforms according to the words and the corresponding phoneme information, and completing sound cloning according to the target waveforms.
2. The artificial intelligence based sound cloning method of claim 1, wherein the steps of obtaining the pinyin of the word to be converted, and labeling the pinyin of each word in the word to be converted according to four tones of the pinyin to obtain a first label of the pinyin of each word, wherein at least one first label exists in the pinyin of each word comprises:
Establishing a first mapping model of the tone symbol and each first annotation, wherein the first mapping model is used for inputting the tone symbol and outputting the corresponding first annotation;
and recognizing the tone symbols of the pinyin of each character in the word to be converted, inputting the tone symbols of the pinyin of each character in the word to be converted into the first mapping model, and outputting the corresponding first label.
3. The artificial intelligence based sound cloning method of claim 2, wherein the step of determining phoneme information including the target label in the first label according to a preset rule comprises:
identifying all words and judging whether a target word exists, wherein the target word at least comprises "one" and "no";
if the target word is judged to exist, judging whether the target word is in the word tail or not when the target word is 'one';
if the target word 'one' is judged not to be in the word tail, judging whether the words representing the number exist before and/or after the target word 'one';
if it is judged that no word representing a number exists before and/or after the target word "one", judging whether the phrase formed by the target word "one" and the following word is a measure-word phrase;
if it is judged that the phrase formed by the target word "one" and the following word is a measure-word phrase, judging whether the tone of the following word is the fourth tone;
if it is judged that the tone of the following word is the fourth tone, defining the target mark of the target word "one" as the mark corresponding to the second tone;
if it is judged that the tone of the following word is not the fourth tone, defining the target mark of the target word "one" as the mark corresponding to the fourth tone;
if it is judged that the target word exists, judging, when the target word is "no", whether the tone of the word following the target word "no" is the fourth tone;
and if it is judged that the tone of the word following the target word "no" is the fourth tone, defining the target mark of the target word "no" as the mark corresponding to the second tone.
4. The artificial intelligence based sound cloning method according to claim 3, wherein if it is judged that the target word "one" is not at the end of the word, after the step of judging whether a word representing a number exists before and/or after the target word "one", the method further comprises:
if it is judged that a word representing a number exists before and/or after the target word "one", judging whether the phrase comprising the target word "one" is a preset phrase;
and if it is judged that the phrase comprising the target word "one" is a preset phrase, defining the target mark of the target word "one" as the mark corresponding to the preset phrase.
5. The artificial intelligence based sound cloning method of claim 2, wherein the step of determining phoneme information according to a preset rule, the phoneme information including a target annotation in the first annotation further comprises:
judging whether at least two consecutive labels in the word to be converted are third tones, wherein the labeling result of the initial consonants is not considered;
if it is judged that at least two consecutive labels in the word to be converted are third tones, modifying, when there are exactly two consecutive third-tone labels, the label of the first third tone into the label of the second tone;
if it is judged that at least two consecutive labels in the word to be converted are third tones, judging, when there are three consecutive third-tone labels, whether the corresponding three characters are numbers;
if it is judged that the corresponding three characters are numbers, modifying the labels of the first and second third tones into labels of the second tone;
if it is judged that the corresponding three characters are not numbers, combining the second character with its adjacent characters respectively, and judging whether the second character combines with the first character or with the third character;
if it is judged that the second character combines with the first character, modifying the labels of the first and second third tones into labels of the second tone;
and if it is judged that the second character combines with the third character, modifying the label of the second third tone into a label of the second tone.
6. The artificial intelligence based sound cloning method according to any one of claims 1 to 5, wherein the step of recombining the phrases and determining a pause time between the recombined phrases according to a user speaking speed comprises, before:
acquiring total time for reading out a preset text by a user and time of each mark in the preset text, wherein the marks in the preset text are used for spacing adjacent phrases;
calculating the average pronunciation time of a single word according to the total time, the time of each mark and the word number of the preset text;
clustering the time of each mark in the preset text according to a clustering algorithm, and determining a preset number of target marks according to a clustering result, wherein different target marks correspond to different pause times.
7. The artificial intelligence based sound cloning method of claim 6, wherein the step of recombining the phrases and determining a pause time between the recombined phrases according to a user speaking speed comprises:
acquiring a first number, namely the number of recombined phrases between commas serving as punctuation marks, and judging whether the first number is greater than or equal to a second number, wherein the second number is the number of target marks excluding the punctuation marks;
If yes, randomly and averagely inserting target marks representing different pause times among the phrases;
if not, sorting all the target marks from small to large by pause time, and sequentially selecting target marks equal in number to the first number;
randomly and averagely inserting each selected target mark among each phrase.
8. An artificial intelligence based sound cloning system for use in Chinese-language scenarios, the system comprising:
the regularization processing module is used for acquiring an original text, and regularizing the original text to obtain a first text;
the first splitting module is used for splitting the first text into a plurality of sentences to be converted according to a preset identifier, and respectively carrying out word segmentation on the sentences to be converted to obtain a plurality of words to be converted;
the labeling module is used for acquiring the pinyin of the word to be converted, labeling the pinyin of each word in the word to be converted according to four tones of the pinyin, and obtaining a first label of the pinyin of each word, wherein at least one first label exists in the pinyin of the word;
the second splitting module is used for splitting initials and finals in the pinyin of the characters, assigning a first mark of the pinyin of the characters to the finals, and marking the initials in the pinyin of the characters;
The first determining module is used for determining phoneme information according to a preset rule, wherein the phoneme information comprises target marks in the first marks;
the second determining module is used for recombining the phrases and determining the pause time among the recombined phrases according to the speaking speed of the user;
and the conversion module is used for converting the acoustic features into target waveforms according to the words and the corresponding phoneme information, and completing sound cloning according to the target waveforms.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the artificial intelligence based sound cloning method according to any one of claims 1-7.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the artificial intelligence based sound cloning method of any one of claims 1-7 when the program is executed.
CN202410145879.2A 2024-02-02 Sound cloning method and system based on artificial intelligence Active CN117672182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410145879.2A CN117672182B (en) 2024-02-02 Sound cloning method and system based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410145879.2A CN117672182B (en) 2024-02-02 Sound cloning method and system based on artificial intelligence

Publications (2)

Publication Number Publication Date
CN117672182A true CN117672182A (en) 2024-03-08
CN117672182B CN117672182B (en) 2024-06-07


Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20020021182A (en) * 2000-09-08 2002-03-20 류충구 Method and apparatus for inputting Chinese characters using information of tone
WO2020062680A1 (en) * 2018-09-30 2020-04-02 平安科技(深圳)有限公司 Waveform splicing method and apparatus based on double syllable mixing, and device, and storage medium
CN111611792A (en) * 2020-05-21 2020-09-01 全球能源互联网研究院有限公司 Entity error correction method and system for voice transcription text
CN111681635A (en) * 2020-05-12 2020-09-18 深圳市镜象科技有限公司 Method, apparatus, device and medium for real-time cloning of voice based on small sample
CN112002304A (en) * 2020-08-27 2020-11-27 上海添力网络科技有限公司 Speech synthesis method and device
CN113571037A (en) * 2021-07-02 2021-10-29 中国科学院计算技术研究所 Method and system for synthesizing Chinese braille voice
WO2022030732A1 (en) * 2020-08-03 2022-02-10 주식회사 딥브레인에이아이 Apparatus and method for preprocessing text
WO2022151931A1 (en) * 2021-01-13 2022-07-21 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, synthesis model training method and apparatus, medium, and device
WO2022151930A1 (en) * 2021-01-13 2022-07-21 北京有竹居网络技术有限公司 Speech synthesis method and apparatus, synthesis model training method and apparatus, and medium and device
CN116597809A (en) * 2023-04-28 2023-08-15 北京捷通华声科技股份有限公司 Multi-tone word disambiguation method, device, electronic equipment and readable storage medium
US20230317055A1 (en) * 2020-11-03 2023-10-05 Beijing Youzhuju Network Technology Co. Ltd. Method, apparatus, storage medium and electronic device for speech synthesis


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
刘梦媛; 杨鉴: "Design and Implementation of an HMM-based Burmese Speech Synthesis System", Journal of Yunnan University (Natural Science Edition), no. 01, 10 January 2020 (2020-01-10) *
郝东亮; 杨鸿武; 张策; 张帅; 郭立钊; 杨静波: "Label Generation Method for Chinese Statistical Parametric Speech Synthesis", Computer Engineering and Applications, no. 19, 1 October 2016 (2016-10-01) *

Similar Documents

Publication Publication Date Title
CN103714048B (en) Method and system for correcting text
CN109389968B (en) Waveform splicing method, device, equipment and storage medium based on double syllable mixing and lapping
CN108847241A (en) It is method, electronic equipment and the storage medium of text by meeting speech recognition
CN111145718A (en) Chinese mandarin character-voice conversion method based on self-attention mechanism
US6188977B1 (en) Natural language processing apparatus and method for converting word notation grammar description data
CN110459202B (en) Rhythm labeling method, device, equipment and medium
CN112352275A (en) Neural text-to-speech synthesis with multi-level textual information
CN110767213A (en) Rhythm prediction method and device
JPH11344990A (en) Method and device utilizing decision trees generating plural pronunciations with respect to spelled word and evaluating the same
CN112397056B (en) Voice evaluation method and computer storage medium
CN112466279B (en) Automatic correction method and device for spoken English pronunciation
CN112818089B (en) Text phonetic notation method, electronic equipment and storage medium
CN110415725A (en) Use the method and system of first language data assessment second language pronunciation quality
JP6941494B2 (en) End-to-end Japanese speech recognition model learning device and program
CN115101042A (en) Text processing method, device and equipment
CN112530402A (en) Voice synthesis method, voice synthesis device and intelligent equipment
CN109859746B (en) TTS-based voice recognition corpus generation method and system
CN117672182B (en) Sound cloning method and system based on artificial intelligence
CN112885351B (en) Dialect voice recognition method and device based on transfer learning
CN117672182A (en) Sound cloning method and system based on artificial intelligence
CN114242032A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN114708848A (en) Method and device for acquiring size of audio and video file
JP6998017B2 (en) Speech synthesis data generator, speech synthesis data generation method and speech synthesis system
CN113160793A (en) Speech synthesis method, device, equipment and storage medium based on low resource language
JP3378547B2 (en) Voice recognition method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant