CN116110370A - Speech synthesis system and related equipment based on man-machine speech interaction - Google Patents

Speech synthesis system and related equipment based on man-machine speech interaction

Info

Publication number
CN116110370A
Authority
CN
China
Prior art keywords
pronunciation
voice
text
module
target
Prior art date
Legal status
Pending
Application number
CN202310092201.8A
Other languages
Chinese (zh)
Inventor
孟廷
方昕
黄宜鑫
Current Assignee
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN202310092201.8A
Publication of CN116110370A
Legal status: Pending

Classifications

    • G10L13/10: Prosody rules derived from text; stress or intonation determination
    • G10L13/047: Architecture of speech synthesisers
    • G10L15/063: Training (creation of reference templates; adaptation to the characteristics of the speaker's voice)
    • G10L15/187: Phonemic context, e.g. pronunciation rules, phonotactical constraints or phoneme n-grams
    • G10L15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L15/26: Speech to text systems
    • G10L25/51: Speech or voice analysis specially adapted for comparison or discrimination
    • G10L2015/0633: Creating reference templates; clustering using lexical or orthographic knowledge sources
    • G10L2015/0638: Interactive procedures (training of speech recognition systems)
    • G10L2015/223: Execution procedure of a spoken command

Abstract

The application belongs to the technical field of speech synthesis, and provides a speech synthesis system, a terminal, and a computer readable storage medium based on human-machine speech interaction, which are used to solve the problem of poor speech synthesis quality in human-machine speech interaction in the prior art.

Description

Speech synthesis system and related equipment based on man-machine speech interaction
Technical Field
The present disclosure relates to the field of speech synthesis technologies, and in particular, to a speech synthesis system, a terminal, and a computer readable storage medium based on human-computer speech interaction.
Background
Speech is one of the most convenient and effective means for people to acquire and exchange information, and man-machine voice interaction refers to the process in which people and devices exchange information through natural speech. With the popularization of intelligent hardware in recent years, applications of man-machine voice interaction have developed rapidly and are widely used in scenarios such as medical care, customer service, education, smart home, mobile devices, and in-vehicle systems.
In man-machine voice interaction, the speech uttered by a user is first converted into text by speech recognition, and the information finally generated by the device must be synthesized into speech and broadcast so that it is conveyed to the user. Speech recognition is therefore the starting point of a man-machine interaction scenario, and speech synthesis is its end point.
In man-machine voice interaction, synthesizing speech for texts such as personal names and place names has always been one of the difficulties of speech synthesis: such texts are usually sparse in, or even absent from, the training corpus, yet users may well mention these entity words during interaction.
In the conventional technology, speech synthesis in man-machine voice interaction generally processes such texts independently, that is, the pronunciation of the text is determined by corresponding preset rules and the resulting speech is played to the user. However, because these texts are usually sparse in, or even absent from, the training corpus, the synthesized pronunciation may be inconsistent with the user's actual pronunciation, so the speech synthesis quality in man-machine voice interaction is poor and the interaction experience is degraded.
Disclosure of Invention
The application provides a speech synthesis system, a terminal, and a computer readable storage medium based on man-machine voice interaction, which can solve the technical problem of poor speech synthesis quality in man-machine voice interaction in the conventional technology.
In a first aspect, the present application provides a speech synthesis system based on man-machine voice interaction, including: a speech recognition module, configured to acquire speech input by a first user and perform speech recognition on the speech to obtain a recognition text and a pronunciation sequence corresponding to the recognition text; a pronunciation extraction module, configured to obtain, according to the pronunciation sequence, the pronunciation corresponding to a preset-type target text when the recognition text contains the preset-type target text; a text generation module, configured to acquire a target recognition text corresponding to a second user and generate a response text corresponding to the target recognition text, where the target recognition text is the text obtained by recognizing the speech input by the second user; and a speech synthesis module, configured to synthesize, using the pronunciation, the response speech corresponding to the response text when the response text contains the preset-type target text.
In a second aspect, the present application provides a terminal, which includes a memory and a processor, where the memory stores a computer program and the processor executes the computer program to run the speech synthesis system based on man-machine voice interaction described above.
In a third aspect, the present application provides a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to run the speech synthesis system based on man-machine voice interaction described above.
The present application provides a speech synthesis system, a terminal, and a computer readable storage medium based on man-machine voice interaction. In a man-machine voice interaction scenario, the system extracts the preset-type target text contained in the speech input by the first user together with its corresponding pronunciation, and, whenever that preset-type target text is involved in speech synthesis, performs the synthesis based on that pronunciation. The pronunciation of the preset-type target text contained in the first user's speech thus runs through the whole process of man-machine voice interaction and serves the speech synthesis of different users, so the pronunciation of the preset-type target text in the synthesized speech more truly reflects the users' actual pronunciation. Compared with the conventional technology, in which speech recognition and speech synthesis are treated as two independent links, the present application combines speech recognition and speech synthesis in the man-machine voice interaction scenario: the pronunciation of the preset-type target text in the user's speech is carried from speech recognition into speech synthesis, which ensures that the pronunciation in the synthesized response speech is consistent with the corresponding user's actual pronunciation. This also alleviates the problem that such target texts are usually sparse in, or even absent from, the training corpus, so the synthesized speech is more accurate, more realistic, and more vivid, and the effect of man-machine voice interaction can be improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a voice synthesis system based on man-machine voice interaction according to an embodiment of the present application;
fig. 2 is a schematic diagram of a pronunciation generation flow of a response text of a voice synthesis system based on man-machine voice interaction according to an embodiment of the present application;
FIG. 3 is another schematic diagram of a voice synthesis system based on human-computer voice interaction according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an example operation flow of a speech synthesis system based on human-computer speech interaction according to an embodiment of the present application;
fig. 5 is a schematic block diagram of a terminal provided in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
An embodiment of the present application provides a speech synthesis system based on man-machine voice interaction. The system can be applied to devices that perform man-machine voice interaction, including but not limited to terminals such as intelligent robots, smartphones, and smart watches, or servers such as cloud servers, and can be applied in scenarios such as medical care, customer service, education, smart home, mobile devices, and in-vehicle systems to provide users with a speech-based interaction process.
To solve the technical problem of poor speech synthesis quality in man-machine voice interaction in the conventional technology, the inventors propose, in the embodiments of the present application, a speech synthesis system based on man-machine voice interaction. The core idea is as follows: in a man-machine voice interaction scenario, the speech input by a first user is acquired; when speech recognition is performed on that speech, a data pair consisting of a preset-type target text and its corresponding pronunciation is obtained; when the response text corresponding to the speech of a second user contains the preset-type target text, the response speech corresponding to the response text is synthesized using that pronunciation, and the response speech is then broadcast to the user to carry out the man-machine voice interaction.
According to the embodiments of the present application, the pronunciation corresponding to the preset-type target text contained in the speech input by the first user is obtained, and, when the response text corresponding to the speech of the second user contains that preset-type target text, the response speech is synthesized using that pronunciation. The pronunciation provided by the first user thus dynamically optimizes the synthesized speech fed back to the corresponding users, so the synthesized speech more truly reflects the users' actual pronunciation of the preset-type target text. This addresses the problem that such target texts are usually sparse in, or even absent from, the training corpus, makes the synthesized speech more accurate, more realistic, and more vivid, and can improve the effect of man-machine voice interaction.
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following examples and features of the examples may be combined with each other to construct different implementations without conflict.
Referring to Fig. 1 and Fig. 2, Fig. 1 is a schematic diagram of the composition of a speech synthesis system based on man-machine voice interaction provided in an embodiment of the present application, and Fig. 2 is a schematic diagram of the pronunciation generation flow of a response text of the speech synthesis system based on man-machine voice interaction provided in an embodiment of the present application. As shown in Fig. 1, the system includes, but is not limited to, the following modules 101 to 104:
The voice recognition module 101 is configured to obtain a voice input by a first user, and perform voice recognition on the voice to obtain a recognition text and a pronunciation sequence corresponding to the recognition text.
In a man-machine voice interaction scenario, the interaction is generally initiated by the user through speech. In this case, the device that performs voice interaction with the user may acquire the speech input by the first user through a speech input device such as a microphone, and perform speech recognition on the speech to obtain a recognition text and a pronunciation sequence corresponding to the recognition text. Speech recognition, also called automatic speech recognition (ASR), is the technology of converting human speech into text.
During speech recognition, features are first extracted from the speech input by the first user, and the extracted features are then decoded by searching with an acoustic model, a pronunciation dictionary, and a language model to obtain the recognition text. In this process, the pronunciation sequence corresponding to the recognition text is collected and retained, so that both the recognition text and its pronunciation sequence are obtained. The pronunciation sequence is the ordered set of pronunciations of the recognition text, and includes, but is not limited to, an initial-and-final sequence consisting of initials and finals (for Chinese) and a phonetic-symbol sequence consisting of vowels and consonants (for English).
For example, when the speech input by the user is Chinese and the recognition text corresponding to the speech input by the first user is "dial Xiao Yue's phone number", the corresponding initial-and-final sequence may be "bo1da1xiao3yue4de3dian4hua4hao5ma3", yielding a data pair consisting of "dial Xiao Yue's phone number" and "bo1da1xiao3yue4de3dian4hua4hao5ma3". When the speech input by the user is English, the pronunciation sequence is a phonetic-symbol sequence consisting of vowels and consonants.
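The following is a minimal sketch of the data pair handed from speech recognition to pronunciation extraction, assuming a Python implementation; the class and field names are illustrative and not taken from the patent:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RecognitionResult:
    text: str                  # recognition text returned by ASR
    pronunciations: List[str]  # pronunciation sequence retained alongside the text

# Example pair for the utterance "dial Xiao Yue's phone number";
# the pinyin tokens mirror the initial-and-final sequence described above.
result = RecognitionResult(
    text="dial Xiao Yue's phone number",
    pronunciations=["bo1", "da1", "xiao3", "yue4", "de3",
                    "dian4", "hua4", "hao5", "ma3"],
)
```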
The pronunciation extraction module 102 is configured to obtain, according to the pronunciation sequence, the pronunciation corresponding to the preset-type target text when the recognition text contains the preset-type target text.
For example, when the recognition text contains a preset-type target text, the pronunciation corresponding to that target text is obtained from the pronunciation sequence. The preset-type target text is text designated as a target according to the business requirements of speech synthesis, typically text that is sparse in, or even absent from, the training corpus, and includes, but is not limited to, entity words. Entity words are the objects of named entity recognition (NER), also called "proper-name recognition", which identifies entities with specific meanings in the recognition text, mainly including personal names, place names, proper nouns, organization names, and the like.
Whether the recognition text contains the preset-type target text can be judged by content matching based on rule templates, which mainly matches the recognition text against regular matching items; alternatively, it can be judged with a preset text recognition model, which is first trained on training samples and then used to decide whether the recognition text contains the preset-type target text.
For example, a content-matching method based on rule templates is used: when the corresponding content is matched, it is determined that the recognition text contains the preset-type target text. Suppose the rule base contains the rule "dial *'s phone number". When the recognition text corresponding to the speech input by the first user is "dial Xiao Yue's phone number", the rule recognizes that "Xiao Yue" contained in the recognition text is the preset-type target text. Meanwhile, based on the example above, the initials and finals corresponding to "Xiao Yue" are "xiao3yue4", that is, the pronunciation corresponding to "Xiao Yue" is "xiao3yue4".
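A minimal sketch of this rule-template extraction, again in Python with illustrative names; the rule pattern and the word-to-pinyin alignment are assumptions, not details disclosed in the patent:

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical rule template: the "*" slot of the rule base becomes a capture group.
DIAL_RULE = re.compile(r"dial (?P<target>.+?)'s phone number")

def extract_target_pronunciation(text: str,
                                 word_pinyin: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (target text, pronunciation) when the recognition text matches the rule.

    `word_pinyin` maps a word of the recognition text to the pinyin the ASR
    decoder actually used for it (the retained pronunciation sequence).
    """
    match = DIAL_RULE.search(text)
    if not match:
        return None
    target = match.group("target")     # e.g. "Xiao Yue"
    pinyin = word_pinyin.get(target)   # e.g. "xiao3yue4"
    return (target, pinyin) if pinyin else None

pair = extract_target_pronunciation(
    "dial Xiao Yue's phone number",
    {"Xiao Yue": "xiao3yue4"},
)
# pair == ("Xiao Yue", "xiao3yue4")
```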
The text generation module 103 is configured to obtain a target recognition text corresponding to a second user, and generate a response text corresponding to the target recognition text according to the target recognition text, where the target recognition text describes a text obtained by recognizing a voice input by the second user.
In a man-machine voice interaction scenario, the speech input by the second user is acquired and recognized to obtain the corresponding target recognition text, that is, the target recognition text corresponding to the second user, and a corresponding response text is generated from it. Generally, the determined action is turned into the response text replied to the user through natural language generation (NLG). Semantic understanding may be performed on the target recognition text and the response text generated from the result of that understanding, or the response text corresponding to the target recognition text may be generated through content matching based on rule templates. The response text is the text that responds to the speech input by the second user.
It should be noted that "first" and "second" in "first user" and "second user" are merely used to distinguish the users and do not require them to be different people.
The second user may be one of the following: 1) the same user as the first user, in which case the target recognition text and the recognition text may be the same text; 2) a user different from the first user, in which case the target recognition text and the recognition text are not the same text, so that the pronunciation corresponding to the preset-type target text takes effect for all users.
For example, when the response text corresponding to the target recognition text is generated by content matching based on rule templates, a rule template may be understood as a regular matching item: when the target recognition text matches a rule, the corresponding action is performed.
For example, suppose the rule base contains the template shown in Table 1 below:
Table 1
Rule                      Action
Dial *'s phone number     Calling * for you
As shown in Table 1, "*" in Table 1 matches any text. According to this rule template, when the target recognition text matches "Dial *'s phone number", the corresponding response text is "Calling * for you". Following the example above, when "*" is matched to "Xiao Yue", that is, when the target recognition text is "dial Xiao Yue's phone number", the corresponding response text may be "Calling Xiao Yue for you".
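A minimal sketch of this template-based response generation, assuming a Python implementation; the rule list and function name are illustrative only:

```python
import re
from typing import Optional

# Hypothetical rendering of the Table 1 template: the slot captured from the
# target recognition text is carried into the response text unchanged.
RULES = [
    {"pattern": r"dial (?P<slot>.+?)'s phone number",
     "response": "Calling {slot} for you"},
]

def generate_response(target_recognition_text: str) -> Optional[str]:
    for rule in RULES:
        match = re.search(rule["pattern"], target_recognition_text)
        if match:
            return rule["response"].format(slot=match.group("slot"))
    return None

print(generate_response("dial Xiao Yue's phone number"))
# -> Calling Xiao Yue for you
```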
And the voice synthesis module 104 is configured to synthesize a response voice corresponding to the response text by using the pronunciation when the response text includes the target text of the preset type.
The speech synthesis module synthesizes the response speech corresponding to the response text by using the pronunciation as the pronunciation of the corresponding content in the response speech. Generally, the response text is converted into the corresponding speech through speech synthesis, also called text-to-speech (TTS). The response speech is the synthesized speech that responds to the speech input by the second user in the man-machine voice interaction scenario. Further, the response speech can be played to the second user to carry out the interaction.
Further, the speech synthesis may include a synthesis front end and a synthesis back end, which denote different stages, units, or components of speech synthesis rather than different devices. The synthesis front end converts the input text into pronunciation labels using the pronunciation dictionary after text normalization, word segmentation, and other processing; for example, Chinese text is converted into a sequence of initials and finals. The synthesis back end generally consists of an acoustic model and a vocoder: the acoustic model converts the pronunciation labels into an intermediate speech representation (such as a mel spectrogram or filterbank features), and the vocoder converts the intermediate representation into speech waveform samples, yielding the corresponding response speech.
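The following is a minimal sketch of how the synthesis front end could give priority to the pronunciation extracted from the first user's speech, assuming Python; the function and dictionary names are illustrative, and the back end is only indicated in a comment:

```python
from typing import Dict, List

def front_end(response_text: str,
              base_lexicon: Dict[str, str],
              user_pronunciations: Dict[str, str]) -> List[str]:
    """Convert the response text into pronunciation labels.

    Words present in `user_pronunciations` (extracted from the first user's
    speech) override the default lexicon reading; all other words fall back to
    the base pronunciation dictionary.
    """
    labels = []
    for word in response_text.split():
        if word in user_pronunciations:
            labels.append(user_pronunciations[word])  # e.g. "xiao3yue4"
        elif word in base_lexicon:
            labels.append(base_lexicon[word])         # e.g. "xiao3le4"
        else:
            labels.append("<unk>")                    # placeholder for unknown words
    return labels

# The synthesis back end (acoustic model + vocoder) would then map these labels
# to an intermediate representation such as a mel spectrogram and finally to
# waveform samples.
```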
Referring to Fig. 2, in this example, before the user's input speech is taken into account, the pronunciation of the determined response text may be as follows:
<now>[zheng4zai4] <for>[wei2] <you>[nin2] <calling>[hu2jiao4] <Xiao Yue>[xiao3le4];
With continued reference to Fig. 2, after the user's input speech is taken into account, the pronunciation of the determined response text is as follows:
<now>[zheng4zai4] <for>[wei2] <you>[nin2] <calling>[hu2jiao4] <Xiao Yue>[xiao3yue4];
therefore, the actual pronunciation of the preset type target text by the user can be reflected by combining the pronunciation of the response text after the user inputs the voice, and particularly, the situation that the corresponding preset type target text is generally sparse or even does not appear in the training corpus can be faced, so that the synthesized voice is more accurate, more real and more vivid, and the man-machine voice interaction effect is improved.
According to the embodiment of the application, the preset-type target text and its corresponding pronunciation are extracted in the man-machine voice interaction scenario, and whenever the preset-type target text is involved in speech synthesis, the synthesis is performed based on that pronunciation. The pronunciation of the preset-type target text contained in the speech input by the first user thus runs through the whole process of man-machine voice interaction and serves the speech synthesis of different users, so the actual pronunciation of the preset-type target text is reflected more truly in the synthesized speech. Compared with the conventional technology, which treats speech recognition and speech synthesis as two independent links, the embodiment of the application carries the pronunciation of the preset-type target text in the user's speech from speech recognition into speech synthesis, which ensures that the pronunciation in the synthesized response speech is consistent with the corresponding user's actual pronunciation. It also addresses the problem that such target texts are usually sparse in, or even absent from, the training corpus, making the synthesized speech more accurate, more realistic, and more vivid, and improving the effect of man-machine voice interaction.
In an embodiment, the speech synthesis system further comprises:
and the pronunciation library module is used for storing pronunciation corresponding to the target text of the preset type.
The pronunciation library module is connected to the pronunciation extraction module and to the speech synthesis module, and stores the pronunciations corresponding to the preset-type target texts extracted from the speech input by the first user. It can continuously accumulate pronunciations corresponding to different preset-type target texts, that is, pronunciations of texts that are usually sparse in, or even absent from, the training corpus. The pronunciation library may be an entity-word pronunciation library used to store the pronunciations corresponding to the extracted entity words.
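A minimal sketch of such a pronunciation library, assuming Python; the class and method names are illustrative, and the occurrence counts anticipate the screening described later:

```python
from collections import defaultdict
from typing import Dict

class PronunciationLibrary:
    """Accumulates 'word -> pronunciation -> occurrence count' data pairs."""

    def __init__(self) -> None:
        self._counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))

    def add(self, word: str, pronunciation: str) -> None:
        self._counts[word][pronunciation] += 1

    def counts_for(self, word: str) -> Dict[str, int]:
        return dict(self._counts[word])

library = PronunciationLibrary()
library.add("Xiao Yue", "xiao3yue4")  # pronunciation extracted from the first user's speech
```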
Referring to Fig. 3, which is another schematic diagram of a speech synthesis system based on man-machine voice interaction according to an embodiment of the present application: in this example, the pronunciation library module 105 is connected to the pronunciation extraction module 102 to store the pronunciations of the preset-type target texts extracted from the speech input by the first user, and is connected to the speech synthesis module 104 through the pronunciation dictionary module 106, so that the pronunciations in the pronunciation library module 105 are applied in the synthesis process of the speech synthesis module 104. The user's actual pronunciation of the preset-type target text is thus fed back into speech synthesis, making the pronunciation in the synthesized speech more accurate and realistic and improving the effect of man-machine voice interaction.
According to the embodiment of the application, a pronunciation library is provided to store the pronunciations corresponding to the preset-type target texts, and speech synthesis is performed based on the pronunciations in the library. This further improves the accuracy and authenticity of the pronunciation of different preset-type target texts in the synthesized speech, making it more accurate, more realistic, and more vivid, and can further improve the effect of man-machine voice interaction.
In an embodiment, the speech synthesis system further comprises:
a pronunciation dictionary module for storing mappings from individual words to phonemes;
and the pronunciation screening module is used for screening out the pronunciation meeting the preset pronunciation screening condition from the pronunciation contained in the pronunciation library module and adding the screened pronunciation to the pronunciation dictionary module.
The speech synthesis system further includes a pronunciation dictionary module and a pronunciation screening module, where the pronunciation screening module is connected to the pronunciation dictionary module and to the pronunciation library module. The pronunciation dictionary (lexicon) stores the mapping from individual words to phonemes and links the acoustic model and the language model; for Chinese, for example, the pronunciation dictionary contains the initials and finals corresponding to Chinese words. The pronunciation screening module selects, from the pronunciations contained in the pronunciation library module, those that satisfy preset screening conditions and adds them to the pronunciation dictionary module. In this way, the pronunciations of preset-type target texts extracted from users' speech become general pronunciation candidates that can be used in the man-machine voice interaction scenarios of different users, that is, they take effect for all users who perform man-machine voice interaction. The preset screening conditions include, but are not limited to, the usage frequency of a pronunciation, the number of times a pronunciation occurs, the usage scenario of a pronunciation, and the source of a pronunciation.
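A minimal sketch of the pronunciation dictionary and one possible screening condition (an occurrence-count threshold), assuming Python; the threshold value and names are illustrative assumptions:

```python
from typing import Dict

class PronunciationDictionary:
    """Word-to-phoneme mapping (lexicon) used by the synthesis front end."""

    def __init__(self) -> None:
        # Default reading from the generic lexicon; "xiao3le4" is the example's
        # default polyphone reading before any user pronunciation is learned.
        self.entries: Dict[str, str] = {"Xiao Yue": "xiao3le4"}

    def add(self, word: str, pronunciation: str) -> None:
        self.entries[word] = pronunciation

def screen_and_add(counts: Dict[str, Dict[str, int]],
                   dictionary: PronunciationDictionary,
                   min_count: int = 3) -> None:
    """Add a library pronunciation to the dictionary once it has been observed
    at least `min_count` times (one possible preset screening condition)."""
    for word, per_pronunciation in counts.items():
        for pronunciation, count in per_pronunciation.items():
            if count >= min_count:
                dictionary.add(word, pronunciation)
```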
Referring to Fig. 3 and Fig. 4, Fig. 4 is a schematic diagram of an example operation flow of the speech synthesis system based on man-machine voice interaction according to an embodiment of the present application. As shown in Fig. 3, the system includes a pronunciation library module 105 and a pronunciation dictionary module 106, with a pronunciation screening module arranged between them and connected to both; the screening module selects the pronunciations that satisfy the preset screening conditions from the pronunciation library module 105 and adds them to the pronunciation dictionary module 106. As shown in Fig. 4, in this example the user's input speech is recognized, the pronunciation of the recognized preset-type target text is added to the pronunciation library module, the pronunciations that satisfy the preset screening conditions are then added to the pronunciation dictionary module, and when the pronunciation of the response text is generated, the corresponding pronunciation is obtained from the pronunciation dictionary module. Following the example above, when the received user speech is "dial Xiao Yue's phone number (bo1da1xiao3yue4de3dian4hua4hao5ma3)", "Xiao Yue (xiao3yue4)" is added to the pronunciation library module and then to the pronunciation dictionary module; "xiao3yue4" is retrieved from the dictionary, and the pronunciation corresponding to the response text is generated as follows:
Zheng4zai4wei2nin2hu2jiao4[xiao3yue4];
The pronunciation "Zheng4zai4wei2nin2hu2jiao4[xiao3yue4]" is then synthesized into the response speech, which can be broadcast to the user to realize man-machine voice interaction.
According to the embodiment of the application, the pronunciation screening module selects the pronunciations that satisfy the preset screening conditions from the pronunciation library module and adds them to the pronunciation dictionary module, so that the pronunciations of preset-type target texts become general pronunciation candidates used in the man-machine voice interaction scenarios of different users. They thus take effect for all users, making the synthesized speech of different users' interactions more accurate, more realistic, and more vivid, and improving the interaction effect for different users.
Further, the pronunciation screening module includes:
the pronunciation screening condition unit is used for counting the occurrence frequency corresponding to the occurrence of the target pronunciation contained in the pronunciation library module, or counting the occurrence frequency corresponding to the occurrence of the target pronunciation in the pronunciation library module;
and the pronunciation adding unit is used for adding the pronunciation screened by the pronunciation screening condition unit to the pronunciation dictionary module.
The pronunciation screening module includes a pronunciation screening condition unit and a pronunciation adding unit. The screening condition unit counts the frequency with which a target pronunciation occurs in the pronunciation library module, or the number of times it occurs, and the pronunciation adding unit adds the pronunciations selected by the screening condition unit to the pronunciation dictionary module; for example, the target pronunciation is added when its occurrence frequency is greater than or equal to a preset frequency threshold, or when its occurrence count is greater than or equal to a preset count threshold.
According to the embodiment of the application, the pronunciations in the pronunciation library module are screened by the screening condition unit, and only the selected pronunciations are added to the pronunciation dictionary module. This filters what enters the dictionary, avoids redundancy and interference, and ensures that the pronunciations in the dictionary are applicable and practical. By counting and analyzing the preset-type target texts and their pronunciations contained in users' speech, the pronunciation dictionary can be modified dynamically, so the pronunciation of the preset-type target text involved in speech synthesis is dynamically optimized and more truly reflects the users' actual pronunciation. This addresses the problem that such target texts are usually sparse in, or even absent from, the training corpus, makes the synthesized speech more accurate, more realistic, and more vivid, and improves the effect of man-machine voice interaction.
Further, the pronunciation screening module includes:
the frequency statistics subunit is used for counting the occurrence frequency corresponding to the occurrence of the target pronunciation contained in the pronunciation library module;
the frequency counting subunit is used for counting the occurrence frequency corresponding to the occurrence of the target pronunciation in the pronunciation library module;
and the pronunciation adding unit is used for adding the target pronunciation to the pronunciation dictionary module under the condition that the occurrence frequency is greater than or equal to a preset occurrence frequency threshold value and the occurrence frequency is greater than or equal to a preset occurrence frequency threshold value.
The pronunciation screening module includes a frequency statistics subunit, a count statistics subunit, and a pronunciation adding unit. The frequency statistics subunit counts the frequency with which a target pronunciation occurs in the pronunciation library module, the count statistics subunit counts the number of times the target pronunciation occurs, and the pronunciation adding unit adds the target pronunciation to the pronunciation dictionary module only when the occurrence frequency is greater than or equal to the preset frequency threshold and the occurrence count is greater than or equal to the preset count threshold. In this way, when a preset-type target text has several candidate pronunciations, for example because different users pronounce it differently or because it contains polyphonic characters, a single pronunciation can be screened out for prediction, so that the subsequent synthesized speech is as accurate, realistic, and natural as possible. For example, for "Xiao Yue" in the example above, the second character of the name is polyphonic, so both "xiao3le4" and "xiao3yue4" exist as candidate readings; [xiao3yue4] is selected as its pronunciation based on the speech input by the user.
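A minimal sketch of this combined screening, assuming Python; the threshold values are illustrative, and the polyphone example mirrors the one above:

```python
from typing import Dict, Optional

def select_pronunciation(counts: Dict[str, int],
                         freq_threshold: float = 0.6,
                         count_threshold: int = 5) -> Optional[str]:
    """Pick a single pronunciation for a word that has several candidate readings.

    The most frequent candidate is kept only if both its share of all
    observations (frequency) and its absolute occurrence count reach the preset
    thresholds.
    """
    total = sum(counts.values())
    if total == 0:
        return None
    best_pronunciation, best_count = max(counts.items(), key=lambda kv: kv[1])
    frequency = best_count / total
    if frequency >= freq_threshold and best_count >= count_threshold:
        return best_pronunciation
    return None

# Polyphone example: users pronounce the name as "xiao3yue4" far more often
# than the default reading "xiao3le4", so "xiao3yue4" is selected.
print(select_pronunciation({"xiao3yue4": 9, "xiao3le4": 1}))  # -> xiao3yue4
```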
Referring to Table 2 below: the pronunciations extracted by the pronunciation extraction module are added to the pronunciation library module at least in the form of data pairs, that is, the pronunciations in the library are retained at least as "word-pronunciation" pairs. When the library contains word 1, word 2, and word 3, the occurrence count of pronunciation 1 of word 1 is recorded as n11, that of pronunciation 2 of word 1 as n12, that of pronunciation 3 of word 1 as n13, that of pronunciation 1 of word 2 as n21, and that of pronunciation 1 of word 3 as n31. Whether an entry is added to the pronunciation dictionary module can then be judged from the statistics of each word, and when multiple pronunciations exist, a screening rule matching the actual business requirements is needed to ensure that a single pronunciation is selected for subsequent speech synthesis.
Table 2
Word       Pronunciation       Occurrence count
Word 1     Pronunciation 1     n11
Word 1     Pronunciation 2     n12
Word 1     Pronunciation 3     n13
Word 2     Pronunciation 1     n21
Word 3     Pronunciation 1     n31
According to the embodiment of the application, the frequency statistics subunit counts the frequency with which the target pronunciation occurs in the pronunciation library module, the count statistics subunit counts the number of times it occurs, and the pronunciation adding unit adds the target pronunciation to the pronunciation dictionary module only when the occurrence frequency is greater than or equal to the preset frequency threshold and the occurrence count is greater than or equal to the preset count threshold. This realizes stricter screening and filtering of the pronunciations added to the pronunciation dictionary module, further avoids redundancy and interference in the dictionary, and ensures that the pronunciations in the dictionary are more likely to be applicable and practical. By counting and analyzing the preset-type target texts and their pronunciations contained in the speech input by users, the pronunciation dictionary can be modified dynamically, so the pronunciation of the preset-type target text involved in speech synthesis is dynamically optimized and reflects the users' actual pronunciation more truly. This addresses the problem that such target texts are usually sparse in, or even absent from, the training corpus, makes the synthesized speech more accurate, more realistic, and more vivid, and can further improve the effect of man-machine voice interaction.
In an embodiment, the pronunciation extraction module 102 is further configured to obtain voice feature information corresponding to the pronunciation;
the speech synthesis module 104 is further configured to synthesize, when the response text includes the target text of the preset type, a response speech corresponding to the response text by using the speech feature information and the pronunciation.
For example, when the recognition text contains the preset-type target text, not only the pronunciation corresponding to the target text but also speech feature information such as speaking rate, volume, and pauses corresponding to that pronunciation can be obtained from the pronunciation sequence. This feature information serves as additional information attached to the pronunciation and is used when generating the synthesized speech, making the synthesized speech richer and more realistic.
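A minimal sketch of a pronunciation entry that carries such speech feature information alongside the pronunciation, assuming Python; the field names and units are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PronunciationEntry:
    word: str
    pronunciation: str
    # Optional speech feature information attached to the pronunciation.
    speaking_rate: Optional[float] = None  # e.g. syllables per second
    volume: Optional[float] = None         # e.g. relative loudness, 0.0-1.0
    pause_after: Optional[float] = None    # e.g. pause length in seconds

entry = PronunciationEntry(
    word="Xiao Yue",
    pronunciation="xiao3yue4",
    speaking_rate=3.5,
    volume=0.8,
    pause_after=0.2,
)
```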
According to the embodiment of the application, the pronunciation extraction module obtains the speech feature information corresponding to the pronunciation, and, when the response text contains the preset-type target text, the response speech corresponding to the response text is synthesized using both the speech feature information and the pronunciation, so the synthesized speech is even more accurate, realistic, and vivid, and the effect of man-machine voice interaction is further improved.
It should be noted that, the voice synthesis system based on man-machine voice interaction described in the foregoing embodiments may recombine the technical features included in the different embodiments as needed to obtain a combined implementation, which is within the scope of protection claimed in the present application.
Meanwhile, the division and connection modes of the parts in the voice synthesis system based on the man-machine voice interaction are only used for illustration, and in other embodiments, the voice synthesis system based on the man-machine voice interaction can be divided into different parts according to the needs, and different connection sequences and modes can be adopted for the parts in the voice synthesis system based on the man-machine voice interaction so as to complete all or part of functions of the voice synthesis system based on the man-machine voice interaction.
The above-described human-machine-speech-interaction-based speech synthesis system may be implemented in the form of a computer program which can be run on a terminal as shown in fig. 5.
Referring to fig. 5, fig. 5 is a schematic block diagram of a terminal according to an embodiment of the present application. The terminal 500 may be a terminal such as a smart phone, a tablet computer, a notebook computer, or a desktop computer, or may be a component or a part of other devices.
Referring to fig. 5, the terminal 500 includes a processor 502, a memory, and a network interface 505, which are connected through a system bus 501, wherein the memory may include a non-volatile storage medium 503 and an internal memory 504, which may also be a volatile storage medium.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032, when executed, causes the processor 502 to run the speech synthesis system based on man-machine voice interaction described above.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall terminal 500.
The internal memory 504 provides an environment for the execution of the computer program 5032 in the non-volatile storage medium 503; the computer program 5032, when executed by the processor 502, causes the processor 502 to run the speech synthesis system based on man-machine voice interaction described above.
The network interface 505 is used for network communication with other devices. It will be appreciated by those skilled in the art that the structure shown in fig. 5 is merely a block diagram of a portion of the structure associated with the present application and is not limiting of the terminal 500 to which the present application is applied, and that a particular terminal 500 may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components. For example, in some embodiments, the terminal may include only a memory and a processor, and in such embodiments, the structure and functions of the memory and the processor are consistent with those of the embodiment shown in fig. 5, which is not described herein.
Wherein the processor 502 is configured to run a computer program 5032 stored in a memory to implement a human-machine speech interaction based speech synthesis system as described above.
It should be appreciated that in embodiments of the present application, the processor 502 may be a central processing unit (Central Processing Unit, CPU), the processor 502 may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSPs), application specific integrated circuits (Application Specific Integrated Circuit, ASICs), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGAs) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. Wherein the general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will be appreciated by those skilled in the art that all or part of the flow of the method of the above embodiments may be implemented by a computer program, which may be stored on a computer readable storage medium. The computer program is executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present application also provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores a computer program which, when executed by a processor, causes the processor to run the speech synthesis system based on man-machine voice interaction described in the above embodiments; likewise, a computer program product which, when run on a computer, causes the computer to run that system.
The computer readable storage medium may be an internal storage unit of the aforementioned device, such as a hard disk or a memory of the device. The computer readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the device. Further, the computer readable storage medium may also include both internal storage units and external storage devices of the device.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the apparatus, system and module described above may refer to corresponding procedures in the foregoing embodiments, which are not repeated herein.
The storage medium is a physical, non-transitory storage medium, and may be, for example, a U-disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
Those of ordinary skill in the art will appreciate that the elements and algorithms of the examples described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the elements and algorithms of the examples have been described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed system may be implemented in other ways. For example, the system embodiments described above are merely illustrative. For example, each module is divided into only one logic function, and there may be another division manner in actual implementation. For example, multiple modules or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The system of the embodiment of the application can sequentially adjust, combine and prune according to actual needs. The modules in the system of the embodiment of the application can be combined, divided and pruned according to actual needs. In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit may be stored in a storage medium if implemented in the form of a software functional unit and sold or used as a stand-alone product. Based on such understanding, the technical solution of the present application is essentially or a part contributing to the prior art, or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing an electronic device (which may be a personal computer, a terminal, a network device, or the like) to perform all or part of the steps of the method described in the embodiments of the present application.
While the application has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and equivalent substitutions may be made without departing from the scope of the application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A speech synthesis system based on man-machine speech interaction, comprising:
a speech recognition module, configured to acquire speech input by a first user and perform speech recognition on the speech to obtain a recognition text and a pronunciation sequence corresponding to the recognition text;
a pronunciation extraction module, configured to, when the recognition text contains a target text of a preset type, obtain a pronunciation corresponding to the target text of the preset type according to the pronunciation sequence;
a text generation module, configured to obtain a target recognition text corresponding to a second user and generate a response text corresponding to the target recognition text, wherein the target recognition text is a text obtained by recognizing speech input by the second user; and
a speech synthesis module, configured to, when the response text contains the target text of the preset type, synthesize a response speech corresponding to the response text using the pronunciation.
2. The speech synthesis system based on man-machine speech interaction according to claim 1, further comprising:
a pronunciation library module, configured to store the pronunciation corresponding to the target text of the preset type.
3. The speech synthesis system based on man-machine speech interaction according to claim 2, further comprising:
a pronunciation dictionary module, configured to store mappings from individual words to phonemes; and
a pronunciation screening module, configured to screen out, from the pronunciations contained in the pronunciation library module, pronunciations that meet a preset pronunciation screening condition, and to add the screened-out pronunciations to the pronunciation dictionary module.
4. The speech synthesis system based on man-machine speech interaction according to claim 3, wherein the pronunciation screening module comprises:
a pronunciation screening condition unit, configured to count the number of occurrences of a target pronunciation contained in the pronunciation library module, or to count the occurrence frequency of the target pronunciation in the pronunciation library module; and
a pronunciation adding unit, configured to add the pronunciation screened out by the pronunciation screening condition unit to the pronunciation dictionary module.
5. The speech synthesis system based on man-machine speech interaction according to claim 3, wherein the pronunciation screening module comprises:
a count statistics subunit, configured to count the number of occurrences of a target pronunciation contained in the pronunciation library module;
a frequency statistics subunit, configured to count the occurrence frequency of the target pronunciation in the pronunciation library module; and
a pronunciation adding unit, configured to add the target pronunciation to the pronunciation dictionary module when the number of occurrences is greater than or equal to a preset occurrence-count threshold and the occurrence frequency is greater than or equal to a preset occurrence-frequency threshold.
6. The speech synthesis system based on man-machine speech interaction according to claim 1, wherein the pronunciation extraction module is further configured to obtain speech feature information corresponding to the pronunciation; and
the speech synthesis module is further configured to, when the response text contains the target text of the preset type, synthesize the response speech corresponding to the response text using the speech feature information and the pronunciation.
7. The speech synthesis system based on man-machine speech interaction according to claim 1, wherein the pronunciation sequence comprises at least one of: an initial-consonant-and-vowel sequence and a phonetic symbol sequence.
8. The speech synthesis system based on man-machine speech interaction according to any one of claims 1-7, wherein the target text of the preset type comprises an entity word.
9. A terminal, comprising a memory and a processor coupled to the memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the computer program to run the speech synthesis system based on man-machine speech interaction according to any one of claims 1-8.
10. A computer-readable storage medium, wherein the storage medium stores a computer program which, when executed by a processor, implements the speech synthesis system based on man-machine speech interaction according to any one of claims 1-8.
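
The claims above describe the system purely in functional terms. As an informal illustration only, the following Python sketch shows one way the pronunciation library (claim 2), the count- and frequency-based screening of claims 3-5, and the end-to-end flow of claim 1 could fit together. All names and values here (PronunciationLibrary, asr, nlg, tts, extract_pronunciation, the threshold defaults) are hypothetical and are not part of the patent disclosure; the asr, nlg, and tts callables stand in for whatever recognition, response-generation, and synthesis engines are actually used.

```python
from collections import defaultdict


class PronunciationLibrary:
    """Hypothetical store of pronunciations observed for preset-type target texts."""

    def __init__(self):
        self._entries = defaultdict(list)  # word -> list of observed pronunciations

    def add(self, word, pronunciation):
        self._entries[word].append(tuple(pronunciation))

    def lookup(self, word):
        prons = self._entries.get(word)
        return prons[-1] if prons else None  # reuse the most recently observed pronunciation

    def stats(self, word, pronunciation):
        prons = self._entries.get(word, [])
        count = prons.count(tuple(pronunciation))           # number of occurrences
        freq = count / len(prons) if prons else 0.0         # occurrence frequency
        return count, freq


class PronunciationScreener:
    """Promotes a pronunciation to the dictionary once both thresholds are met (cf. claim 5)."""

    def __init__(self, library, dictionary, count_threshold=3, freq_threshold=0.5):
        self.library = library
        self.dictionary = dictionary  # word -> phoneme sequence (cf. claim 3)
        self.count_threshold = count_threshold
        self.freq_threshold = freq_threshold

    def screen(self, word, pronunciation):
        count, freq = self.library.stats(word, pronunciation)
        if count >= self.count_threshold and freq >= self.freq_threshold:
            self.dictionary[word] = tuple(pronunciation)


def extract_pronunciation(word, text, pron_seq):
    # Naive alignment for illustration: assumes one pronunciation unit per character of the text.
    start = text.index(word)
    return pron_seq[start:start + len(word)]


def respond(user_audio, asr, nlg, tts, library, screener, entity_words):
    """Sketch of the claim-1 flow: recognize the user's speech, capture how the user
    pronounced any entity word, and reuse that pronunciation in the synthesized reply."""
    text, pron_seq = asr(user_audio)                 # recognition text + pronunciation sequence
    for word in entity_words:
        if word in text:
            pron = extract_pronunciation(word, text, pron_seq)
            library.add(word, pron)
            screener.screen(word, pron)
    reply = nlg(text)                                # generate the response text
    overrides = {w: library.lookup(w)
                 for w in entity_words if w in reply and library.lookup(w)}
    return tts(reply, pronunciation_overrides=overrides)
```

In this sketch a pronunciation is promoted to the dictionary only after it has been heard often enough in absolute count and as a share of all observations for that word, mirroring the twin thresholds recited in claim 5.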
CN202310092201.8A 2023-01-17 2023-01-17 Speech synthesis system and related equipment based on man-machine speech interaction Pending CN116110370A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310092201.8A CN116110370A (en) 2023-01-17 2023-01-17 Speech synthesis system and related equipment based on man-machine speech interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310092201.8A CN116110370A (en) 2023-01-17 2023-01-17 Speech synthesis system and related equipment based on man-machine speech interaction

Publications (1)

Publication Number Publication Date
CN116110370A true CN116110370A (en) 2023-05-12

Family

ID=86255727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310092201.8A Pending CN116110370A (en) 2023-01-17 2023-01-17 Speech synthesis system and related equipment based on man-machine speech interaction

Country Status (1)

Country Link
CN (1) CN116110370A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117116267A (en) * 2023-10-24 2023-11-24 科大讯飞股份有限公司 Speech recognition method and device, electronic equipment and storage medium
CN117116267B (en) * 2023-10-24 2024-02-13 科大讯飞股份有限公司 Speech recognition method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
WO2021012734A1 (en) Audio separation method and apparatus, electronic device and computer-readable storage medium
US8510103B2 (en) System and method for voice recognition
CN1655235B (en) Automatic identification of telephone callers based on voice characteristics
US20080294433A1 (en) Automatic Text-Speech Mapping Tool
CN111402862B (en) Speech recognition method, device, storage medium and equipment
CN108847241A (en) It is method, electronic equipment and the storage medium of text by meeting speech recognition
CN110853615B (en) Data processing method, device and storage medium
JP2018522303A (en) Account addition method, terminal, server, and computer storage medium
CN109801630A (en) Digital conversion method, device, computer equipment and the storage medium of speech recognition
CN110335608B (en) Voiceprint verification method, voiceprint verification device, voiceprint verification equipment and storage medium
CN111192586B (en) Speech recognition method and device, electronic equipment and storage medium
CN112015872A (en) Question recognition method and device
CN108628819A (en) Treating method and apparatus, the device for processing
CN116110370A (en) Speech synthesis system and related equipment based on man-machine speech interaction
JP6599828B2 (en) Sound processing method, sound processing apparatus, and program
CN111062221A (en) Data processing method, data processing device, electronic equipment and storage medium
CN110111778A (en) A kind of method of speech processing, device, storage medium and electronic equipment
CN112133285B (en) Speech recognition method, device, storage medium and electronic equipment
CN104900226A (en) Information processing method and device
CN114360514A (en) Speech recognition method, apparatus, device, medium, and product
CN111640423A (en) Word boundary estimation method and device and electronic equipment
CN111489742B (en) Acoustic model training method, voice recognition device and electronic equipment
EP3718107B1 (en) Speech signal processing and evaluation
CN113724690A (en) PPG feature output method, target audio output method and device
CN113744718A (en) Voice text output method and device, storage medium and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination