CN113643684B - Speech synthesis method, device, electronic equipment and storage medium

Speech synthesis method, device, electronic equipment and storage medium

Info

Publication number
CN113643684B
Authority
CN
China
Prior art keywords
voice
target
text information
user
broadcasting
Prior art date
Legal status
Active
Application number
CN202110827082.7A
Other languages
Chinese (zh)
Other versions
CN113643684A (en)
Inventor
郑颖龙
周昉昉
叶杭
赖蔚蔚
吴广财
林嘉鑫
刘佳木
陈颖璇
朱泰鹏
黄彬系
Current Assignee
Guangdong Electric Power Information Technology Co Ltd
Original Assignee
Guangdong Electric Power Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Electric Power Information Technology Co Ltd filed Critical Guangdong Electric Power Information Technology Co Ltd
Priority to CN202110827082.7A priority Critical patent/CN113643684B/en
Publication of CN113643684A publication Critical patent/CN113643684A/en
Application granted granted Critical
Publication of CN113643684B publication Critical patent/CN113643684B/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/191 Automatic line break hyphenation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/26 Speech to text systems

Abstract

The application discloses a speech synthesis method, apparatus, electronic device, and storage medium, relating to the technical field of speech processing. The method comprises the following steps: during voice broadcasting, when a user's input speech is detected, recognizing the voice features of the input speech; determining, according to the voice features, voice parameters for the broadcast voice, the voice parameters being used to generate, for the text information to be broadcast, speech corresponding to those parameters; adding identification information to the text information to be broadcast based on a syntactic analysis of that text, to obtain target text information; and generating the target speech for broadcasting based on the voice parameters and the target text information. In this way, the corresponding voice parameters can be determined from the user's voice features, and target speech personalized for the user can be generated from those parameters, improving the user's voice-interaction experience.

Description

Speech synthesis method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of speech processing technologies, and in particular, to a speech synthesis method, apparatus, electronic device, and storage medium.
Background
With the development of artificial intelligence technology, man-machine conversation has begun to enter people's daily lives widely; common scenarios include intelligent customer service robots, smart speakers, and chat robots. The core of man-machine conversation is that, within an established system framework, the machine automatically understands and analyzes the speech input by the user according to data it has been trained on or has learned in advance, and gives a meaningful spoken reply.
However, when text information to be broadcast is synthesized into speech, the input characters are simply matched one by one against a pronunciation database, and the pronunciations of all the characters are concatenated to produce the voice to be broadcast.
Disclosure of Invention
In view of this, the present application proposes a speech synthesis method, apparatus, electronic device, and storage medium.
In a first aspect, an embodiment of the present application provides a method for synthesizing speech, where the method includes: in the voice broadcasting process, when the input voice of a user is detected, recognizing the voice characteristics of the input voice; according to the voice characteristics, determining voice parameters for broadcasting voice, wherein the voice parameters are used for generating voice corresponding to the voice parameters for text information to be broadcasted; based on the grammar analysis of the text information to be broadcasted, adding the identification information into the text information to be broadcasted to obtain target text information; and generating target voice for broadcasting based on the voice parameters and the target text information.
In a second aspect, embodiments of the present application provide a speech synthesis apparatus, the apparatus including: the system comprises a voice analysis module, a parameter determination module, an information addition module and a voice generation module. The voice analysis module is used for identifying voice characteristics of input voice of a user when the input voice of the user is detected; the parameter determining module is used for determining a voice parameter for broadcasting voice according to the voice characteristic, and the voice parameter is used for synthesizing target voice for broadcasting aiming at text information to be broadcasted; the information adding module is used for adding the identification information into the text information to be broadcasted based on the grammar analysis of the text information to be broadcasted so as to obtain target text information; and the voice generation module is used for generating target voice for broadcasting based on the voice parameters and the target text information.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the speech synthesis method provided in the first aspect.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having program code stored therein, the program code being callable by a processor to perform the speech synthesis method provided in the first aspect.
In the solution provided by the application, during voice broadcasting, when the user's input speech is detected, the voice features of the input speech are recognized; voice parameters for the broadcast voice are determined according to the voice features, the voice parameters being used to generate, for the text information to be broadcast, speech corresponding to those parameters; identification information is added to the text information to be broadcast based on a syntactic analysis of that text, to obtain target text information; and the target speech for broadcasting is generated based on the voice parameters and the target text information. In this way, the corresponding voice parameters can be determined from the user's voice features, and target speech personalized for the user can be generated from those parameters, improving the user's voice-interaction experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 shows a flow chart of a speech synthesis method according to an embodiment of the present application.
Fig. 2 is a flow chart illustrating a speech synthesis method according to another embodiment of the present application.
Fig. 3 is a flow chart illustrating a speech synthesis method according to still another embodiment of the present application.
Fig. 4 is a flow chart illustrating a speech synthesis method according to another embodiment of the present application.
Fig. 5 is a block diagram of a speech synthesis apparatus according to an embodiment of the present application.
Fig. 6 is a block diagram of an electronic device for performing a speech synthesis method according to an embodiment of the present application.
Fig. 7 is a storage unit for storing or carrying program code for implementing a speech synthesis method according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
In the related speech synthesis technology, the input characters are merely matched one by one against a pronunciation library, and the pronunciations of all the characters are concatenated to produce the voice to be broadcast. Speech generated in this way is uniform in pitch, speed, volume, and timbre and lacks variation, so the user easily perceives that the broadcast or reply is produced automatically by a machine. The listening experience suffers, the user loses patience and turns to manual service, and the intelligent answering robot loses its fundamental purpose of saving manpower.
In view of the foregoing, the inventors propose a speech synthesis method, apparatus, electronic device, and storage medium that can, during voice broadcasting, determine voice parameters for the broadcast voice based on the voice features of a user's input speech when that input speech is detected, and generate the target speech for broadcasting based on the voice parameters and the target text information. This is described in detail below.
Referring to fig. 1, fig. 1 is a flow chart of a speech synthesis method according to an embodiment of the present disclosure. The speech synthesis method provided in the embodiment of the present application will be described in detail with reference to fig. 1. The speech synthesis method may include the steps of:
step S110: and in the voice broadcasting process, when the input voice of the user is detected, recognizing the voice characteristics of the input voice.
In this embodiment, voice broadcasting may be applied in various scenarios, for example an intelligent customer service system, an intelligent chat robot, an intelligent question-answering robot, or a telemarketing scenario, which is not limited in this embodiment. The user's input speech may be speech uttered by the user to the currently used intelligent device supporting man-machine interaction, where the intelligent device may include an intelligent robot, a smartphone, a smart wearable device (such as a smart watch or smart earphones), a tablet computer, a notebook computer, and the like, which is not limited in this embodiment.
Optionally, during man-machine voice interaction between the user and the intelligent device, the user may speak first, and the intelligent device broadcasts a corresponding reply to the user's input speech to answer what the user wants to know. For example, in an intelligent customer service system, the user inputs the speech "What time is it now?", and the intelligent device broadcasts a corresponding reply such as "It is now 9 am. Is there anything else you need?". The intelligent device may also broadcast first, for example "Do you need insurance services?", and the user inputs a reply according to the broadcast voice, for example "Yes; which types of insurance service are available?".
Based on this, the intelligent device can monitor for the user's input speech during man-machine voice interaction, i.e., during voice broadcasting, and recognize the voice features of the input speech when it is detected, so as to generate a personalized reply voice according to the user's voice features and improve the user's listening experience. The voice features may include various features of the input speech, such as timbre, pitch, volume, voiceprint features, and speech rate, which are not limited in this embodiment.
Step S120: and determining a voice parameter for broadcasting voice according to the voice characteristic, wherein the voice parameter is used for generating voice corresponding to the voice parameter for text information to be broadcasted.
In this embodiment, the voice parameters may include pitch, timbre, speech rate, and the like, which is not limited in this embodiment. Different voice features may correspond to different voice parameters for the broadcast voice, so the broadcast voice generated from those parameters also differs.
In some embodiments, voice features such as the pitch and speech rate of the input speech can be used directly as the voice parameters for the broadcast voice. Specifically, if the pitch of the input speech is low and the speech rate slow, the pitch of the broadcast voice can likewise be low and its rate slow, matching the user's speaking habits. The pitch and speech rate of the input speech can therefore be used as the pitch and speech rate in the voice parameters for the broadcast voice.
In other embodiments, the speech rate in the voice features may be obtained, the speech-rate interval in which it falls determined, and the speech rate corresponding to that interval obtained as the speech rate in the voice parameters for the broadcast voice. Mappings between different speech-rate intervals and their corresponding broadcast speech rates can be stored in advance; after the speech rate of the user's input speech is obtained and its interval determined, the broadcast speech rate can be obtained from the mapping. It will be appreciated that determining the pitch and timbre is similar to determining the speech rate; refer to the implementation process above, which is not repeated here.
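The Python sketch below illustrates the speech-rate interval lookup just described. It is a minimal sketch only: the interval boundaries, the mapped rates, and the `broadcast_speech_rate` helper are illustrative assumptions, not values or names given in the patent.
```python
# Minimal sketch of the interval-to-rate mapping; boundaries and rates
# (in characters per second) are illustrative assumptions.
from bisect import bisect_right

INTERVAL_UPPER_BOUNDS = [2.5, 4.0, 6.0]       # assumed interval boundaries
BROADCAST_RATES = [2.0, 3.5, 5.0, 5.5]        # assumed rate per interval

def broadcast_speech_rate(user_rate: float) -> float:
    """Return the broadcast speech rate for the interval containing user_rate."""
    idx = bisect_right(INTERVAL_UPPER_BOUNDS, user_rate)
    return BROADCAST_RATES[idx]

print(broadcast_speech_rate(3.2))  # falls in (2.5, 4.0] -> 3.5
```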
Step S130: and adding the identification information into the text information to be broadcasted based on the grammar analysis of the text information to be broadcasted, and obtaining target text information.
In this embodiment, the text information to be broadcast may be syntactically analyzed and identification information added to it, so that the broadcast sounds more engaging and approachable. The syntactic analysis may divide the text information to be broadcast into subject, predicate, and object, and the identification information may be added between the subject and the predicate, or between the predicate and the object. If the text information to be broadcast contains multiple clauses, identification information may be added between adjacent clauses; if it contains only one clause, identification information may be added before and after that clause, which is not limited in this embodiment. The text information to be broadcast may be reply text determined from the user's input, that is, determined according to the user's input speech. If, while the current text information is being broadcast, it is detected that the user has interrupted the broadcast more than a preset number of times within a preset time period, it is judged that the user is not interested in the content currently being broadcast; preset inquiry text is then used as the target text information, the preset inquiry text prompting the user to speak so that the system can learn what the user actually wants to know. The text information to be broadcast may also be preset broadcast text. Interruptions within the preset time period are detected by counting how many times the user speaks during the current broadcast and judging whether that count exceeds the preset number of times; the preset number of times may be fixed in advance, or set according to the number of questions contained in the current broadcast voice, i.e., speaking more times than there are questions in the current broadcast is treated as interrupting the broadcast.
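As a minimal sketch of the interruption check, the snippet below counts the user's speech events inside a sliding time window. The window length, the threshold, and the `InterruptionMonitor` name are illustrative assumptions rather than details fixed by the patent.
```python
# Sketch of the interruption check; window length and threshold are assumed.
import time

class InterruptionMonitor:
    def __init__(self, window_s: float = 30.0, preset_times: int = 2):
        self.window_s = window_s          # preset time period
        self.preset_times = preset_times  # preset number of times
        self.speech_events: list[float] = []

    def on_user_speech(self) -> None:
        """Record a moment at which the user spoke during the broadcast."""
        self.speech_events.append(time.monotonic())

    def user_lost_interest(self) -> bool:
        """True if interruptions within the window exceed the preset count."""
        now = time.monotonic()
        recent = [t for t in self.speech_events if now - t <= self.window_s]
        return len(recent) > self.preset_times
```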
In some embodiments, the identification information may be interactive text; that is, after it is added to the text information to be broadcast, the information becomes more approachable, more interesting, and more interactive. For example, if the text information to be broadcast is "It is now 3 am", with "it" as the subject, "is" as the predicate, and "3 am" as the object, the word "already" can be added between the subject and the predicate, and "it is late, please go to bed soon" can be appended to the end to prompt the user to sleep early, increasing the interactivity of the voice broadcast. Correspondingly, the resulting target text information is "It is already 3 am; it is late, please go to bed soon".
In other embodiments, the identification information may also be meaningless filler phrases. After such fillers are added to the text information to be broadcast, the generated speech contains pauses or verbal fillers such as "um", "uh", "well", and "you know", so that the user is less likely to perceive that the party speaking with them is an automatic answering robot or a voice broadcast.
Step S140: and generating target voice for broadcasting based on the voice parameters and the target text information.
Based on this, after the voice parameters and the target text information are determined, the target text information can be converted into speech according to the voice parameters by a Text To Speech (TTS) technique, yielding the target speech for broadcasting. The target speech may be generated by a parametric method based on the voice parameters; that is, the fundamental frequency, formant frequencies, and the like of the target speech are generated by parameter adjustment so that the target speech satisfies the voice parameters. It will be appreciated that the speech rate, timbre, pitch, and so on of the generated target speech match the voice parameters.
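To make the parameter-driven synthesis step concrete, here is a minimal sketch using the off-the-shelf pyttsx3 engine as a stand-in. The patent does not name any particular TTS library, and the parameter names and values below are illustrative assumptions.
```python
# Minimal sketch, assuming pyttsx3 as a stand-in TTS engine; the patent does
# not prescribe a library, and the parameter values here are illustrative.
import pyttsx3

def broadcast(target_text: str, params: dict) -> None:
    """Synthesize and play target_text using the given voice parameters."""
    engine = pyttsx3.init()
    engine.setProperty("rate", params.get("rate", 150))      # words per minute
    engine.setProperty("volume", params.get("volume", 0.8))  # 0.0 to 1.0
    engine.say(target_text)  # queue the target text for synthesis
    engine.runAndWait()      # run the synthesis loop and play the audio

broadcast("It is already 3 am; it is late, please go to bed soon.",
          {"rate": 130, "volume": 0.7})
```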
In this embodiment, corresponding voice parameters can be determined according to the user's voice features, and target speech personalized for the user can be generated from those parameters, improving the user's voice-interaction experience. Meanwhile, because identification information is added to the text information to be broadcast, the synthesized target speech includes pauses, fillers, interactive phrases, and the like, making it sound more approachable and interesting. The user thus finds it harder to perceive that the party speaking is an automatic answering robot or a voice broadcast, the automatic broadcasting process proceeds smoothly, less manual service is needed, and labor costs are saved.
Referring to fig. 2, fig. 2 is a flow chart of a speech synthesis method according to another embodiment of the present disclosure. The speech synthesis method provided in the embodiment of the present application will be described in detail below with reference to fig. 2. The speech synthesis method may include the steps of:
step S210: and in the voice broadcasting process, when the input voice of the user is detected, recognizing the voice characteristics of the input voice.
In this embodiment, the specific implementation of step S210 may refer to the content in the foregoing embodiment, which is not described herein.
Step S220: and determining user attribute information of the user according to the voice characteristics.
In this embodiment, the user attribute information of the user may be determined according to the voice features of the input speech. The user attribute information may be of several kinds, for example age, gender, region, and education level, and the voice features may likewise include several kinds, for example timbre, pitch, volume, voiceprint features, speech rate, and accent, which are not limited in this embodiment. Gender can be determined from the user's timbre and/or pitch; age can be determined from the user's pitch and/or voiceprint features; the region can be determined from the user's accent; and the education level can be determined from the user's age and region.
Step S230: and acquiring voice parameters corresponding to the user attribute information as voice parameters for broadcasting voice.
Based on this, after the user attribute information is acquired, a voice parameter corresponding to that information can further be acquired as the voice parameter for the broadcast voice. That is, users with different attribute information yield different voice parameters, different generated broadcast voices, and therefore different target speech for broadcasting; in other words, voice generation is personalized for each user during voice interaction.
In some embodiments, if the user attribute information is the user's age, the age interval containing that age is acquired as the target age interval, and the voice parameters corresponding to the target age interval are acquired as the voice parameters for the broadcast voice. The user's age can be obtained by recognizing the voiceprint features of the input speech. Voice parameters corresponding to each age interval can be stored in advance, i.e., each age interval has a mapping to its corresponding voice parameters; after the user's age is obtained, the pre-stored interval containing it is identified and taken as the target age interval, the voice parameters corresponding to the target age interval are obtained from the mapping, and these are used as the voice parameters for the broadcast voice. When presetting the age intervals and their voice parameters, note that very young and elderly users may have limited literacy and may process information more slowly; the volume in the voice parameters for the youngest and oldest intervals can therefore be increased and the speech rate reduced, ensuring that such users hear the broadcast clearly and preventing them from missing or misunderstanding its content because the speech is too fast or the volume too low.
For example, suppose one of the pre-stored age intervals is [19, 30]. If the acquired user age is 20, it falls within [19, 30], so the pre-stored voice parameters corresponding to [19, 30] are acquired as the voice parameters for the broadcast voice.
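The sketch below shows one way to realize this interval lookup. The interval boundaries and parameter values are illustrative assumptions; only the interval-based mapping mechanism comes from the description above.
```python
# Sketch of the age-interval lookup; intervals and parameters are assumed.
AGE_INTERVAL_PARAMS = [
    ((0, 18),  {"rate": 3.0, "volume": 0.9}),
    ((19, 30), {"rate": 4.5, "volume": 0.7}),
    ((31, 60), {"rate": 4.0, "volume": 0.8}),
    ((61, 80), {"rate": 3.0, "volume": 1.0}),  # slower, louder for the elderly
]

def params_for_age(age: int) -> dict:
    """Return the voice parameters of the age interval containing age."""
    for (low, high), params in AGE_INTERVAL_PARAMS:
        if low <= age <= high:
            return params
    raise ValueError(f"no pre-stored interval covers age {age}")

print(params_for_age(20))  # interval [19, 30] -> its stored parameters
```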
In other embodiments, if the user attribute information is the user's gender, the voice parameter corresponding to that gender is acquired as the voice parameter for the broadcast voice. The user's gender is judged from the voice features of the input speech; specifically, whether the user is male or female can be judged from the frequency of the input speech. Since male voices are lower-pitched than female voices, the frequency of a male voice is understandably lower than that of a female voice. The frequency of the input speech can therefore be obtained and judged to belong to a low-frequency region or a high-frequency region: if it falls in the high-frequency region, the user can be judged female; if it falls in the low-frequency region, the user can be judged male. The frequency threshold separating the low- and high-frequency regions can be obtained by statistical analysis of voice-frequency data from a large number of men and women. Since speaking styles also differ considerably between women and men, for example women generally speak more slowly and men faster, the voice parameters for the two can be set differently: the parameters for women can specify a relatively slow rate and a gentler tone, and those for men a relatively fast rate and a higher volume. Of course, the voice parameters corresponding to the different genders may be set by the user for different application scenarios, which is not limited in this embodiment.
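A minimal sketch of the frequency-based judgment follows. The 165 Hz boundary and the per-gender parameter values are illustrative assumptions; the patent says only that the boundary is derived statistically from many male and female voices.
```python
# Sketch of the low/high frequency-region judgment; values are assumed.
FREQ_BOUNDARY_HZ = 165.0  # assumed threshold between the two regions

FEMALE_PARAMS = {"rate": 3.5, "tone": "gentle"}   # slower, softer (assumed)
MALE_PARAMS   = {"rate": 4.5, "volume": 0.9}      # faster, louder (assumed)

def params_for_fundamental(f0_hz: float) -> dict:
    """Map the input speech's fundamental frequency to gendered parameters."""
    return FEMALE_PARAMS if f0_hz > FREQ_BOUNDARY_HZ else MALE_PARAMS
```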
In still other embodiments, if the user attribute is the user's region, the voice parameter corresponding to that region is acquired as the voice parameter for the broadcast voice. The region corresponding to the user's accent may be determined from the voice features, i.e., the region to which the user belongs, where the region may be a country, a province, a city, and so on, and voice parameters corresponding to different regions may be preset, which is not limited in this embodiment. Because users in different regions have different speaking habits, the local accent can be used as the voice parameter corresponding to the user's region; that is, the accent of the user's region is used in the voice parameters for the broadcast voice, so the generated broadcast voice also matches the accent of that region. The voice communication then feels more familiar to the user, who finds it harder to perceive that the party speaking is an automatic answering robot or a voice broadcast.
For example, if the user's accent is identified as a Sichuan accent, the corresponding region is Sichuan province; the user's region is therefore judged to be Sichuan province, and the Sichuan accent is used in the voice parameters for the broadcast voice.
In still other embodiments, the user attribute information may include several kinds at once. To further improve the accuracy of acquiring the corresponding voice parameters, a multidimensional mapping relation table may be established in advance between the several kinds of user attribute information and the preset voice parameters corresponding to them; after the several kinds of attribute information of a user are acquired, the voice parameters corresponding to them are determined from the table as the voice parameters for the broadcast voice. Specifically, if the attribute information includes gender, age, region, and education level at the same time, each preset voice parameter in the table corresponds to a preset gender, preset age interval, preset region, and preset education level. After the current user's gender, age, region, and education level are obtained, they are matched against these preset values: the preset gender equal to the current user's gender is taken as the target gender, the preset age interval containing the current user's age as the target age interval, the preset region matching the user's region as the target region, and the preset education level matching the user's education level as the target education level. The voice parameters corresponding to the target gender, target age interval, target region, and target education level are then acquired from the multidimensional mapping table as the voice parameters for the broadcast voice.
For example, suppose the current user's gender is female, her age is 24, her region is Sichuan province, and her education level is bachelor's degree; the preset age intervals are [0, 19], [20, 39], and [40, 80]; the preset regions cover China's 23 provinces; and the preset education levels are "bachelor's degree or above" and "below bachelor's degree". Based on the multidimensional mapping table, the target gender is determined to be female, the target age interval containing her age is [20, 39], the target region is Sichuan province, and the target education level is "bachelor's degree or above"; the voice parameters corresponding to this combination can then be obtained from the multidimensional mapping table as the voice parameters for the broadcast voice.
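The sketch below illustrates the table lookup with a small in-memory table; the keys and parameter values are illustrative assumptions rather than the patent's actual table.
```python
# Sketch of the multidimensional lookup; table contents are assumed.
MULTI_DIM_TABLE = {
    # (gender, age interval, region, education level) -> voice parameters
    ("female", (20, 39), "Sichuan", "bachelor or above"): {"rate": 4.0, "accent": "Sichuan"},
    ("male",   (20, 39), "Sichuan", "bachelor or above"): {"rate": 4.5, "accent": "Sichuan"},
}

def lookup(gender: str, age: int, region: str, education: str) -> dict:
    """Match each attribute against the table's preset keys, as described above."""
    for (g, (lo, hi), reg, edu), params in MULTI_DIM_TABLE.items():
        if g == gender and lo <= age <= hi and reg == region and edu == education:
            return params
    raise KeyError("no preset entry matches this attribute combination")

print(lookup("female", 24, "Sichuan", "bachelor or above"))
```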
Step S240: and adding the identification information into the text information to be broadcasted based on the grammar analysis of the text information to be broadcasted, and obtaining target text information.
Step S250: and generating target voice for broadcasting based on the voice parameters and the target text information.
In this embodiment, the specific implementation of step S240 to step S250 may refer to the content in the foregoing embodiment, and will not be described herein.
In this embodiment, the user attribute information can be determined from the voice features of the user's input speech, the corresponding voice parameters determined from that attribute information, and the target speech for broadcasting generated from those parameters. Target speech with different voice parameters can thus be generated for different user attribute information; that is, voice generation is personalized for each user during voice interaction, man-machine voice communication feels more familiar, and the user finds it harder to perceive that the party speaking is an automatic answering robot or a voice broadcast.
Referring to fig. 3, fig. 3 is a flow chart of a speech synthesis method according to another embodiment of the present disclosure. The speech synthesis method provided in the embodiment of the present application will be described in detail below with reference to fig. 3. The speech synthesis method may include the steps of:
step S310: and in the voice broadcasting process, when the input voice of the user is detected, recognizing the voice characteristics of the input voice.
In this embodiment, the specific implementation of step S310 may refer to the content in the foregoing embodiment, which is not described herein.
Step S320: and determining emotion information of the user according to the voice characteristics.
In this embodiment, the user emotion information may be information representing the emotion of the user, and the emotion of the user may include happiness, anger, sadness, surprise, fear, confusion, concentration, distraction, and the like, which is not limited herein.
The voice features may reflect the user's tone of voice; that is, voice analysis is performed on the input speech to obtain the user's current tone. As a specific embodiment, the input speech may be analyzed to obtain parameter information related to the speaking tone, such as volume, pitch, and speech content, and the user's tone determined from the specific values of this parameter information; the specific manner of analyzing the tone is not limited. On this basis, the user's emotion information can be obtained by further analyzing the tone. Of course, the way the emotion information is obtained from the user's tone is likewise not limited.
In some embodiments, if the emotion information of the user includes both excited emotion and calm emotion, determining whether the volume of the input voice is greater than a preset volume threshold, and if the volume is greater than the preset volume threshold, determining that the emotion information of the user is excited; if the volume is smaller than or equal to the preset volume threshold, the emotion of the user is judged to be calm. The preset volume threshold may be preset, or may be adjusted according to different application scenarios, which is not limited in this embodiment.
In other embodiments, if the emotion information of the user includes both excited emotion and calm emotion, determining whether the speech rate of the input speech is greater than a preset speech rate threshold, and if the speech rate is greater than the preset speech rate threshold, determining that the emotion information of the user is excited; if the speech speed is less than or equal to the preset speech speed threshold, determining that the emotion of the user is calm. The preset speech speed threshold may be preset, or may be adjusted according to different application scenarios, which is not limited in this embodiment.
In still other embodiments, the user's emotion may also be determined from several voice feature parameters together. Specifically, if the emotion information includes the three emotions "very excited", "excited", and "calm", it is judged whether the speech rate of the input speech is greater than a preset speech-rate threshold and whether its volume is greater than a preset volume threshold. If both the rate and the volume exceed their thresholds, the user's emotion information is judged to be very excited; if the rate exceeds its threshold but the volume does not, or the volume exceeds its threshold but the rate does not, the emotion information is judged to be excited; and if neither the volume nor the rate exceeds its threshold, the emotion information is judged to be calm.
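As an illustration of the two-threshold rule just described, the sketch below classifies the three emotions; the threshold values are assumptions, not values fixed by the patent.
```python
# Sketch of the two-threshold emotion rule; threshold values are assumed.
PRESET_RATE_THRESHOLD = 5.0    # speech rate, e.g. characters per second
PRESET_VOLUME_THRESHOLD = 0.7  # normalized volume

def emotion(rate: float, volume: float) -> str:
    """Classify emotion from speech rate and volume, as described above."""
    fast = rate > PRESET_RATE_THRESHOLD
    loud = volume > PRESET_VOLUME_THRESHOLD
    if fast and loud:
        return "very excited"
    if fast or loud:
        return "excited"
    return "calm"
```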
In still other embodiments, multiple voice characteristic parameters of the user can be input into a pre-trained emotion scoring model to obtain emotion scores; comparing the emotion score with a preset score threshold, and judging that the emotion information of the user is excited if the emotion score is larger than the preset score threshold; and if the emotion score is smaller than or equal to the preset score threshold value, determining that the emotion information of the user is calm. The preset score threshold may be preset, or may be adjusted according to different application scenarios, which is not limited in this embodiment.
Step S330: and acquiring voice parameters corresponding to the emotion information as voice parameters for broadcasting voice.
Based on this, after the user's emotion information is determined, the voice parameter corresponding to it can be acquired as the voice parameter for broadcasting. To improve the interactivity of automatic broadcasting, the voice parameters for the broadcast voice may change with the user's emotion, so that the user feels the automatic answering robot or intelligent customer service is communicating with them attentively. Several kinds of emotion information can therefore be preset, each with corresponding voice parameters. For example, if the user's emotion information is "excited", the tone in the voice parameters can be set softer, the pitch lower, and the volume smaller, so that the target speech generated from these parameters sounds gentle and calms a user who is currently agitated.
In some embodiments, when the emotion information meets a set emotion condition, first text information is acquired as the text information to be broadcast, the first text information being used to adjust the user's emotion. The set emotion condition may be a sad emotion, an excited emotion, and so on, and the corresponding first text information may differ for different set emotion conditions. Specifically, if the set emotion condition is an excited emotion, the first text information may be text that calms the user, such as "Please don't be upset; if you are not interested in this package, you could look at another one ...".
In other embodiments, because users with different attribute information may react differently to the same emotion information, the voice parameters applied to the voice broadcast may also differ; for example, men and women may react differently to the same thing, with a woman feeling very happy where a man behaves more neutrally. On this basis, after the user's emotion information is determined, the user attribute information can also be determined from the voice features, and the voice parameters corresponding to both the emotion information and the attribute information determined as the voice parameters for the voice broadcast. Specifically, the user attribute information may include the user's gender and age, and a multidimensional mapping relation table may be established in advance over gender, age, and emotion information. The table contains preset voice parameters together with the corresponding preset gender, preset age interval, and preset emotion information; the preset genders are male and female, the preset age intervals may be several intervals, for example 0-14 years, 15-55 years, and 56-80 years, and the preset emotion information may likewise include several emotions, for example sad, excited, and happy. On this basis, once the current user's age, gender, and emotion information are obtained, they are matched against the preset gender, preset age interval, and preset emotion information in the multidimensional mapping table: the preset gender equal to the current user's gender is taken as the target gender, the preset age interval containing the current user's age as the target age interval, and the preset emotion information matching the current user's emotion information as the target emotion information. The voice parameters corresponding to the target gender, target age interval, and target emotion information in the table are then used as the voice parameters for broadcasting.
Step S340: and adding the identification information into the text information to be broadcasted based on the grammar analysis of the text information to be broadcasted, and obtaining target text information.
Step S350: and generating target voice for broadcasting based on the voice parameters and the target text information.
In this embodiment, the specific implementation of step S340 to step S350 may refer to the content in the foregoing embodiment, and will not be described herein.
In this embodiment, the user's emotion information can be determined from the voice features of the input speech, the corresponding voice parameters determined from that emotion information, and the target speech for broadcasting generated from those parameters. Target speech with a different tone and speech rate can thus be generated as the user's emotion information changes; that is, voice generation is personalized for each user during voice interaction, man-machine voice communication feels more familiar, and the user finds it harder to perceive that the party speaking is an automatic answering robot or a voice broadcast.
Referring to fig. 4, fig. 4 is a flow chart of a speech synthesis method according to another embodiment of the present disclosure. The speech synthesis method provided in the embodiment of the present application will be described in detail with reference to fig. 4. The speech synthesis method may include the steps of:
Step S410: and in the voice broadcasting process, when the input voice of the user is detected, recognizing the voice characteristics of the input voice.
Step S420: and determining a voice parameter for broadcasting voice according to the voice characteristic, wherein the voice parameter is used for generating voice corresponding to the voice parameter for text information to be broadcasted.
In this embodiment, the specific implementation of step S410 to step S420 may refer to the content in the foregoing embodiment, and will not be described herein.
Step S430: and identifying clauses in the text information to be broadcasted to obtain a plurality of clauses.
Step S440: and acquiring target clauses existing in the multiple clauses, wherein the number of words in the target clauses is larger than a first threshold value.
In this embodiment, some clauses in the text information to be broadcast may contain many words, and directly converting such long clauses into speech can make the result sound stiff, hurting the user's listening experience. The clauses in the text information to be broadcast are therefore identified to obtain multiple clauses, the word count of each clause is obtained, and it is judged whether that count exceeds a first threshold; if it does, the clause is judged to be a long sentence and taken as a target clause. The first threshold may be preset (e.g. 10) or adjusted for the specific application scenario.
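A minimal sketch of the clause identification and word-count check follows; splitting on punctuation and the specific delimiter set are simplifying assumptions, and only the threshold comparison comes from the description above.
```python
# Sketch of target-clause detection; the delimiter set is assumed.
import re

FIRST_THRESHOLD = 10  # the text above mentions 10 as a possible preset value

def target_clauses(text: str) -> list[str]:
    """Return clauses whose word count exceeds the first threshold."""
    clauses = [c.strip() for c in re.split(r"[,;.!?]", text) if c.strip()]
    return [c for c in clauses if len(c.split()) > FIRST_THRESHOLD]
```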
Step S450: the target clause is divided into a plurality of clause components based on a syntactic analysis of the target clause.
Step S460: and adding the connective words between adjacent clause components to obtain target text information.
Based on this, after the target clauses with large word counts are determined, connectives can be added to them so that the converted target speech sounds more like a real person. However, connectives added arbitrarily could distort the intended meaning of the text information to be broadcast, so the target clause is syntactically analyzed and divided into clause components, which may include the subject, predicate, object, attributive, adverbial, complement, and head word. A connective can then be added between adjacent clause components to obtain the target text information. The connectives may be filler words such as "uh", "well", "that is", and "and then", without limitation.
For example, if the target clause is "When the smartphone's network is poor, the smartphone cannot recognize the speech uttered by the user", it can be parsed into several clause components: the adverbial "when the smartphone's network is poor", the subject "the smartphone", the predicate "cannot recognize", and the object "the speech uttered by the user". On this basis, a connective can be added at random between the subject, predicate, and object, or between two specified clause components; for example, adding the connective "well" only between the adverbial and the subject turns the clause into "When the smartphone's network is poor, well, the smartphone cannot recognize the speech uttered by the user", so that the target clause sounds more conversational when converted into the target speech.
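The sketch below illustrates inserting a connective between two named clause components. The `Component` representation and the assumption that an upstream syntax parser supplies the components are illustrative; the patent does not specify a data structure.
```python
# Sketch of connective insertion; component segmentation is assumed to come
# from an upstream syntax parser.
from typing import List, Tuple

Component = Tuple[str, str]  # (role, text), e.g. ("subject", "the smartphone")

def add_connective(components: List[Component], after_role: str,
                   connective: str) -> str:
    """Rebuild the clause, inserting connective after the given component."""
    parts = []
    for role, text in components:
        parts.append(text)
        if role == after_role:
            parts.append(connective)
    return " ".join(parts)

clause = [("adverbial", "when the network is poor,"),
          ("subject", "the smartphone"),
          ("predicate", "cannot recognize"),
          ("object", "the user's speech")]
print(add_connective(clause, "adverbial", "well,"))
```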
In practical applications, one of the clause components may itself contain many words. If it is converted directly into the target speech, the broadcast may run for a long time without any pause, again giving the user a stiff-sounding broadcast, leading the user to conclude that the current conversation partner is an answering robot, lose patience, and seek manual customer service.
Based on this, in some embodiments, a target clause component present in the multiple clause components may be obtained, the word count of the target clause component being greater than a second threshold, the second threshold being smaller than the first threshold; a pause identifier is then added between the target clause component and the adjacent clause component, the pause identifier being used, when the target speech is generated, to produce a pause of a specified duration between the speech corresponding to the target clause component and the speech corresponding to the adjacent component. The pause identifier may be a comma, period, semicolon, or enumeration comma, which is not limited in this embodiment, and different pause identifiers correspond to pauses of different durations in the generated speech. That is, after the clause components are obtained, the word count of each is obtained and compared with the second threshold; if it exceeds the second threshold, the component is judged to contain many words and is taken as the target clause component. A pause identifier is then added between the target clause component and the adjacent one, so that when the target speech is generated, a pause of the specified duration separates the two components' speech. The generated target speech thus comes closer to a real person's speaking habit of pausing after a long stretch of words before continuing.
For example, taking again the target clause "When the smartphone's network is poor, the smartphone cannot recognize the speech uttered by the user", the adverbial is "when the smartphone's network is poor" and the subject is "the smartphone". Because the adverbial contains many words, a pause identifier (such as a comma) can be added between the adverbial and the subject, so that a pause of the specified duration is generated between the speech corresponding to the adverbial and the speech corresponding to the subject.
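Analogously, the sketch below appends a pause mark after any clause component whose word count exceeds the second threshold. The threshold value, the mark-to-duration mapping, and the helper name are assumptions for illustration.
```python
# Sketch of pause-identifier insertion; threshold and durations are assumed.
SECOND_THRESHOLD = 6  # assumed; must be smaller than the first threshold

PAUSE_MARKS = {",": 0.3, ";": 0.5, ".": 0.7}  # mark -> pause seconds (assumed)

def add_pause_after_long_component(components: list[str],
                                   mark: str = ",") -> str:
    """Append a pause mark to any component whose word count exceeds the
    second threshold, so TTS inserts a pause before the next component."""
    out = []
    for comp in components:
        if len(comp.split()) > SECOND_THRESHOLD and not comp.endswith(mark):
            comp += mark
        out.append(comp)
    return " ".join(out)

print(add_pause_after_long_component(
    ["when the network of the smartphone is not good",
     "the smartphone cannot recognize the user's speech"]))
```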
In some embodiments, if the target text information includes multiple clauses, a specified identifier is added between every two adjacent clauses, the specified identifier being used, when the target speech is generated, to produce a breath sound between the speech corresponding to each pair of adjacent clauses. It will be appreciated that, to bring the generated target speech closer to a real person's speech, in which a breath can be heard between clauses, a specified identifier may be added between every two adjacent clauses in the target text information, so that a breath sound is generated between the corresponding stretches of speech.
Step S470: and generating target voice for broadcasting based on the voice parameters and the target text information.
In this embodiment, the specific implementation of step S470 may refer to the content in the foregoing embodiment, which is not described herein.
In some embodiments, if generation of the target speech is not completed within a specified duration, a preset voice is obtained and broadcast as the target voice. The specified duration may be preset or adjusted for the specific application scenario, which is not limited in this embodiment. In practical applications, the network of the user's intelligent device may be poor, making recognition of the input speech or generation of the target speech slow, so that the target speech is not finished within the specified duration; if nothing is broadcast at that moment, the voice chat may go cold and the user may end the current interaction. A preset voice can therefore be obtained as the target voice for broadcasting; the preset voice may be a gap-filling phrase such as "let me think", "hmm, one moment", or "just a second", easing the user's impatience while the target speech is still being generated. If synthesis of the target speech completes after the preset voice has been broadcast, the target speech can then be broadcast to continue the conversation with the user.
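A minimal sketch of this fallback follows, using a worker thread with a timeout. The filler wording, the timeout value, and the function names are assumptions; the patent only specifies broadcasting a preset voice when synthesis overruns the specified duration.
```python
# Sketch of the time-out fallback; wording and timeout value are assumed.
import concurrent.futures

PRESET_FILLER = "Hmm, one moment please..."  # preset gap-filling text (assumed)

def synthesize_or_filler(synthesize, text: str, params: dict,
                         timeout_s: float = 2.0):
    """Return the synthesized speech, or the preset filler if synthesis
    does not finish within the specified duration."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(synthesize, text, params)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return PRESET_FILLER  # broadcast the filler; synthesis keeps running
    finally:
        pool.shutdown(wait=False)
```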
In some implementations, the voice quality of the input speech may also be analyzed; when the voice quality is below a preset quality threshold, second text information is acquired, the second text information prompting the user to input speech again with a voice quality reaching the preset quality threshold, and the second text information is taken as the target text information. The voice quality of the input speech can be determined from its signal-to-noise ratio: when the signal-to-noise ratio is below a preset value, the voice quality is judged to be below the preset quality threshold, and the second text information is acquired as the target text information. When the user's voice quality is poor, prompting the user to speak louder or move away from noise and speak again prevents the user's speech from going unrecognized because of poor voice quality.
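The sketch below illustrates one way to realize the signal-to-noise check. The crude energy-ratio SNR estimate, the threshold value, and the re-prompt wording are assumptions; the patent only specifies comparing a signal-to-noise ratio against a preset value.
```python
# Sketch of the SNR-based quality check; threshold and wording are assumed.
import numpy as np

SNR_PRESET_DB = 10.0  # assumed preset value

REPROMPT_TEXT = ("Sorry, I could not hear you clearly. Please speak a little "
                 "louder or move away from the noise.")  # assumed wording

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """Estimate SNR from a speech segment and a noise-only segment."""
    p_signal = float(np.mean(signal.astype(np.float64) ** 2))
    p_noise = float(np.mean(noise.astype(np.float64) ** 2)) or 1e-12
    return 10.0 * np.log10(p_signal / p_noise)

def quality_check(signal: np.ndarray, noise: np.ndarray):
    """Return the re-prompt text when voice quality is below the threshold."""
    return REPROMPT_TEXT if snr_db(signal, noise) < SNR_PRESET_DB else None
```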
In this embodiment, when the text information to be broadcast includes multiple clauses, a specified identifier can be added between adjacent clauses so that a breath sound is generated between the speech corresponding to each pair of adjacent clauses, bringing the generated target speech closer to the rhythm of real human speech. Connectives are added within clause components that contain many words, and pause identifiers between long components and their neighbors, making the generated target speech more natural and conversational, improving the familiarity of man-machine voice communication, and making it harder for the user to perceive that the party speaking is an automatic answering robot or a voice broadcast.
Referring to fig. 5, a block diagram of a speech synthesis apparatus 500 according to another embodiment of the present application is shown. The apparatus 500 may include: a voice analysis module 510, a parameter determination module 520, an information addition module 530, and a voice generation module 540.
The voice analysis module 510 is configured to, when detecting an input voice of a user, recognize a voice feature of the input voice;
the parameter determining module 520 is configured to determine, according to the voice feature, a voice parameter for broadcasting voice, where the voice parameter is used to synthesize a target voice for broadcasting for the text information to be broadcasted;
the information adding module 530 is configured to add identification information to the text information to be broadcasted based on the syntax analysis of the text information to be broadcasted, so as to obtain target text information;
the voice generating module 540 is configured to generate a target voice for broadcasting based on the voice parameter and the target text information.
In some implementations, the parameter determination module 520 can include: an information determination unit and a parameter acquisition unit. The information determining unit may be configured to determine user attribute information of the user according to the voice feature. The parameter obtaining unit may be configured to obtain a voice parameter corresponding to the user attribute information, as a voice parameter for broadcasting voice.
In this manner, the user attribute information includes a user age, and the parameter acquisition unit may include: an interval acquisition subunit and a parameter acquisition subunit. The interval obtaining subunit may be configured to obtain, as the target age interval, an age interval in which the age of the user is located. The parameter obtaining subunit may be configured to obtain a voice parameter corresponding to the target age interval, as a voice parameter for broadcasting voice.
In other embodiments, the parameter determination module 520 may include: an emotion determining unit and a parameter acquiring unit. Wherein the emotion determining unit may be configured to determine emotion information of the user based on the speech feature. The parameter obtaining unit may be configured to obtain a voice parameter corresponding to the emotion information, as a voice parameter for broadcasting voice.
In this manner, the speech synthesis apparatus 500 may further include a first acquisition module. The first acquisition module may be specifically configured to, before the identification information is added to the text information to be broadcasted based on the grammar analysis, acquire first text information as the text information to be broadcasted when the emotion information meets a set emotion condition, where the first text information is used to adjust the emotion of the user.
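A sketch of the emotion branch described in the last two paragraphs follows; the emotion labels, the parameter values, and the soothing wording are assumptions for demonstration only.

```python
EMOTION_PARAMS = {
    "angry":     {"rate": 0.85, "pitch": 0.95, "volume": 0.9},  # calmer, softer delivery
    "impatient": {"rate": 1.10, "pitch": 1.00, "volume": 1.0},  # brisker delivery
    "neutral":   {"rate": 1.00, "pitch": 1.00, "volume": 1.0},
}
SET_EMOTION_CONDITION = {"angry", "impatient"}
FIRST_TEXT = "I am sorry for the inconvenience. Let me help you with that right away."

def params_for_emotion(emotion: str) -> dict:
    """Voice parameters corresponding to the detected emotion information."""
    return EMOTION_PARAMS.get(emotion, EMOTION_PARAMS["neutral"])

def select_text_to_broadcast(emotion: str, reply_text: str) -> str:
    """Use first text information to adjust the user's emotion when the condition is met."""
    return FIRST_TEXT if emotion in SET_EMOTION_CONDITION else reply_text
```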
In some embodiments, the identification information includes a connective, and the information adding module 530 may include: an identification unit, a target clause acquisition unit, a clause dividing unit, and an information adding unit. The identification unit may be configured to identify clauses in the text information to be broadcasted to obtain a plurality of clauses. The target clause acquisition unit may be configured to acquire a target clause existing in the plurality of clauses, where the number of words in the target clause is greater than a first threshold. The clause dividing unit may be configured to divide the target clause into a plurality of clause components based on a syntax analysis of the target clause. The information adding unit may be configured to add the connective between adjacent clause components.
In this manner, the speech synthesis apparatus 500 may further include a target component acquisition module. The target component acquisition module may be configured to, after the target clause is divided into a plurality of clause components based on the syntax analysis of the target clause, acquire a target clause component existing in the plurality of clause components, where the number of words in the target clause component is greater than a second threshold, and the second threshold is smaller than the first threshold. The information adding unit may be specifically configured to add a pause identifier between the target clause component and an adjacent clause component, where the pause identifier is used, when generating the target voice, to generate a pause voice of a specified duration between the voice corresponding to the target clause component and the voice corresponding to the adjacent clause component.
In some embodiments, the information adding module may be specifically configured to, before the target voice for broadcasting is generated based on the voice parameter and the target text information, add, if the target text information includes a plurality of clauses, a specified identifier between every two adjacent clauses of the plurality of clauses, where the specified identifier is used to generate ventilation voice between the voices corresponding to every two adjacent clauses when the target voice is generated.
In some embodiments, the speech synthesis apparatus 500 may further include a voice acquisition unit. The voice acquisition unit may be configured to acquire a preset voice as the target voice for broadcasting if generation of the target voice is not completed within the specified duration.
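A minimal sketch of this timeout fallback, using a single worker thread; the two-second budget and the `synthesize` callable are assumptions. If synthesis overruns the budget, the preset voice is returned and the late result is simply discarded.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeout

SPECIFIED_DURATION_S = 2.0  # hypothetical specified duration

_pool = ThreadPoolExecutor(max_workers=1)  # background synthesis worker

def get_broadcast_voice(synthesize, params, target_text, preset_voice: bytes) -> bytes:
    """Return the synthesized target voice, or the preset voice on timeout."""
    future = _pool.submit(synthesize, params, target_text)
    try:
        return future.result(timeout=SPECIFIED_DURATION_S)
    except FuturesTimeout:
        return preset_voice  # generation not completed within the specified duration
```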
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In several embodiments provided herein, the coupling of the modules to each other may be electrical, mechanical, or in other forms.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
In summary, in the solution provided by the embodiments of the present application, the corresponding voice parameters can be determined according to the voice features of the user, and a target voice personalized for the user can be generated based on those voice parameters, improving the user's voice interaction experience. Meanwhile, by adding identification information to the text information to be broadcasted, the synthesized target voice includes pauses, filler phrases, interactive expressions, and the like, so that the target voice sounds friendlier and more engaging. This makes it harder for the user to perceive that the party speaking is an automatic-response robot or a voice broadcast, ensures that the automatic voice broadcasting process proceeds smoothly, reduces the need for manual service, and saves labor cost.
An electronic device provided in the present application will be described below with reference to the drawings.
Referring to fig. 6, a block diagram of an electronic device 600 according to an embodiment of the present application is shown. The speech synthesis method provided by the embodiments of the present application may be executed by the electronic device 600.
The electronic device 600 in the embodiments of the present application may include one or more of the following components: a processor 601, a memory 602, and one or more application programs, where the one or more application programs may be stored in the memory 602 and configured to be executed by the one or more processors 601, and the one or more programs are configured to perform the method described in the foregoing method embodiments.
Processor 601 may include one or more processing cores. The processor 601 uses various interfaces and lines to connect the various parts of the electronic device 600, and performs the various functions of the electronic device 600 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 602 and by invoking data stored in the memory 602. Optionally, the processor 601 may be implemented in hardware using at least one of digital signal processing (Digital Signal Processing, DSP), field-programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 601 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communications. It will be appreciated that the modem may also not be integrated into the processor 601 and may instead be implemented separately by a communication chip.
The memory 602 may include random access memory (Random Access Memory, RAM) or read-only memory (Read-Only Memory, ROM). The memory 602 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 602 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing method embodiments, and the like. The data storage area may store data created by the electronic device 600 in use (such as the various correspondences described above), and so on.
In the several embodiments provided herein, the coupling or direct coupling or communication connection of the illustrated or discussed modules to each other may be through some interfaces, and the indirect coupling or communication connection of the apparatus or modules may be in electrical, mechanical, or other forms.
Referring to fig. 7, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable storage medium 700 has stored therein program code which may be invoked by a processor to perform the methods described in the foregoing method embodiments.
The computer readable storage medium 700 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 700 comprises a non-transitory computer-readable storage medium. The computer readable storage medium 700 has storage space for program code 710 that performs any of the method steps described above. The program code can be read from, or written into, one or more computer program products. The program code 710 may, for example, be compressed in a suitable form.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A method of speech synthesis, the method comprising:
in the voice broadcasting process, when the input voice of a user is detected, acquiring the voice quality of the input voice;
if the voice quality reaches a preset quality threshold, identifying voice characteristics of the input voice;
determining user attribute information of the user according to the voice characteristics, wherein the user attribute information comprises age, gender, region, and education level;
acquiring voice parameters corresponding to the user attribute information as voice parameters for broadcasting voice, wherein the voice parameters are used for generating voice corresponding to the voice parameters for text information to be broadcasted, and the text information to be broadcasted is reply text information determined based on the input voice;
based on the grammar analysis of the text information to be broadcasted, adding identification information into the text information to be broadcasted to obtain target text information, wherein the identification information is interactive text or a nonsensical phrase;
generating target voice for broadcasting based on the voice parameters and the target text information, wherein the broadcasting accent of the target voice is consistent with the speaking accent corresponding to the region to which the user belongs;
detecting the number of times the user's voice interrupts the broadcast within a preset time period during broadcasting of the target voice;
if the number of times is greater than a preset number of times, acquiring preset inquiry text information, and generating voice for broadcasting based on the preset inquiry text information, wherein the preset inquiry text information is used for prompting the user to input voice;
and if the voice quality is lower than the preset quality threshold, acquiring second text information, and generating target voice for broadcasting based on the second text information, wherein the second text information is used for prompting the user to increase the volume or move away from noise and input voice again.
2. The method of claim 1, wherein the user attribute information comprises a user age, and the acquiring voice parameters corresponding to the user attribute information comprises:
acquiring an age interval in which the user age is located as a target age interval;
and acquiring the voice parameters corresponding to the target age interval as voice parameters for broadcasting voice.
3. The method of claim 1, wherein determining the voice parameters for broadcasting the voice based on the voice characteristics comprises:
determining emotion information of the user according to the voice characteristics;
and acquiring voice parameters corresponding to the emotion information as voice parameters for broadcasting voice.
4. The method according to claim 3, wherein before the adding identification information into the text information to be broadcasted based on the grammar analysis of the text information to be broadcasted, the method further comprises:
when the emotion information meets the set emotion condition, acquiring first text information as text information to be broadcasted, wherein the first text information is used for adjusting the emotion of the user.
5. The method according to claim 1, wherein the identification information includes a connective word, and the adding the identification information to the text information to be broadcasted based on the syntax analysis of the text information to be broadcasted, to obtain the target text information, includes:
identifying clauses in the text information to be broadcasted to obtain a plurality of clauses;
acquiring target clauses existing in the multiple clauses, wherein the number of words in the target clauses is larger than a first threshold value;
dividing the target clause into a plurality of clause components based on a syntactic analysis of the target clause;
and adding the connective between adjacent clause components.
6. The method of claim 5, wherein after the dividing the target clause into a plurality of clause components based on the parsing of the target clause, the method further comprises:
acquiring a target clause component existing in the plurality of clause components, wherein the number of words in the target clause component is larger than a second threshold value, and the second threshold value is smaller than the first threshold value;
and adding a pause identifier between the target clause component and the adjacent clause component, wherein the pause identifier is used for generating pause voice with specified duration between the voice corresponding to the target clause component and the voice corresponding to the adjacent clause component when generating the target voice.
7. The method of any of claims 1-6, wherein before the generating target voice for broadcasting based on the voice parameters and the target text information, the method further comprises:
if the target text information comprises a plurality of clauses, adding a specified identifier between every two adjacent clauses in the plurality of clauses, wherein the specified identifier is used for generating ventilation voice between voices corresponding to every two adjacent clauses when generating the target voice.
8. The method of any of claims 1-6, wherein before the generating target voice for broadcasting based on the voice parameters and the target text information, the method further comprises:
if the generation of the target voice is not completed within the appointed duration, acquiring a preset voice as the target voice for broadcasting.
9. A speech synthesis apparatus, the apparatus comprising:
the voice analysis module is used for acquiring the voice quality of the input voice when the input voice of the user is detected;
if the voice quality reaches a preset quality threshold, identifying voice characteristics of the input voice;
the parameter determining module is used for determining user attribute information of the user according to the voice characteristics, wherein the user attribute information comprises age, gender, region, and education level; and acquiring voice parameters corresponding to the user attribute information as voice parameters for broadcasting voice, wherein the voice parameters are used for synthesizing target voice for broadcasting for the text information to be broadcasted, and the text information to be broadcasted is reply text information determined based on the input voice;
the information adding module is used for adding identification information into the text information to be broadcasted based on the grammar analysis of the text information to be broadcasted to obtain target text information, wherein the identification information is interactive text or a nonsensical phrase;
the voice generation module is used for generating target voice for broadcasting based on the voice parameters and the target text information, wherein the broadcasting accent of the target voice is consistent with the speaking accent corresponding to the region to which the user belongs; detecting the number of times the user's voice interrupts the broadcast within a preset time period during broadcasting of the target voice; if the number of times is greater than a preset number of times, acquiring preset inquiry text information, and generating voice for broadcasting based on the preset inquiry text information, wherein the preset inquiry text information is used for prompting the user to input voice; and if the voice quality is lower than the preset quality threshold, acquiring second text information, and generating target voice for broadcasting based on the second text information, wherein the second text information is used for prompting the user to increase the volume or move away from noise and input voice again.
10. An electronic device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-8.
11. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, which is callable by a processor for performing the method according to any one of claims 1-8.
CN202110827082.7A 2021-07-21 2021-07-21 Speech synthesis method, device, electronic equipment and storage medium Active CN113643684B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110827082.7A CN113643684B (en) 2021-07-21 2021-07-21 Speech synthesis method, device, electronic equipment and storage medium


Publications (2)

Publication Number Publication Date
CN113643684A CN113643684A (en) 2021-11-12
CN113643684B true CN113643684B (en) 2024-02-27

Family

ID=78417982

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110827082.7A Active CN113643684B (en) 2021-07-21 2021-07-21 Speech synthesis method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113643684B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11735158B1 (en) * 2021-08-11 2023-08-22 Electronic Arts Inc. Voice aging using machine learning
CN114708869A (en) * 2022-03-29 2022-07-05 青岛海尔空调器有限总公司 Voice interaction method and device and electric appliance

Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2003108581A (en) * 2001-09-27 2003-04-11 Mitsubishi Electric Corp Interactive information retrieving device and interactive information retrieving method
JP2007072331A (en) * 2005-09-09 2007-03-22 Matsushita Electric Ind Co Ltd Voice interactive method and voice interaction system
WO2009107441A1 (en) * 2008-02-27 2009-09-03 日本電気株式会社 Speech synthesizer, text generator, and method and program therefor
JP2010107614A (en) * 2008-10-29 2010-05-13 Mitsubishi Motors Corp Voice guidance and response method
JP6170604B1 (en) * 2016-09-20 2017-07-26 株式会社カプコン Speech generator
CN107507620A (en) * 2017-09-25 2017-12-22 广东小天才科技有限公司 A kind of voice broadcast sound method to set up, device, mobile terminal and storage medium
CN107516509A (en) * 2017-08-29 2017-12-26 苏州奇梦者网络科技有限公司 Voice base construction method and system for news report phonetic synthesis
CN109325097A (en) * 2018-07-13 2019-02-12 海信集团有限公司 A kind of voice guide method and device, electronic equipment, storage medium
CN109451188A (en) * 2018-11-29 2019-03-08 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of the self-service response of otherness
CN109698858A (en) * 2018-11-01 2019-04-30 百度在线网络技术(北京)有限公司 Resource supplying method, equipment and storage medium for smart machine
CN110069608A (en) * 2018-07-24 2019-07-30 百度在线网络技术(北京)有限公司 A kind of method, apparatus of interactive voice, equipment and computer storage medium
CN110189754A (en) * 2019-05-29 2019-08-30 腾讯科技(深圳)有限公司 Voice interactive method, device, electronic equipment and storage medium
CN110264992A (en) * 2019-06-11 2019-09-20 百度在线网络技术(北京)有限公司 Speech synthesis processing method, device, equipment and storage medium
CN110797006A (en) * 2020-01-06 2020-02-14 北京海天瑞声科技股份有限公司 End-to-end speech synthesis method, device and storage medium
CN111857331A (en) * 2020-06-16 2020-10-30 北京嘀嘀无限科技发展有限公司 Method, system, device and storage medium for determining user scene
CN112151064A (en) * 2020-09-25 2020-12-29 北京捷通华声科技股份有限公司 Voice broadcast method, device, computer readable storage medium and processor
CN112700775A (en) * 2020-12-29 2021-04-23 维沃移动通信有限公司 Method and device for updating voice receiving period and electronic equipment
CN112771607A (en) * 2018-11-14 2021-05-07 三星电子株式会社 Electronic device and control method thereof
CN112825248A (en) * 2019-11-19 2021-05-21 阿里巴巴集团控股有限公司 Voice processing method, model training method, interface display method and equipment
CN113035181A (en) * 2019-12-09 2021-06-25 斑马智行网络(香港)有限公司 Voice data processing method, device and system
CN113053388A (en) * 2021-03-09 2021-06-29 北京百度网讯科技有限公司 Voice interaction method, device, equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of an Android-based In-vehicle Voice Assistant; Peng Yongchao; China Excellent Master's Theses Full-text Database, Information Science and Technology; 2020-01-15; full text *

Also Published As

Publication number Publication date
CN113643684A (en) 2021-11-12

Similar Documents

Publication Publication Date Title
CN108536802B (en) Interaction method and device based on child emotion
JP6755304B2 (en) Information processing device
WO2017206256A1 (en) Method for automatically adjusting speaking speed and terminal
CN111508474B (en) Voice interruption method, electronic equipment and storage device
CN105991847B (en) Call method and electronic equipment
CN111106995B (en) Message display method, device, terminal and computer readable storage medium
US10468052B2 (en) Method and device for providing information
CN113643684B (en) Speech synthesis method, device, electronic equipment and storage medium
CN107844470B (en) Voice data processing method and equipment thereof
CN111199732B (en) Emotion-based voice interaction method, storage medium and terminal equipment
CN109543021B (en) Intelligent robot-oriented story data processing method and system
US20190371319A1 (en) Method for human-machine interaction, electronic device, and computer-readable storage medium
CN109377979B (en) Method and system for updating welcome language
CN112632242A (en) Intelligent conversation method and device and electronic equipment
CN116821290A (en) Multitasking dialogue-oriented large language model training method and interaction method
CN109065019B (en) Intelligent robot-oriented story data processing method and system
CN110956958A (en) Searching method, searching device, terminal equipment and storage medium
CN114708869A (en) Voice interaction method and device and electric appliance
CN114283820A (en) Multi-character voice interaction method, electronic equipment and storage medium
CN107886940B (en) Voice translation processing method and device
CN110781327B (en) Image searching method and device, terminal equipment and storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN115565518B (en) Method for processing player dubbing in interactive game and related device
KR20210123545A (en) Method and apparatus for conversation service based on user feedback
CN116403583A (en) Voice data processing method and device, nonvolatile storage medium and vehicle

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant